Re: Log file location in Spark on K8s

2023-10-09 Thread Prashant Sharma
Hi Sanket,

Driver and executor logs are written to stdout by default. This can be
configured via the SPARK_HOME/conf/log4j.properties file. That file, along
with the entire SPARK_HOME/conf directory, is automatically propagated to all
driver and executor containers and mounted as a volume.
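
If you want fluent-bit to tail a file rather than the container stdout, a file
appender can be added there. A minimal log4j 1.x properties sketch, assuming
the classic log4j.properties format (newer Spark releases read
log4j2.properties instead) and a purely illustrative path /var/log/spark:

# Keep Spark's default console logging (goes to the container stdout/stderr)
log4j.rootCategory=INFO, console, file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Additional rolling file appender that a log collector could tail
# (the directory must exist and be writable inside the image)
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/spark/spark.log
log4j.appender.file.MaxFileSize=100MB
log4j.appender.file.MaxBackupIndex=5
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n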

Thanks

On Mon, 9 Oct, 2023, 5:37 pm Agrawal, Sanket,
 wrote:

> Hi All,
>
>
>
> We are trying to send the spark logs using fluent-bit. We validated that
> fluent-bit is able to move logs of all other pods except the
> driver/executor pods.
>
>
>
> It would be great if someone could guide us on where to look for Spark
> logs in Spark on Kubernetes with client/cluster mode deployment.
>
>
>
> Thanks,
> Sanket A.
>


Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Danilo Sousa
Unsubscribe

> On 9 Oct 2023, at 07:03, Mich Talebzadeh wrote:
> 
> Hi,
> 
> Please see my responses below:
> 
> 1) In Spark Structured Streaming does commit mean streaming data has been 
> delivered to the sink like Snowflake?
> 
> No, a commit does not refer to data being delivered to a sink like Snowflake 
> or BigQuery. The term commit refers to Spark Structured Streaming (SSS) 
> internals. Specifically, it means that a micro-batch of data has been 
> processed by SSS. In the checkpoint directory there is a subdirectory called 
> commits that marks the micro-batch as completed.
> 
> 2) if sinks like Snowflake  cannot absorb or digest streaming data in a 
> timely manner, will there be an impact on spark streaming itself?
> 
> Yes, it can potentially impact SSS. If the sink cannot absorb data in a 
> timely manner, the batches will start to back up in SSS. This can cause Spark 
> to run out of memory and the streaming job to fail. As I understand, Spark 
> will use a combination of memory and disk storage (checkpointing). This can 
> also happen if the network interface between Spark and the sink is disrupted. 
> On the other hand Spark may slow down, as it tries to process the backed-up 
> batches of data. You want to avoid these scenarios.
> 
> HTH
> 
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
> 
>view my Linkedin profile 
> 
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Sun, 8 Oct 2023 at 19:50, ashok34...@yahoo.com.INVALID 
>  wrote:
>> Hello team
>> 
>> 1) In Spark Structured Streaming does commit mean streaming data has been 
>> delivered to the sink like Snowflake?
>> 
>> 2) if sinks like Snowflake  cannot absorb or digest streaming data in a 
>> timely manner, will there be an impact on spark streaming itself?
>> 
>> Thanks
>> 
>> AK



Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Mich Talebzadeh
Your mileage varies. Often there is already a flavour of cloud data warehouse
(CDW) in place, such as BigQuery, Redshift or Snowflake. They can all do a
good job, to varying degrees:

   - Use efficient data types. Choose data types that are efficient for
   Spark to process. For example, use integer data types for columns that
   store integer values.
   - Avoid using complex data types. Complex data types, such as nested
   structs and arrays, can be less efficient for Spark to process.
   - Opt for columnar storage format like Parquet for your sink table.
   Columnar storage is highly efficient for analytical workloads as it allows
   for column-level compression and predicate pushdown.
   - These CDWs come with partitioning options. Date or time columns are
   popular partitioning keys. Partitioning reduces the amount of data scanned
   during queries.
   - Some of these CDWs come with native streaming capabilities, such as
   BigQuery streaming ingestion; I believe Snowflake has a Snowpipe Streaming
   API as well (I don't know much about it). These options enable real-time
   data ingestion and processing, with no need for manual batch processing.
   - You can batch incoming data for efficient processing, which can improve
   performance and simplify data handling (see the sketch after this list).
   Start by configuring your Spark Structured Streaming query with an
   appropriate batch interval. The batch interval defines how often Spark will
   process a batch of data. Choose a batch interval that balances latency and
   throughput based on the application's needs. Spark can process batches of
   data more efficiently than it can process individual records.
   - Snowflake says it is serverless, and so is BigQuery. They are designed to
   provide uniform performance regardless of workload. Serverless CDWs can
   efficiently handle both batch and streaming workloads without the need for
   manual resource provisioning.
   - Use materialized views to pre-compute query results, which can improve
   the performance of frequently executed queries. This has been around since
   classic RDBMSs.
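
As referenced in the batching point above, here is a minimal PySpark sketch of
such a batched write. It is only a sketch under assumptions: the Kafka source,
the spark-bigquery connector, the topic/table names, paths and the 30-second
trigger are all illustrative placeholders, not a recommendation.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdw_sink_sketch").getOrCreate()

# Illustrative streaming source; replace with whatever feeds your pipeline
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "events")                     # hypothetical topic
          .load()
          .selectExpr("CAST(value AS STRING) AS payload"))

def write_to_cdw(batch_df, batch_id):
    # Each micro-batch arrives as an ordinary DataFrame, so any batch
    # connector (BigQuery, Snowflake, JDBC, ...) can be used to load it.
    (batch_df.write
     .format("bigquery")                 # assumes the spark-bigquery connector is on the classpath
     .option("table", "dataset.events")  # hypothetical target table
     .mode("append")
     .save())

query = (events.writeStream
         .foreachBatch(write_to_cdw)
         .option("checkpointLocation", "gs://bucket/checkpoints/events")  # hypothetical path
         .trigger(processingTime="30 seconds")  # balances latency vs throughput
         .start())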

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 9 Oct 2023 at 17:50, ashok34...@yahoo.com 
wrote:

> Thank you for your feedback Mich.
>
> In general, how can one optimise the cloud data warehouses (the sink part)
> to handle streaming Spark data efficiently, avoiding the bottlenecks
> discussed?
>
>
> AK
> On Monday, 9 October 2023 at 11:04:41 BST, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Hi,
>
> Please see my responses below:
>
> 1) In Spark Structured Streaming does commit mean streaming data has been
> delivered to the sink like Snowflake?
>
> No, a commit does not refer to data being delivered to a sink like
> Snowflake or BigQuery. The term commit refers to Spark Structured Streaming
> (SSS) internals. Specifically, it means that a micro-batch of data has been
> processed by SSS. In the checkpoint directory there is a
> subdirectory called commits that marks the micro-batch as completed.
>
> 2) if sinks like Snowflake  cannot absorb or digest streaming data in a
> timely manner, will there be an impact on spark streaming itself?
>
> Yes, it can potentially impact SSS. If the sink cannot absorb data in a
> timely manner, the batches will start to back up in SSS. This can cause
> Spark to run out of memory and the streaming job to fail. As I understand,
> Spark will use a combination of memory and disk storage (checkpointing).
> This can also happen if the network interface between Spark and the sink is
> disrupted. On the other hand Spark may slow down, as it tries to process
> the backed-up batches of data. You want to avoid these scenarios.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 8 Oct 2023 at 19:50, ashok34...@yahoo.com.INVALID
>  wrote:
>
> Hello team
>
> 1) In Spark Structured Streaming does commit mean streaming data has

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread ashok34...@yahoo.com.INVALID
Thank you for your feedback Mich.

In general, how can one optimise the cloud data warehouses (the sink part) to
handle streaming Spark data efficiently, avoiding the bottlenecks discussed?

AK

On Monday, 9 October 2023 at 11:04:41 BST, Mich Talebzadeh wrote:
 
Hi,

Please see my responses below:

1) In Spark Structured Streaming does commit mean streaming data has been
delivered to the sink like Snowflake?

No, a commit does not refer to data being delivered to a sink like Snowflake or
BigQuery. The term commit refers to Spark Structured Streaming (SSS) internals.
Specifically, it means that a micro-batch of data has been processed by SSS. In
the checkpoint directory there is a subdirectory called commits that marks the
micro-batch as completed.
2) if sinks like Snowflake cannot absorb or digest streaming data in a timely
manner, will there be an impact on spark streaming itself?

Yes, it can potentially impact SSS. If the sink cannot absorb data in a timely
manner, the batches will start to back up in SSS. This can cause Spark to run
out of memory and the streaming job to fail. As I understand it, Spark will use
a combination of memory and disk storage (checkpointing). This can also happen
if the network interface between Spark and the sink is disrupted. On the other
hand, Spark may slow down as it tries to process the backed-up batches of data.
You want to avoid these scenarios.
HTH
Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom



   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
damage or destruction of data or any other property which may arise from
relying on this email's technical content is explicitly disclaimed. The author
will in no case be liable for any monetary damages arising from such loss,
damage or destruction.

 


On Sun, 8 Oct 2023 at 19:50, ashok34...@yahoo.com.INVALID 
 wrote:

Hello team

1) In Spark Structured Streaming does commit mean streaming data has been
delivered to the sink like Snowflake?

2) if sinks like Snowflake cannot absorb or digest streaming data in a timely
manner, will there be an impact on spark streaming itself?

Thanks

AK
  

Re: Updating delta file column data

2023-10-09 Thread Mich Talebzadeh
In a nutshell, is this what you are trying to do?


   1. Read the Delta table into a Spark DataFrame.
   2. Explode the string column into a struct column.
   3. Convert the hexadecimal field to an integer.
   4. Write the DataFrame back to the Delta table in merge mode with a
   unique key.

Is this a fair assessment? If so, a rough sketch of those steps is shown below.
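
A minimal PySpark sketch of those four steps, assuming an existing spark
session (e.g. a Databricks notebook) and purely hypothetical names: a Delta
table my_delta_table with a unique key column id, a string column data holding
the JSON, and a hexadecimal field called version inside it.

from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# 1. Read the Delta table (all names here are hypothetical)
df = spark.read.table("my_delta_table")

# 2. Parse the JSON string column into a struct
json_schema = StructType([StructField("version", StringType())])
parsed = df.withColumn("parsed", F.from_json("data", json_schema))

# 3. Convert the hexadecimal version (e.g. "0xabcd1234") to an integer and
#    re-serialise the struct back into the string column
updated = (parsed
           .withColumn("version_int",
                       F.conv(F.regexp_replace(F.col("parsed.version"), "^0x", ""), 16, 10)
                        .cast("long"))
           .withColumn("data", F.to_json(F.struct(F.col("version_int").alias("version"))))
           .select("id", "data"))

# 4. Merge the updated rows back into the Delta table on the unique key
target = DeltaTable.forName(spark, "my_delta_table")
(target.alias("t")
 .merge(updated.alias("u"), "t.id = u.id")
 .whenMatchedUpdate(set={"data": "u.data"})
 .execute())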

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 9 Oct 2023 at 14:46, Karthick Nk  wrote:

> Hi All,
>
> I have mentioned the sample data below and the operation I need to
> perform on it.
>
> I have Delta tables with a column of string data type that contains struct
> (JSON) data.
>
> So, I need to update one key's value in the struct data held in that string
> column of the Delta table.
>
> Note: I am able to explode the string column into the struct field and
> into the individual fields by using the following operation in Spark:
>
> df_new = spark.read.json(df.rdd.map(lambda x: '{"data": ' + x.data + '}'))
>
> Could you suggest a possible way to perform the required action in an
> optimal way?
>
> Note: Please feel free to ask, if you need further information.
>
> Thanks & regards,
> Karthick
>
> On Mon, Oct 2, 2023 at 10:53 PM Karthick Nk  wrote:
>
>> Hi community members,
>>
>> In Databricks ADLS Gen2 Delta tables, I need to perform the operation below;
>> could you help me with your thoughts?
>>
>> I have Delta tables with one column of data type string, which contains
>> JSON data as a string. I need to do the following:
>> 1. Update one particular field value in the JSON and write it
>> back into the same column of the data.
>>
>> Example :
>>
>> In the string column, inside the JSON, I have one field whose value is in
>> hexadecimal, like {"version": "0xabcd1234"}.
>>
>> I have to convert this field into the corresponding integer value and update
>> it back into the same string column's JSON value.
>> Note: I have to perform this operation within this column. This column is
>> of data type string in the Delta table.
>>
>> Could you suggest a sample example?
>>
>> Thanks in advance.
>>
>


Log file location in Spark on K8s

2023-10-09 Thread Agrawal, Sanket
Hi All,

We are trying to send the spark logs using fluent-bit. We validated that 
fluent-bit is able to move logs of all other pods except the driver/executor 
pods.

It would be great if someone could guide us on where to look for Spark logs in
Spark on Kubernetes with client/cluster mode deployment.

Thanks,
Sanket A.

This message (including any attachments) contains confidential information 
intended for a specific individual and purpose, and is protected by law. If you 
are not the intended recipient, you should delete this message and any 
disclosure, copying, or distribution of this message, or the taking of any 
action based on it, by you is strictly prohibited.

Deloitte refers to a Deloitte member firm, one of its related entities, or 
Deloitte Touche Tohmatsu Limited ("DTTL"). Each Deloitte member firm is a 
separate legal entity and a member of DTTL. DTTL does not provide services to 
clients. Please see www.deloitte.com/about to learn more.

v.E.1


Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Mich Talebzadeh
Hi,

Please see my responses below:

1) In Spark Structured Streaming does commit mean streaming data has been
delivered to the sink like Snowflake?

No, a commit does not refer to data being delivered to a sink like
Snowflake or BigQuery. The term commit refers to Spark Structured Streaming
(SSS) internals. Specifically, it means that a micro-batch of data has been
processed by SSS. In the checkpoint directory there is a
subdirectory called commits that marks the micro-batch as completed.
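
For illustration, a minimal PySpark sketch showing where that commits
subdirectory comes from; the rate source, parquet sink and /tmp paths are
purely hypothetical stand-ins:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint_demo").getOrCreate()

# A trivial streaming source, just to generate micro-batches
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (stream.writeStream
         .format("parquet")                                      # stand-in for the real sink
         .option("path", "/tmp/sink_output")                     # hypothetical output path
         .option("checkpointLocation", "/tmp/checkpoints/demo")  # commit markers live here
         .start())

# After a few micro-batches, /tmp/checkpoints/demo/commits/ contains one file
# per completed micro-batch (0, 1, 2, ...). A commit therefore records that
# SSS finished processing that batch, not that the sink has fully absorbed it.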

2) if sinks like Snowflake  cannot absorb or digest streaming data in a
timely manner, will there be an impact on spark streaming itself?

Yes, it can potentially impact SSS. If the sink cannot absorb data in a
timely manner, the batches will start to back up in SSS. This can cause
Spark to run out of memory and the streaming job to fail. As I understand it,
Spark will use a combination of memory and disk storage (checkpointing).
This can also happen if the network interface between Spark and the sink is
disrupted. On the other hand, Spark may slow down as it tries to process
the backed-up batches of data. You want to avoid these scenarios.

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 8 Oct 2023 at 19:50, ashok34...@yahoo.com.INVALID
 wrote:

> Hello team
>
> 1) In Spark Structured Streaming does commit mean streaming data has been
> delivered to the sink like Snowflake?
>
> 2) if sinks like Snowflake  cannot absorb or digest streaming data in a
> timely manner, will there be an impact on spark streaming itself?
>
> Thanks
>
> AK
>