Re: Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Mich Talebzadeh
In Spark Structured Streaming we cannot perform repartition() without
stopping the streaming process.

Admittedly, it is not a parameter that I have played around with. I
still think the Spark UI should provide some insight.

On Tue, 14 Mar 2023 at 16:42, Sean Owen  wrote:

> That's incorrect, it's spark.default.parallelism, but as the name
> suggests, that is merely a default. You control partitioning directly with
> .repartition()
>
> On Tue, Mar 14, 2023 at 11:37 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Check this link
>>
>>
>> https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/
>>
>> You can set it
>>
>> spark.conf.set("sparkDefaultParallelism", value)
>>
>>
>> Have a look at Streaming statistics in the Spark UI, especially *Processing
>> Time*, defined as the time taken to process all jobs of a batch.
>> *Scheduling Delay* and *Total Delay* are additional indicators of health.
>>
>>
>> Then decide how to set the value.
>>
>>
>> HTH
>>
>>
>> On Tue, 14 Mar 2023 at 16:04, Emmanouil Kritharakis <
>> kritharakismano...@gmail.com> wrote:
>>
>>> Yes, I need to check the performance of my streaming job in terms of
>>> latency and throughput. Is there any working example of how to increase the
>>> parallelism with Spark Structured Streaming using Dataset data structures?
>>> Thanks in advance.
>>>
>>> Kind regards,
>>>
>>> --
>>>
>>> Emmanouil (Manos) Kritharakis
>>>
>>> Ph.D. candidate in the Department of Computer Science
>>> 
>>>
>>> Boston University
>>>
>>>
>>> On Tue, Mar 14, 2023 at 12:01 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 What benefits are you expecting from increasing parallelism? Better
 throughput?



 On Tue, 14 Mar 2023 at 15:58, Emmanouil Kritharakis <
 kritharakismano...@gmail.com> wrote:

> Hello,
>
> I hope this email finds you well!
>
> I have a simple dataflow in which I read from a kafka topic, perform a
> map transformation and then I write the result to another topic. Based on
> your documentation here
> ,
> I need to work with Dataset data structures. Even though my solution 
> works,
> I need to increase the parallelism. The spark documentation includes a lot
> of parameters that I can change based on specific data structures like
> *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The
> former is the default number of partitions in RDDs returned by
> transformations like join and reduceByKey, while the latter is not recommended
> for structured streaming, as described in the documentation: "Note: For
> structured streaming, this configuration cannot be changed between query
> restarts from the same checkpoint location".
>
> So my question is how can I increase the parallelism for a simple
> dataflow based on datasets with a map transformation only?
>
> I am looking forward to hearing from you as soon as possible. Thanks
> in advance!
>
> Kind regards,
>
> --
>
> Emmanouil (Manos) Kritharakis
>
> 

Re: Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Sean Owen
That's incorrect, it's spark.default.parallelism, but as the name suggests,
that is merely a default. You control partitioning directly with
.repartition()
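
For illustration, a minimal sketch of that approach on the Kafka-to-Kafka
dataflow discussed in this thread (the broker address, topic names, the
value 40 and the upper-casing transform are placeholder assumptions, not
taken from the original messages):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RepartitionSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("repartition-sketch")
        .getOrCreate();

    // Read the source topic (broker and topic names are placeholders).
    Dataset<Row> in = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "input-topic")
        .load();

    // repartition(40) makes the downstream map stage run as 40 tasks;
    // a sensible value depends on the cores available to the job.
    Dataset<Row> out = in
        .repartition(40)
        .selectExpr("CAST(key AS STRING) AS key",
                    "upper(CAST(value AS STRING)) AS value");

    // Kafka sink; Structured Streaming requires a checkpoint location.
    out.writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "output-topic")
        .option("checkpointLocation", "/tmp/checkpoints/repartition-sketch")
        .start()
        .awaitTermination();
  }
}

Without the repartition() call, the source stage's parallelism is bounded by
the Kafka topic's partition count, so adding partitions to the topic is the
other common lever.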

On Tue, Mar 14, 2023 at 11:37 AM Mich Talebzadeh 
wrote:

> Check this link
>
>
> https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/
>
> You can set it
>
> spark.conf.set("sparkDefaultParallelism", value)
>
>
> Have a look at Streaming statistics in the Spark UI, especially *Processing
> Time*, defined as the time taken to process all jobs of a batch.
> *Scheduling Delay* and *Total Delay* are additional indicators of health.
>
>
> Then decide how to set the value.
>
>
> HTH
>
>
> On Tue, 14 Mar 2023 at 16:04, Emmanouil Kritharakis <
> kritharakismano...@gmail.com> wrote:
>
>> Yes, I need to check the performance of my streaming job in terms of
>> latency and throughput. Is there any working example of how to increase the
>> parallelism with Spark Structured Streaming using Dataset data structures?
>> Thanks in advance.
>>
>> Kind regards,
>>
>> --
>>
>> Emmanouil (Manos) Kritharakis
>>
>> Ph.D. candidate in the Department of Computer Science
>> 
>>
>> Boston University
>>
>>
>> On Tue, Mar 14, 2023 at 12:01 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> What benefits are you expecting from increasing parallelism? Better throughput?
>>>
>>>
>>>
>>> On Tue, 14 Mar 2023 at 15:58, Emmanouil Kritharakis <
>>> kritharakismano...@gmail.com> wrote:
>>>
 Hello,

 I hope this email finds you well!

 I have a simple dataflow in which I read from a kafka topic, perform a
 map transformation and then I write the result to another topic. Based on
 your documentation here
 ,
 I need to work with Dataset data structures. Even though my solution works,
 I need to increase the parallelism. The spark documentation includes a lot
 of parameters that I can change based on specific data structures like
 *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The
 former is the default number of partitions in RDDs returned by
 transformations like join and reduceByKey, while the latter is not recommended
 for structured streaming, as described in the documentation: "Note: For
 structured streaming, this configuration cannot be changed between query
 restarts from the same checkpoint location".

 So my question is how can I increase the parallelism for a simple
 dataflow based on datasets with a map transformation only?

 I am looking forward to hearing from you as soon as possible. Thanks in
 advance!

 Kind regards,

 --

 Emmanouil (Manos) Kritharakis

 Ph.D. candidate in the Department of Computer Science
 

 Boston University

>>>


Re: Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Mich Talebzadeh
Check this link

https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/

You can set it

spark.conf.set("sparkDefaultParallelism", value)


Have a look at Streaming statistics in the Spark UI, especially *Processing
Time*, defined as the time taken to process all jobs of a batch.
*Scheduling Delay* and *Total Delay* are additional indicators of health.


Then decide how to set the value.
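
For reference, a minimal sketch of setting it when the session is built
(note Sean's correction elsewhere in this digest that the property name is
spark.default.parallelism; the value 200 is an arbitrary placeholder, and
the property is generally read at context start-up rather than changed
mid-query):

import org.apache.spark.sql.SparkSession;

public class DefaultParallelismSketch {
  public static void main(String[] args) {
    // Placeholder value; set at build time because spark.default.parallelism
    // is read when the context starts, not on a running session.
    SparkSession spark = SparkSession.builder()
        .appName("default-parallelism-sketch")
        .config("spark.default.parallelism", "200")
        .getOrCreate();

    System.out.println(spark.sparkContext().defaultParallelism());
  }
}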


HTH


On Tue, 14 Mar 2023 at 16:04, Emmanouil Kritharakis <
kritharakismano...@gmail.com> wrote:

> Yes, I need to check the performance of my streaming job in terms of
> latency and throughput. Is there any working example of how to increase the
> parallelism with Spark Structured Streaming using Dataset data structures?
> Thanks in advance.
>
> Kind regards,
>
> --
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> 
>
> Boston University
>
>
> On Tue, Mar 14, 2023 at 12:01 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> What benefits are you expecting from increasing parallelism? Better throughput?
>>
>>
>>
>> On Tue, 14 Mar 2023 at 15:58, Emmanouil Kritharakis <
>> kritharakismano...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I hope this email finds you well!
>>>
>>> I have a simple dataflow in which I read from a kafka topic, perform a
>>> map transformation and then I write the result to another topic. Based on
>>> your documentation here
>>> ,
>>> I need to work with Dataset data structures. Even though my solution works,
>>> I need to increase the parallelism. The spark documentation includes a lot
>>> of parameters that I can change based on specific data structures like
>>> *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The
>>> former is the default number of partitions in RDDs returned by
>>> transformations like join and reduceByKey, while the latter is not recommended
>>> for structured streaming, as described in the documentation: "Note: For
>>> structured streaming, this configuration cannot be changed between query
>>> restarts from the same checkpoint location".
>>>
>>> So my question is how can I increase the parallelism for a simple
>>> dataflow based on datasets with a map transformation only?
>>>
>>> I am looking forward to hearing from you as soon as possible. Thanks in
>>> advance!
>>>
>>> Kind regards,
>>>
>>> --
>>>
>>> Emmanouil (Manos) Kritharakis
>>>
>>> Ph.D. candidate in the Department of Computer Science
>>> 
>>>
>>> Boston University
>>>
>>


Re: Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Sean Owen
Are you just looking for DataFrame.repartition()?

On Tue, Mar 14, 2023 at 10:57 AM Emmanouil Kritharakis <
kritharakismano...@gmail.com> wrote:

> Hello,
>
> I hope this email finds you well!
>
> I have a simple dataflow in which I read from a kafka topic, perform a map
> transformation and then I write the result to another topic. Based on your
> documentation here
> ,
> I need to work with Dataset data structures. Even though my solution works,
> I need to increase the parallelism. The spark documentation includes a lot
> of parameters that I can change based on specific data structures like
> *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The former
> is the default number of partitions in RDDs returned by transformations
> like join and reduceByKey, while the latter is not recommended for structured
> streaming, as described in the documentation: "Note: For structured
> streaming, this configuration cannot be changed between query restarts from
> the same checkpoint location".
>
> So my question is how can I increase the parallelism for a simple dataflow
> based on datasets with a map transformation only?
>
> I am looking forward to hearing from you as soon as possible. Thanks in
> advance!
>
> Kind regards,
>
> --
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> 
>
> Boston University
>


Question related to asynchronous map transformation using Java Spark structured streaming

2023-03-14 Thread Emmanouil Kritharakis
Hello,

I hope this email finds you well!

I have a simple dataflow in which I read from a kafka topic, perform a map
transformation and then I write the result to another topic. Based on your
documentation here
,
I need to work with Dataset data structures. Even though my solution works,
I need to utilize the map transformation asynchronously. So my question is:
how can I asynchronously call the map transformation with Dataset data
structures in a Java Structured Streaming environment? Can you please share
a working example?
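
For context, a minimal synchronous version of the dataflow in question
(a typed Dataset with a map transformation; the broker, topics and the
upper-casing transform are placeholders); making the map step asynchronous
is the open question:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class SyncMapSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("sync-map-sketch")
        .getOrCreate();

    // Kafka source reduced to a typed Dataset<String> of message values.
    Dataset<String> values = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "input-topic")
        .load()
        .selectExpr("CAST(value AS STRING) AS value")
        .as(Encoders.STRING());

    // The synchronous map step that the question wants to run asynchronously.
    Dataset<String> mapped = values.map(
        (MapFunction<String, String>) String::toUpperCase,
        Encoders.STRING());

    mapped.writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "output-topic")
        .option("checkpointLocation", "/tmp/checkpoints/sync-map-sketch")
        .start()
        .awaitTermination();
  }
}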

I am looking forward to hearing from you as soon as possible. Thanks in
advance!

Kind regards

--

Emmanouil (Manos) Kritharakis

Ph.D. candidate in the Department of Computer Science


Boston University


Re: Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Mich Talebzadeh
What benefits are you expecting from increasing parallelism? Better throughput?



On Tue, 14 Mar 2023 at 15:58, Emmanouil Kritharakis <
kritharakismano...@gmail.com> wrote:

> Hello,
>
> I hope this email finds you well!
>
> I have a simple dataflow in which I read from a kafka topic, perform a map
> transformation and then I write the result to another topic. Based on your
> documentation here
> ,
> I need to work with Dataset data structures. Even though my solution works,
> I need to increase the parallelism. The spark documentation includes a lot
> of parameters that I can change based on specific data structures like
> *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The former
> is the default number of partitions in RDDs returned by transformations
> like join and reduceByKey, while the latter is not recommended for structured
> streaming, as described in the documentation: "Note: For structured
> streaming, this configuration cannot be changed between query restarts from
> the same checkpoint location".
>
> So my question is how can I increase the parallelism for a simple dataflow
> based on datasets with a map transformation only?
>
> I am looking forward to hearing from you as soon as possible. Thanks in
> advance!
>
> Kind regards,
>
> --
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> 
>
> Boston University
>


Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Emmanouil Kritharakis
Hello,

I hope this email finds you well!

I have a simple dataflow in which I read from a kafka topic, perform a map
transformation and then I write the result to another topic. Based on your
documentation here
,
I need to work with Dataset data structures. Even though my solution works,
I need to increase the parallelism. The spark documentation includes a lot
of parameters that I can change based on specific data structures like
*spark.default.parallelism* or *spark.sql.shuffle.partitions*. The former
is the default number of partitions in RDDs returned by transformations
like join and reduceByKey, while the latter is not recommended for structured
streaming, as described in the documentation: "Note: For structured
streaming, this configuration cannot be changed between query restarts from
the same checkpoint location".
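
For concreteness, a sketch of where those two settings are supplied
(placeholder values; as the quoted note says, the shuffle-partitions
setting is pinned by the checkpoint once a streaming query has run):

import org.apache.spark.sql.SparkSession;

public class ParallelismConfigSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("parallelism-config-sketch")
        // Default partition count for RDD shuffles (join, reduceByKey).
        .config("spark.default.parallelism", "200")
        // Partition count for DataFrame/Dataset shuffles; fixed by the
        // checkpoint after a streaming query's first run.
        .config("spark.sql.shuffle.partitions", "200")
        .getOrCreate();
  }
}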

So my question is how can I increase the parallelism for a simple dataflow
based on datasets with a map transformation only?

I am looking forward to hearing from you as soon as possible. Thanks in
advance!

Kind regards,

--

Emmanouil (Manos) Kritharakis

Ph.D. candidate in the Department of Computer Science


Boston University


Re: Topics for Spark online classes & webinars

2023-03-14 Thread Mich Talebzadeh
Hi Denny,

That Apache Spark LinkedIn page
https://www.linkedin.com/company/apachespark/ looks fine. It also allows a
wider audience to benefit from it.

+1 for me



On Tue, 14 Mar 2023 at 14:23, Denny Lee  wrote:

> In the past, we've been using the Apache Spark LinkedIn page
>  and group to broadcast
> these types of events - if you're cool with this? Or we could go through
> the process of submitting and updating the current
> https://spark.apache.org or request to leverage the original Spark
> confluence page .
>  WDYT?
>
> On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh 
> wrote:
>
>> Well that needs to be created first for this purpose. The appropriate
>> name etc. to be decided. Maybe @Denny Lee   can
>> facilitate this as he offered his help.
>>
>>
>> cheers
>>
>>
>>
>> On Mon, 13 Mar 2023 at 16:29, asma zgolli  wrote:
>>
>>> Hello Mich,
>>>
>>> Can you please provide the link for the confluence page?
>>>
>>> Many thanks
>>> Asma
>>> Ph.D. in Big Data - Applied Machine Learning
>>>
>>> On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh
>>> wrote:
>>>
 Apologies I missed the list.

 To move forward I selected these topics from the thread "Online classes
 for spark topics".

 To take this further I propose a confluence page to be set up.


1. Spark UI
2. Dynamic allocation
3. Tuning of jobs
4. Collecting spark metrics for monitoring and alerting
5.  For those who prefer to use Pandas API on Spark since the
release of Spark 3.2, what are some important notes for those users? For
example, what are the additional factors affecting the Spark performance
using Pandas API on Spark? How to tune them in addition to the 
 conventional
Spark tuning methods applied to Spark SQL users.
6. Spark internals and/or comparing spark 3 and 2
7. Spark Streaming & Spark Structured Streaming
8. Spark on notebooks
9. Spark on serverless (for example Spark on Google Cloud)
10. Spark on k8s

 Opinions and how-tos are welcome.


 On Mon, 13 Mar 2023 at 16:16, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi guys
>
> To move forward I selected these topics from the thread "Online
> classes for spark topics".
>
> To take this further I propose a confluence page to be set up.
>
> Opinions and how-tos are welcome.
>
> Cheers
>
>
>
>
>
>

>>>
>>>
>>>


Re: org.apache.spark.shuffle.FetchFailedException in dataproc

2023-03-14 Thread Gary Liu
Hi Mich,

The y-axis is the number of executors. The code ran on Dataproc Serverless
Spark 3.3.2.

I tried closing autoscaling by setting the following:

spark.dynamicAllocation.enabled=false
spark.executor.instances=60

And still got the FetchFailedException error. I wonder why it can run
without problems in a Vertex notebook with local mode, which has fewer
resources. Of course it ran a much longer time (8 hours in local mode vs. 30
min in serverless).
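
(For reference, a sketch of one way those two properties can be supplied
when the session is created; builder-time configuration is an assumption
about how the job is launched, with the values copied from above:)

import org.apache.spark.sql.SparkSession;

public class FixedExecutorsSketch {
  public static void main(String[] args) {
    // Disable dynamic allocation and pin the executor count, mirroring
    // the two properties quoted above.
    SparkSession spark = SparkSession.builder()
        .appName("fixed-executors-sketch")
        .config("spark.dynamicAllocation.enabled", "false")
        .config("spark.executor.instances", "60")
        .getOrCreate();
  }
}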

Will try to break the job into smaller parts and see exactly which step
causes the problem.

Thanks!

On Mon, Mar 13, 2023 at 11:26 AM Mich Talebzadeh 
wrote:

> Hi Gary
>
> Thanks for the update. So this is serverless Dataproc on 3.3.1. Maybe an
> autoscaling policy could be an option. What is the y-axis? Is that the
> capacity?
>
> Can you break down the join into multiple parts and save the intermediate
> result set?
>
>
> HTH
>
>
>
>
>
>
>
> On Mon, 13 Mar 2023 at 14:56, Gary Liu  wrote:
>
>> Hi Mich,
>> I used the serverless spark session, not the local mode in the notebook.
>> So machine type does not matter in this case. Below is the chart for
>> serverless spark session execution. I also tried to increase executor
>> memory and core, but the issue did got get resolved. I will try shutting
>> down autoscaling, and see what will happen.
>> [image: Serverless Session Executors-4core.png]
>>
>>
>> On Fri, Mar 10, 2023 at 11:55 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> For your Dataproc cluster, what type of machines are you using, for example
>>> n2-standard-4 with 4 vCPUs and 16GB, or something else? How many nodes, and
>>> is autoscaling turned on?
>>>
>>> Most likely an executor memory limit?
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, 10 Mar 2023 at 15:35, Gary Liu  wrote:
>>>
 Hi ,

 I have a job in a GCP Dataproc Serverless Spark session (Spark 3.3.2); it
 involves multiple joins as well as a complex UDF. I always get
 the below FetchFailedException, but the job completes and the results
 look right. Neither of the 2 inputs is very big (one is 6.5M rows * 11
 columns, ~150M in ORC format; the other 17.7M rows * 11 columns, ~400M in
 ORC format). It ran very smoothly on an on-premise Spark environment though.

 According to Google's document (
 https://cloud.google.com/dataproc/docs/support/spark-job-tuning#shuffle_fetch_failures),
 it has 3 solutions:
 1. Using EFM mode
 2. Increase executor memory
 3. Decrease the number of job partitions.

 1. I started the session from a vertex notebook, so I don't know how to
 use EFM mode.
 2. I increased executor memory from the default 12GB to 25GB, and the
 number of cores from 4 to 8, but it did not solve the problem.
 3. I wonder how to do this - repartition the input dataset to have fewer
 partitions? I used df.rdd.getNumPartitions() to check the input data
 partitions; they have 9 and 17 partitions respectively. Should I decrease
 them further? I also read a post on StackOverflow (
 https://stackoverflow.com/questions/34941410/fetchfailedexception-or-metadatafetchfailedexception-when-processing-big-data-se),
 saying increasing partitions may help. Which one makes more sense? I
 repartitioned the input data to 20 and 30 partitions, but still no luck.

 Any suggestions?

 23/03/10 14:32:19 WARN TaskSetManager: Lost task 58.1 in stage 27.0 (TID 
 3783) (10.1.0.116 executor 33): FetchFailed(BlockManagerId(72, 
 10.1.15.199, 36791, None), shuffleId=24, mapIndex=77, mapId=3457, 
 reduceId=58, message=
 org.apache.spark.shuffle.FetchFailedException
   at org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:312)
   at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1180)
   at ...

Re: Topics for Spark online classes & webinars

2023-03-14 Thread Denny Lee
In the past, we've been using the Apache Spark LinkedIn page
 and group to broadcast
these types of events - if you're cool with this? Or we could go through
the process of submitting and updating the current https://spark.apache.org
or request to leverage the original Spark confluence page
. WDYT?

On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh 
wrote:

> Well that needs to be created first for this purpose. The appropriate name
> etc. to be decided. Maybe @Denny Lee   can
> facilitate this as he offered his help.
>
>
> cheers
>
>
>
>
>
>
>
> On Mon, 13 Mar 2023 at 16:29, asma zgolli  wrote:
>
>> Hello Mich,
>>
>> Can you please provide the link for the confluence page?
>>
>> Many thanks
>> Asma
>> Ph.D. in Big Data - Applied Machine Learning
>>
>> On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh
>> wrote:
>>
>>> Apologies I missed the list.
>>>
>>> To move forward I selected these topics from the thread "Online classes
>>> for spark topics".
>>>
>>> To take this further I propose a confluence page to be set up.
>>>
>>>
>>>1. Spark UI
>>>2. Dynamic allocation
>>>3. Tuning of jobs
>>>4. Collecting spark metrics for monitoring and alerting
>>>5.  For those who prefer to use Pandas API on Spark since the
>>>release of Spark 3.2, what are some important notes for those users? For
>>>example, what are the additional factors affecting the Spark performance
>>>using Pandas API on Spark? How to tune them in addition to the 
>>> conventional
>>>Spark tuning methods applied to Spark SQL users.
>>>6. Spark internals and/or comparing spark 3 and 2
>>>7. Spark Streaming & Spark Structured Streaming
>>>8. Spark on notebooks
>>>9. Spark on serverless (for example Spark on Google Cloud)
>>>10. Spark on k8s
>>>
>>> Opinions and how-tos are welcome.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, 13 Mar 2023 at 16:16, Mich Talebzadeh 
>>> wrote:
>>>
 Hi guys

 To move forward I selected these topics from the thread "Online classes
 for spark topics".

 To take this further I propose a confluence page to be set up.

 Opinions and how-tos are welcome.

 Cheers






>>>
>>
>>
>>


Re: Topics for Spark online classes & webinars

2023-03-14 Thread Joris Billen
This is a very good idea - I would love to read such a confluence page.
Adding a section “common mistakes/misconceptions” might be useful for many of
these sections. It would describe the undesired behaviour/errors one would get
when not following some best practices.


On 13 Mar 2023, at 17:20, Mich Talebzadeh  wrote:

Apologies I missed the list.

To move forward I selected these topics from the thread "Online classes for 
spark topics".

To take this further I propose a confluence page to be set up.


  1.  Spark UI
  2.  Dynamic allocation
  3.  Tuning of jobs
  4.  Collecting spark metrics for monitoring and alerting
  5.   For those who prefer to use Pandas API on Spark since the release of 
Spark 3.2, what are some important notes for those users? For example, what are
the additional factors affecting the Spark performance using Pandas API on 
Spark? How to tune them in addition to the conventional Spark tuning methods 
applied to Spark SQL users.
  6.  Spark internals and/or comparing spark 3 and 2
  7.  Spark Streaming & Spark Structured Streaming
  8.  Spark on notebooks
  9.  Spark on serverless (for example Spark on Google Cloud)
  10. Spark on k8s

Opinions and how-tos are welcome.

 




On Mon, 13 Mar 2023 at 16:16, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
Hi guys

To move forward I selected these topics from the thread "Online classes for 
spark topics".

To take this further I propose a confluence page to be set up.

Opinions and how-tos are welcome.

Cheers


 





Re: Spark 3.3.2 not running with Antlr4 runtime latest version

2023-03-14 Thread yangjie01
From the release notes of antlr4, there are two key changes in antlr4 4.10:

1. 4.10-generated parsers are incompatible with previous runtimes

2. The minimum Java version increases to Java 11

So I personally think it is temporarily impossible for Spark to upgrade to an
antlr4 version at or above 4.10, because Spark still needs to support Java 8.

Yang Jie



From: Sean Owen
Date: Tuesday, 14 March 2023, 21:33
To: "karuna.s...@accenture.com"
Cc: "user@spark.apache.org", "Misra Parashar, Jyoti", "Ratra, Neelima",
"Jain, Neha T.", "George, Rejish", "APP.Security.CoE"
Subject: Re: Spark 3.3.2 not running with Antlr4 runtime latest version

You want Antlr 3 and Spark is on 4? No, I don't think Spark would downgrade.
You can maybe shade your app's dependencies.

On Tue, Mar 14, 2023 at 8:21 AM Sahu, Karuna 
 wrote:
Hi Team

We are upgrading a legacy application using Spring Boot, Spark and Hibernate.
While upgrading Hibernate to version 6.1.6.Final there is a mismatch between the
antlr4 runtime jar required by Hibernate and the latest Spark version. Details
for the issue are posted on StackOverflow as well:
Issue in running Spark 3.3.2 with Antlr 4.10.1 - Stack Overflow

Please let us know if an upgrade for this is being planned for the latest
Spark version.

Thanks
Karuna





Re: Spark 3.3.2 not running with Antlr4 runtime latest version

2023-03-14 Thread Sean Owen
You want Antlr 3 and Spark is on 4? No, I don't think Spark would downgrade.
You can maybe shade your app's dependencies.

On Tue, Mar 14, 2023 at 8:21 AM Sahu, Karuna
 wrote:

> Hi Team
>
>
>
> We are upgrading a legacy application using Spring Boot, Spark and
> Hibernate. While upgrading Hibernate to version 6.1.6.Final there is a
> mismatch between the antlr4 runtime jar required by Hibernate and the
> latest Spark version.
> Details for the issue are posted on StackOverflow as well:
>
> Issue in running Spark 3.3.2 with Antlr 4.10.1 - Stack Overflow
> 
>
>
>
> Please let us know if an upgrade for this is being planned for the latest
> Spark version.
>
>
>
> Thanks
>
> Karuna
>
> --
>
>


Spark 3.3.2 not running with Antlr4 runtime latest version

2023-03-14 Thread Sahu, Karuna
Hi Team

We are upgrading a legacy application using Spring Boot, Spark and Hibernate.
While upgrading Hibernate to version 6.1.6.Final there is a mismatch between the
antlr4 runtime jar required by Hibernate and the latest Spark version. Details
for the issue are posted on StackOverflow as well:
Issue in running Spark 3.3.2 with Antlr 4.10.1 - Stack Overflow

Please let us know if upgrades for this is being planned for latest Spark 
version.

Thanks
Karuna





Re: spark on k8s daemonset collect log

2023-03-14 Thread Cheng Pan
Filebeat supports multiline matching; here is an example [1].

BTW, I'm working on External Log Service integration [2]; it may be useful
in your case. Feel free to review/leave comments.

[1]
https://www.elastic.co/guide/en/beats/filebeat/current/multiline-examples.html#multiline
[2] https://github.com/apache/spark/pull/38357

Thanks,
Cheng Pan


On Mar 14, 2023 at 16:36:45, 404  wrote:

> hi, all
>
> Spark runs on k8s and uses a Filebeat DaemonSet to collect logs, writing
> them to Elasticsearch. The Docker logs are in JSON format, and each line is
> a JSON string. How can multi-line exceptions be merged?
>
>


spark on k8s daemonset collect log

2023-03-14 Thread 404
hi, all


Spark runs on k8s and uses a Filebeat DaemonSet to collect logs, writing them to
Elasticsearch. The Docker logs are in JSON format, and each line is a JSON
string. How can multi-line exceptions be merged?