Re: Question related to parallelism using structured streaming parallelism

2023-03-21 Thread Mich Talebzadeh
Or download it from here:

https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited

View my LinkedIn profile

https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 21 Mar 2023 at 15:38, Mich Talebzadeh wrote:

> Hi Emmanouil and anyone else interested
>
> Sounds like you may benefit from this booklet. Not the latest, but good
> enough.
>
> HTH
>


Re: Question related to parallelism using structured streaming parallelism

2023-03-21 Thread Sean Owen
Yes more specifically, you can't ask for executors once the app starts,
in SparkConf like that. You set this when you launch it against a Spark
cluster in spark-submit or otherwise.
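
To illustrate Sean's point: executor counts and sizes are fixed when the application is launched, for example via spark-submit flags. The script name and resource numbers below are placeholders, and the exact flags vary by cluster manager (e.g. --num-executors applies under YARN), so treat this only as a sketch:

```shell
# Executors are requested at launch time, not from SparkConf inside a running app
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  my_streaming_job.py
```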

On Tue, Mar 21, 2023 at 4:23 AM Mich Talebzadeh wrote:

> Hi Emmanouil,
>
> This means that your job is running on the driver as a single JVM, hence
> active(1)
>
>


Re: Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Mich Talebzadeh
In Spark Structured Streaming we cannot change the partitioning of a running
query with repartition() without stopping the streaming process.

Admittedly, it is not a parameter that I have played around with. I
still think the Spark GUI should provide some insight.










Re: Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Sean Owen
That's incorrect, it's spark.default.parallelism, but as the name suggests,
that is merely a default. You control partitioning directly with
.repartition()



Re: Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Mich Talebzadeh
Check this link

https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/

You can set it

spark.conf.set("sparkDefaultParallelism", value)


Have a look at Streaming statistics in the Spark GUI, especially *Processing
Time*, defined by the Spark GUI as the time taken to process all jobs of a
batch. *Scheduling Delay* and *Total Delay* are additional indicators
of health.


then decide how to set the value.


HTH






On Tue, 14 Mar 2023 at 16:04, Emmanouil Kritharakis <
kritharakismano...@gmail.com> wrote:

> Yes I need to check the performance of my streaming job in terms of
> latency and throughput. Is there any working example of how to increase the
> parallelism with spark structured streaming  using Dataset data structures?
> Thanks in advance.
>
> Kind regards,
>
> --
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> 
>
> Boston University
>
>


Re: Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Sean Owen
Are you just looking for DataFrame.repartition()?



Re: Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Mich Talebzadeh
What benefits are you expecting from increasing parallelism? Better throughput?









Question related to parallelism using structured streaming parallelism

2023-03-14 Thread Emmanouil Kritharakis
Hello,

I hope this email finds you well!

I have a simple dataflow in which I read from a Kafka topic, perform a map
transformation and then I write the result to another topic. Based on your
documentation here, I need to work with Dataset data structures. Even though
my solution works, I need to increase the parallelism. The Spark
documentation includes a lot
I need to increase the parallelism. The spark documentation includes a lot
of parameters that I can change based on specific data structures like
*spark.default.parallelism* or *spark.sql.shuffle.partitions*. The former
is the default number of partitions in RDDs returned by transformations
like join or reduceByKey, while the latter is not recommended for structured
streaming as it is described in documentation: "Note: For structured
streaming, this configuration cannot be changed between query restarts from
the same checkpoint location".

So my question is how can I increase the parallelism for a simple dataflow
based on datasets with a map transformation only?

I am looking forward to hearing from you as soon as possible. Thanks in
advance!

Kind regards,

--

Emmanouil (Manos) Kritharakis

Ph.D. candidate in the Department of Computer Science


Boston University