Re: Question related to parallelism using structed streaming parallelism

Sean Owen Tue, 14 Mar 2023 09:42:33 -0700

That's incorrect, it's spark.default.parallelism, but as the name suggests,
that is merely a default. You control partitioning directly with
.repartition()
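
A minimal PySpark sketch of what that could look like for a Kafka-to-Kafka job with a single map-like step. The function name build_query, its parameters, and the upper() transformation are all hypothetical stand-ins, not anything from the original job. Note that .repartition() adds a shuffle, so whether it helps depends on how expensive the map step is.

```python
def build_query(spark, bootstrap_servers, in_topic, out_topic,
                checkpoint_dir, num_partitions):
    """Read from Kafka, repartition, apply a map-like step, write to Kafka.

    `spark` is expected to be a pyspark.sql.SparkSession. All argument
    values are supplied by the caller; nothing here is taken from the
    original poster's job.
    """
    # Streaming source: a DataFrame with binary `key` and `value` columns.
    source = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", in_topic)
        .load()
    )

    # Explicitly control the number of partitions (and hence tasks) used
    # by the downstream transformation. This triggers a shuffle.
    transformed = (
        source
        .repartition(num_partitions)
        # Stand-in for the real map transformation (assumption).
        .selectExpr(
            "CAST(key AS STRING) AS key",
            "upper(CAST(value AS STRING)) AS value",
        )
    )

    # Streaming sink: the Kafka writer needs a `value` column and a
    # checkpoint location.
    return (
        transformed.writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("topic", out_topic)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

You would call it with an existing SparkSession, for example build_query(spark, "localhost:9092", "input-topic", "output-topic", "/tmp/checkpoint", 16), and then call awaitTermination() on the returned query. Those values are placeholders. Without the repartition, the Kafka source normally gives you one Spark partition per Kafka topic partition, so adding topic partitions is another way to raise parallelism.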


On Tue, Mar 14, 2023 at 11:37 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Check this link
>
>
> https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/
>
> You can set it
>
>         spark.conf.set("sparkDefaultParallelism", value)
>
>
> Have a look at the Streaming statistics in the Spark GUI, especially
> *Processing Time*, defined by the Spark GUI as the time taken to process
> all jobs of a batch. The *Scheduling Delay* and the *Total Delay* are
> additional indicators of health.
>
>
> Then decide how to set the value.
>
>
> HTH
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 14 Mar 2023 at 16:04, Emmanouil Kritharakis <
> kritharakismano...@gmail.com> wrote:
>
>> Yes, I need to check the performance of my streaming job in terms of
>> latency and throughput. Is there a working example of how to increase the
>> parallelism with Spark Structured Streaming using Dataset data structures?
>> Thanks in advance.
>>
>> Kind regards,
>>
>> ------------------------------------------------------------------
>>
>> Emmanouil (Manos) Kritharakis
>>
>> Ph.D. candidate in the Department of Computer Science
>> <https://sites.bu.edu/casp/people/ekritharakis/>
>>
>> Boston University
>>
>>
>> On Tue, Mar 14, 2023 at 12:01 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> What benefits are you going for with increasing parallelism? Better
>>> throughput?
>>>
>>> On Tue, 14 Mar 2023 at 15:58, Emmanouil Kritharakis <
>>> kritharakismano...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I hope this email finds you well!
>>>>
>>>> I have a simple dataflow in which I read from a Kafka topic, perform a
>>>> map transformation, and then write the result to another topic. Based on
>>>> your documentation here
>>>> <https://spark.apache.org/docs/3.3.2/structured-streaming-kafka-integration.html#content>,
>>>> I need to work with Dataset data structures. Even though my solution works,
>>>> I need to increase the parallelism. The Spark documentation includes a lot
>>>> of parameters that I can change based on specific data structures, like
>>>> *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The
>>>> former is the default number of partitions in RDDs returned by
>>>> transformations like join and reduceByKey, while the latter is not
>>>> recommended for Structured Streaming, as described in the documentation:
>>>> "Note: For structured streaming, this configuration cannot be changed
>>>> between query restarts from the same checkpoint location".
>>>>
>>>> So my question is how can I increase the parallelism for a simple
>>>> dataflow based on datasets with a map transformation only?
>>>>
>>>> I am looking forward to hearing from you as soon as possible. Thanks in
>>>> advance!
>>>>
>>>> Kind regards,
>>>>
>>>> ------------------------------------------------------------------
>>>>
>>>> Emmanouil (Manos) Kritharakis
>>>>
>>>> Ph.D. candidate in the Department of Computer Science
>>>> <https://sites.bu.edu/casp/people/ekritharakis/>
>>>>
>>>> Boston University
>>>>
>>>
