In Spark Structured Streaming we cannot change repartition() on a running
query without stopping the streaming process first.
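
For illustration, a minimal sketch of where repartition() sits, assuming
inputDf is a streaming Dataset that has already been defined (the partition
count and console sink are illustrative):

    // repartition() is part of the streaming query plan; changing the
    // number below means stopping the query and starting it again
    val repartitioned = inputDf.repartition(40)

    val query = repartitioned.writeStream
      .format("console")
      .start()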

Admittedly, it is not a parameter that I have played around with. I still
think the Spark GUI should provide some insight.

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 14 Mar 2023 at 16:42, Sean Owen <sro...@gmail.com> wrote:

> That's incorrect, it's spark.default.parallelism, but as the name
> suggests, that is merely a default. You control partitioning directly with
> .repartition()
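> For example, df.repartition(40) on the streaming Dataset, applied when
> the query is defined (the count 40 is illustrative).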
>
> On Tue, Mar 14, 2023 at 11:37 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Check this link
>>
>>
>> https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/
>>
>> You can set it
>>
>>         spark.conf.set("spark.default.parallelism", value)
>>
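>> Note that spark.default.parallelism is read when the SparkContext is
>> created, so in practice it is supplied at session build time or through
>> spark-submit. A sketch, with an illustrative app name and value:
>>
>>         val spark = SparkSession.builder
>>           .appName("myApp")
>>           .config("spark.default.parallelism", "40")
>>           .getOrCreate()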
>>
>> Have a look at the Streaming statistics in the Spark GUI, especially
>> *Processing Time*, defined by the GUI as the time taken to process all
>> jobs of a batch. *Scheduling Delay* and *Total Delay* are additional
>> indicators of health. Then decide how to set the value.
>>
>>
>> HTH
>>
>> On Tue, 14 Mar 2023 at 16:04, Emmanouil Kritharakis <
>> kritharakismano...@gmail.com> wrote:
>>
>>> Yes, I need to check the performance of my streaming job in terms of
>>> latency and throughput. Is there a working example of how to increase
>>> the parallelism with Spark Structured Streaming using Dataset data
>>> structures? Thanks in advance.
>>>
>>> Kind regards,
>>>
>>> ------------------------------------------------------------------
>>>
>>> Emmanouil (Manos) Kritharakis
>>>
>>> Ph.D. candidate in the Department of Computer Science
>>> <https://sites.bu.edu/casp/people/ekritharakis/>
>>>
>>> Boston University
>>>
>>>
>>> On Tue, Mar 14, 2023 at 12:01 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> What benefit are you expecting from increasing parallelism? Better
>>>> throughput?
>>>>
>>>> On Tue, 14 Mar 2023 at 15:58, Emmanouil Kritharakis <
>>>> kritharakismano...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I hope this email finds you well!
>>>>>
>>>>> I have a simple dataflow in which I read from a Kafka topic, perform a
>>>>> map transformation, and then write the result to another topic. Based
>>>>> on your documentation here
>>>>> <https://spark.apache.org/docs/3.3.2/structured-streaming-kafka-integration.html#content>,
>>>>> I need to work with Dataset data structures. Even though my solution
>>>>> works, I need to increase the parallelism. The Spark documentation
>>>>> includes a lot of parameters that I can change based on specific data
>>>>> structures, like *spark.default.parallelism* or
>>>>> *spark.sql.shuffle.partitions*. The former is the default number of
>>>>> partitions in RDDs returned by transformations like join and
>>>>> reduceByKey, while the latter is not recommended for Structured
>>>>> Streaming, as described in the documentation: "Note: For structured
>>>>> streaming, this configuration cannot be changed between query restarts
>>>>> from the same checkpoint location".
>>>>>
>>>>> So my question is how can I increase the parallelism for a simple
>>>>> dataflow based on datasets with a map transformation only?
>>>>>
>>>>> I am looking forward to hearing from you as soon as possible. Thanks
>>>>> in advance!
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> ------------------------------------------------------------------
>>>>>
>>>>> Emmanouil (Manos) Kritharakis
>>>>>
>>>>> Ph.D. candidate in the Department of Computer Science
>>>>> <https://sites.bu.edu/casp/people/ekritharakis/>
>>>>>
>>>>> Boston University
>>>>>
>>>>
