Hello,

I hope this email finds you well!

I have a simple dataflow in which I read from a Kafka topic, perform a map
transformation, and then write the result to another topic. Based on your
documentation here
<https://spark.apache.org/docs/3.3.2/structured-streaming-kafka-integration.html#content>,
I need to work with Dataset data structures. Even though my solution works,
I need to increase the parallelism. The Spark documentation includes many
parameters that I can change for specific data structures, such as
*spark.default.parallelism* or *spark.sql.shuffle.partitions*. The former
is the default number of partitions in RDDs returned by transformations
like join and reduceByKey, while the latter is not recommended for
Structured Streaming, as described in the documentation: "Note: For
structured streaming, this configuration cannot be changed between query
restarts from the same checkpoint location".
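For concreteness, here is a minimal sketch of the dataflow I described. The topic names, bootstrap servers, checkpoint location, and the map logic itself are placeholders; only the read-Kafka / map / write-Kafka shape matters:

```scala
// Minimal sketch of the dataflow (hypothetical topic names, servers, and map logic).
import org.apache.spark.sql.SparkSession

object KafkaMapDataflow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KafkaMapDataflow").getOrCreate()
    import spark.implicits._

    // Read from the source topic as a streaming Dataset.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical
      .option("subscribe", "input-topic")                  // hypothetical
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .as[(String, String)]

    // The single map transformation (placeholder logic).
    val mapped = input
      .map { case (key, value) => (key, value.toUpperCase) }
      .toDF("key", "value")

    // Write the result to the sink topic.
    val query = mapped.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical
      .option("topic", "output-topic")                     // hypothetical
      .option("checkpointLocation", "/tmp/checkpoint")     // hypothetical
      .start()

    query.awaitTermination()
  }
}
```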

So my question is: how can I increase the parallelism for a simple
dataflow based on Datasets with only a map transformation?

I look forward to hearing from you. Thanks in advance!

Kind regards,

------------------------------------------------------------------

Emmanouil (Manos) Kritharakis

Ph.D. candidate in the Department of Computer Science
<https://sites.bu.edu/casp/people/ekritharakis/>

Boston University
