Hello, I hope this email finds you well!
I have a simple dataflow in which I read from a Kafka topic, perform a map transformation, and then write the result to another topic. Based on your documentation here <https://spark.apache.org/docs/3.3.2/structured-streaming-kafka-integration.html#content>, I need to work with Dataset data structures. My solution works, but I need to increase its parallelism.

The Spark documentation includes many parameters that I can change for specific data structures, such as *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The former is the default number of partitions in RDDs returned by transformations like join and reduceByKey, while the latter is not recommended for Structured Streaming, as described in the documentation: "Note: For structured streaming, this configuration cannot be changed between query restarts from the same checkpoint location."

So my question is: how can I increase the parallelism of a simple Dataset-based dataflow with only a map transformation?

I am looking forward to hearing from you. Thanks in advance!

Kind regards,
------------------------------------------------------------------
Emmanouil (Manos) Kritharakis
Ph.D. candidate in the Department of Computer Science <https://sites.bu.edu/casp/people/ekritharakis/>
Boston University
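P.S. For concreteness, here is a minimal PySpark sketch of the dataflow described above. The topic names, bootstrap server, checkpoint path, and the uppercase map are placeholders for my actual job, not the real code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder.appName("kafka-map-dataflow").getOrCreate()

# Read from the input Kafka topic (server and topic names are placeholders)
source = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "input-topic")
          .load())

# Simple map transformation: cast the binary key/value to strings
# and transform the value (uppercasing stands in for my real map)
mapped = (source
          .selectExpr("CAST(key AS STRING) AS key",
                      "CAST(value AS STRING) AS value")
          .withColumn("value", upper(col("value"))))

# Write the result to the output Kafka topic
query = (mapped.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "output-topic")
         .option("checkpointLocation", "/tmp/checkpoints/kafka-map")
         .start())

query.awaitTermination()
```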