Are you just looking for DataFrame.repartition()?

On Tue, Mar 14, 2023 at 10:57 AM Emmanouil Kritharakis
<kritharakismano...@gmail.com> wrote:
> Hello,
>
> I hope this email finds you well!
>
> I have a simple dataflow in which I read from a Kafka topic, perform a map
> transformation, and then write the result to another topic. Based on your
> documentation here
> <https://spark.apache.org/docs/3.3.2/structured-streaming-kafka-integration.html#content>,
> I need to work with Dataset data structures. Even though my solution works,
> I need to increase the parallelism. The Spark documentation includes many
> parameters I can change depending on the data structure, such as
> *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The former
> is the default number of partitions in RDDs returned by transformations
> like join and reduceByKey, while the latter is not recommended for
> structured streaming, as described in the documentation: "Note: For
> structured streaming, this configuration cannot be changed between query
> restarts from the same checkpoint location".
>
> So my question is: how can I increase the parallelism of a simple dataflow
> based on Datasets with only a map transformation?
>
> I am looking forward to hearing from you. Thanks in advance!
>
> Kind regards,
>
> ------------------------------------------------------------------
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> <https://sites.bu.edu/casp/people/ekritharakis/>
>
> Boston University
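For what it's worth, a minimal sketch of that suggestion: calling repartition() on the Dataset between the Kafka source and the map so the map runs with more tasks. Everything concrete here (topic names, broker address, the count of 16, the checkpoint path, the uppercase map) is a placeholder I made up for illustration, not anything from the thread or the docs:

```scala
// Sketch only: all option values below are placeholders.
import org.apache.spark.sql.SparkSession

object KafkaMapKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("kafka-map-kafka")
      .getOrCreate()
    import spark.implicits._

    // Source: parallelism here is initially bounded by the topic's
    // Kafka partition count.
    val in = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "input-topic")                  // placeholder
      .load()

    val mapped = in
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .as[(String, String)]
      .repartition(16)                              // widen the map stage to 16 tasks
      .map { case (k, v) => (k, v.toUpperCase) }    // placeholder map transformation
      .toDF("key", "value")

    // Sink: back to Kafka; checkpointLocation is required for Kafka sinks.
    val query = mapped.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")        // placeholder
      .option("topic", "output-topic")                            // placeholder
      .option("checkpointLocation", "/tmp/checkpoints/kafka-map") // placeholder
      .start()

    query.awaitTermination()
  }
}
```

Note that repartition() inserts a shuffle, so it only pays off if the map is expensive relative to the shuffle. The Kafka source also has a documented `minPartitions` option that splits Kafka partitions into smaller Spark partitions at read time, which avoids the extra shuffle and may be the better fit here.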