Is this what you are trying to implement?

I have a state data source which enables the state in SS (Structured
Streaming) to be rewritten via a batch query, which in turn enables
repartitioning, schema evolution, etc. The writer requires hash
partitioning against the group key, with the "desired number of
partitions", matching how Spark itself reads and writes state.

This is now implemented as DSv1, and the requirement is *met simply by
calling repartition with the "desired number"*:

```
val fullPathsForKeyColumns = keySchema.map(key => new Column(s"key.${key.name}"))
data
  .repartition(newPartitions, fullPathsForKeyColumns: _*) // hash-partition by the group key columns
  .queryExecution
  .toRdd
  .foreachPartition(
    writeFn(resolvedCpLocation, version, operatorId, storeName, keySchema, valueSchema,
      storeConf, hadoopConfBroadcast, queryId))
```

However, Spark will not know the optimum value of newPartitions; you will
need to work that out yourself from the size of the streaming state, for
example along the lines of the sketch below.
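
As a rough illustration (the helper below and its inputs are placeholders
of mine, not from your code), you could size newPartitions against a
target number of bytes per partition, using whatever measurement of total
state size you have:

```
// Hypothetical sizing helper: totalStateBytes (measured from the checkpoint)
// and targetBytesPerPartition (tuned per job) are illustrative assumptions.
def desiredPartitions(
    totalStateBytes: Long,
    targetBytesPerPartition: Long = 128L * 1024 * 1024): Int =
  math.max(1, math.ceil(totalStateBytes.toDouble / targetBytesPerPartition).toInt)

// e.g. ~64 GB of state at 128 MB per partition -> 512 partitions
val newPartitions = desiredPartitions(totalStateBytes = 64L * 1024 * 1024 * 1024)
```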

Is that a correct understanding?

HTH


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


LinkedIn: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
https://en.everybodywiki.com/Mich_Talebzadeh

On Sun, 18 Jun 2023 at 10:12, Pengfei Li <lpengfei2...@gmail.com> wrote:

> Hi All,
>
> I'm developing a DataSource on Spark 3.2 to write data to our system,
> using the DataSource V2 API. I want to implement the interface
> RequiresDistributionAndOrdering
> <https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/RequiresDistributionAndOrdering.java>
> to set the number of partitions used for the write, but I don't know how
> to implement a distribution without a shuffle, the way RDD.coalesce does.
> Is there any example or advice?
>
> Thank You
> Best Regards
>
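
As a footnote to the question above, a minimal sketch of implementing
RequiresDistributionAndOrdering on Spark 3.2 might look like the following
(MyWrite and the "key" column are made-up names, and toBatch is left
unimplemented). As far as I can tell, Spark 3.2 satisfies such a
requirement by inserting a shuffle, and it rejects a required number of
partitions combined with an unspecified distribution, so a shuffle-free
coalesce in the style of RDD.coalesce is not expressible through this
interface in that version.

```
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortOrder}
import org.apache.spark.sql.connector.write.{BatchWrite, RequiresDistributionAndOrdering}

// Hypothetical Write implementation; only the distribution and
// partition-count requirements are sketched.
class MyWrite(numPartitions: Int) extends RequiresDistributionAndOrdering {

  // Ask Spark to cluster (hash-partition) input rows by the "key" column.
  override def requiredDistribution(): Distribution =
    Distributions.clustered(Array[Expression](Expressions.column("key")))

  // No ordering requirement within each partition.
  override def requiredOrdering(): Array[SortOrder] = Array.empty[SortOrder]

  // Spark repartitions the input to exactly this many partitions;
  // returning 0 would mean "no requirement".
  override def requiredNumPartitions(): Int = numPartitions

  override def toBatch(): BatchWrite =
    throw new UnsupportedOperationException("sketch only")
}
```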
