Serhii created SPARK-31718:
------------------------------

             Summary: DataSourceV2 unexpected behavior with partition data distribution
                 Key: SPARK-31718
                 URL: https://issues.apache.org/jira/browse/SPARK-31718
             Project: Spark
          Issue Type: Bug
          Components: Java API
    Affects Versions: 2.4.0
            Reporter: Serhii
             Fix For: 2.4.0

Hi team,

We are using DataSourceV2 and have a question about the interface org.apache.spark.sql.sources.v2.writer.DataWriter<T>.

We have run into the following unexpected behavior. When we call repartition on a DataFrame, we expect Spark to create a new DataWriter instance for each partition and to send each partition's data to its own instance. Sometimes, however, we observe Spark sending data from different partitions to the same DataWriter instance. This happens intermittently on a YARN cluster. When we run the Spark job in local mode, Spark does create a new DataWriter instance for each partition after the repartition and sends the repartitioned data to the appropriate instances.

Could you explain whether this is a bug or expected behavior?
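
To help narrow down the behavior described above, here is a minimal diagnostic sketch in plain Java. It is not taken from the report and does not use Spark classes: WriterPartitionAudit, Tracker, observe and summary are hypothetical names. The assumption is that each DataWriter instance would hold one Tracker, built with the partitionId that Spark 2.4 passes to DataWriterFactory.createDataWriter(int partitionId, long taskId, long epochId), and that each record carries some tag identifying the partition it is expected to belong to (for example, the value of the column used for repartitioning; the report does not say how repartition was called).

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical diagnostic helper; none of these names are part of Spark's API.
public class WriterPartitionAudit {

    // Intended use: one Tracker per DataWriter instance.
    public static final class Tracker {
        // Assumed to be the partitionId given to createDataWriter.
        private final int writerPartitionId;
        // Distinct tags seen by this writer instance, kept sorted for stable output.
        private final Set<String> seenTags = new TreeSet<>();

        public Tracker(int writerPartitionId) {
            this.writerPartitionId = writerPartitionId;
        }

        // Call once per record from the writer's write method, passing the record's tag.
        public void observe(String tag) {
            seenTags.add(tag);
        }

        // True when this writer instance received more than one distinct tag.
        public boolean isMixed() {
            return seenTags.size() > 1;
        }

        public Set<String> seenTags() {
            return Collections.unmodifiableSet(seenTags);
        }

        // One-line summary, suitable for logging when the writer commits or aborts.
        public String summary() {
            return "writer partitionId=" + writerPartitionId
                    + " tags=" + seenTags
                    + " mixed=" + isMixed();
        }
    }

    public static void main(String[] args) {
        // Simulated writer that only ever sees one tag.
        Tracker clean = new Tracker(0);
        clean.observe("A");
        clean.observe("A");

        // Simulated writer that receives records with two different tags.
        Tracker mixed = new Tracker(1);
        mixed.observe("B");
        mixed.observe("C");

        System.out.println(clean.summary());
        System.out.println(mixed.summary());
    }
}
```

Logging summary() from each writer on the YARN cluster and in local mode would show whether a single writer instance really receives records with different tags, and under which partitionId. That output would make the report easier to reproduce.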