[jira] [Created] (SPARK-31718) DataSourceV2 unexpected behavior with partition data distribution

Serhii (Jira) Fri, 15 May 2020 01:58:24 -0700

Serhii created SPARK-31718:
------------------------------

             Summary:  DataSourceV2 unexpected behavior with partition data 
distribution
                 Key: SPARK-31718
                 URL: https://issues.apache.org/jira/browse/SPARK-31718
             Project: Spark
          Issue Type: Bug
          Components: Java API
    Affects Versions: 2.4.0
            Reporter: Serhii
             Fix For: 2.4.0



Hi team,
 
We are using DataSourceV2.
 
We have a question regarding the interface 
org.apache.spark.sql.sources.v2.writer.DataWriter<T>.
 
We have encountered the following unexpected behavior.
When we call repartition on a DataFrame, we expect Spark to create a new 
DataWriter instance for each partition and to send each partition's data to 
its own instance. However, we sometimes observe that Spark sends data from 
different partitions to the same DataWriter instance.
This behavior occurs intermittently on a YARN cluster.
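
To make the expectation concrete, here is a minimal sketch in plain Java. It does not use the real Spark interfaces: PartitionWriter, createWriter and runJob are hypothetical stand-ins that only model the behavior we expect, namely one writer instance per partition that receives rows from that partition only. The task ids are made-up values for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class WriterPerPartitionSketch {

    /** Hypothetical stand-in for a DataWriter; it remembers the partition it was created for. */
    static final class PartitionWriter {
        final int partitionId;
        final long taskId;
        final List<String> rows = new ArrayList<>();

        PartitionWriter(int partitionId, long taskId) {
            this.partitionId = partitionId;
            this.taskId = taskId;
        }

        /** Fails loudly if a row from another partition reaches this writer. */
        void write(int sourcePartition, String row) {
            if (sourcePartition != partitionId) {
                throw new IllegalStateException("writer for partition " + partitionId
                        + " received a row from partition " + sourcePartition);
            }
            rows.add(row);
        }
    }

    /** Hypothetical stand-in for a writer factory: always returns a fresh writer. */
    static PartitionWriter createWriter(int partitionId, long taskId) {
        return new PartitionWriter(partitionId, taskId);
    }

    /** Models one write task per partition, each with its own writer instance. */
    static List<PartitionWriter> runJob(List<List<String>> partitions) {
        List<PartitionWriter> writers = new ArrayList<>();
        for (int p = 0; p < partitions.size(); p++) {
            PartitionWriter writer = createWriter(p, 100L + p); // task id is illustrative only
            for (String row : partitions.get(p)) {
                writer.write(p, row);
            }
            writers.add(writer);
        }
        return writers;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Collections.<String>emptyList());
        for (PartitionWriter w : runJob(partitions)) {
            System.out.println("partition " + w.partitionId + " task " + w.taskId + " rows " + w.rows);
        }
    }
}
```

What we sometimes see on YARN corresponds to write being called on one writer with rows from more than one partition, which in this sketch would raise the IllegalStateException.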
 
If we run the Spark job in local mode, Spark does create a new DataWriter 
instance for each partition after the repartition and sends each partition's 
data to its own instance.
 
Can you explain whether this is a bug or expected behavior?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
