[ 
https://issues.apache.org/jira/browse/SPARK-31718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Serhii updated SPARK-31718:
---------------------------
    Description: 
Hi team,
  
 We are using DataSourceV2.
  
 We have a question about the interface 
org.apache.spark.sql.sources.v2.writer.DataWriter<T>.
  
 We have encountered the following unexpected behavior.
 When we call repartition on a DataFrame, we expect Spark to create a new 
instance of the DataWriter interface for each partition and to send each 
partition's data to the corresponding instance, but sometimes we observe 
that Spark sends data from different partitions to the same DataWriter 
instance.
 This behavior occurs intermittently on a YARN cluster.
  
 When we run the Spark job in local mode, Spark does create a new DataWriter 
instance for each partition after the repartition and publishes each 
partition's data to the corresponding instance.
  

Is it possible that Spark limits the number of DataWriter instances?


 Could you clarify whether this is a bug or expected behavior?
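
To make the expectation concrete, here is a minimal standalone sketch. Note that it has no Spark dependency: the two interfaces are local stand-ins that only mirror the shape of the Spark 2.4 org.apache.spark.sql.sources.v2.writer API (the real factory method is createDataWriter(int partitionId, long taskId, long epochId)). It illustrates the one-writer-instance-per-write-task pattern we expect, assuming no task retries or speculative execution.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for org.apache.spark.sql.sources.v2.writer.DataWriter<T>.
interface DataWriter<T> {
    void write(T record);
}

// Local stand-in for DataWriterFactory<T>; the Spark 2.4 signature also
// takes a long epochId, omitted here for brevity.
interface DataWriterFactory<T> {
    DataWriter<T> createDataWriter(int partitionId, long taskId);
}

public class WriterPerPartitionSketch {
    // Simulates Spark calling the factory once per write task: with
    // numPartitions partitions (and no retries) we expect numPartitions
    // distinct writer instances.
    public static int runSketch(int numPartitions) {
        final List<DataWriter<String>> writers = new ArrayList<>();
        DataWriterFactory<String> factory = (partitionId, taskId) -> {
            DataWriter<String> w = record ->
                System.out.println("partition " + partitionId + " received " + record);
            writers.add(w);
            return w;
        };
        for (int p = 0; p < numPartitions; p++) {
            DataWriter<String> writer = factory.createDataWriter(p, p);
            writer.write("row-from-partition-" + p);
        }
        return writers.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct writers created: " + runSketch(3));
    }
}
```

Under this reading of the contract, seeing data from two partitions arrive at one writer instance would only be expected if the factory deliberately returned a shared object.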



> DataSourceV2 unexpected behavior with partition data distribution
> ------------------------------------------------------------------
>
>                 Key: SPARK-31718
>                 URL: https://issues.apache.org/jira/browse/SPARK-31718
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.0
>            Reporter: Serhii
>            Priority: Major
>             Fix For: 2.4.0
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
