[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

sam (JIRA) Wed, 30 May 2018 01:04:30 -0700


    [ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494852#comment-16494852 ]


sam commented on SPARK-20144:
-----------------------------

Regarding the original issue of sorting, I agree with [~srowen] that it 
should be up to the user to ask explicitly for sorted data. Fundamentally, 
Spark implements the MapReduce programming paradigm, which is defined in 
terms of multisets, so no row ordering is implied unless the user requests 
one. [~icexelloss] Please read 
[http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf]
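
To illustrate the multiset point, here is a minimal sketch in plain Python 
(not Spark code). The row_id column and the sample rows are made up for 
illustration; the original report does not say which column defines the 
order.

```python
from collections import Counter

# Hypothetical rows of the form (row_id, value), in the order they were written.
written = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]

# Simulates a reader that combined files into fewer partitions and
# returned them in a different order.
read_back = [(3, "c"), (4, "d"), (1, "a"), (2, "b")]

# Compared as multisets, the two collections are identical.
same_multiset = Counter(written) == Counter(read_back)

# The original order comes back only when it is requested explicitly.
restored = sorted(read_back, key=lambda row: row[0])

print(same_multiset)
print(restored == written)
```

In Spark the analogous step would be an explicit orderBy on such a column 
after the read, assuming the data carries a column that encodes the 
intended order.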

Regarding my issue of Spark reducing the number of partitions without being 
asked to by the user, I've created a separate issue: 
https://issues.apache.org/jira/browse/SPARK-24425

> spark.read.parquet no long maintains ordering of the data
> ---------------------------------------------------------
>
>                 Key: SPARK-20144
>                 URL: https://issues.apache.org/jira/browse/SPARK-20144
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Li Jin
>            Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> that when we read parquet files in 2.0.2, the ordering of rows in the 
> resulting dataframe is not the same as the ordering of rows in the dataframe 
> that the parquet file was produced from. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is 
> still an issue in 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
