[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815755#comment-16815755 ]
David Greenberg commented on SPARK-20144:
-----------------------------------------

Hello, this issue is also a major one for me. Almost all of the data I work with has a natural sort order, and I store it in CSV, Parquet, and ORC. Unfortunately, some of my datasets are very large, so I waste a lot of compute time re-establishing order after loading those datasets out of storage, because Spark throws away ordering information at load and store time. I would really like to see a solution to this problem, as it is fairly expensive for our bottom line when using Spark.

> spark.read.parquet no longer maintains ordering of the data
> -----------------------------------------------------------
>
>                 Key: SPARK-20144
>                 URL: https://issues.apache.org/jira/browse/SPARK-20144
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Li Jin
>            Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is that when we read Parquet files in 2.0.2, the ordering of rows in the resulting DataFrame is not the same as the ordering of rows in the DataFrame that the Parquet file was produced from.
> This is because FileSourceStrategy.scala combines the Parquet files into fewer partitions and also reorders them. This breaks our workflows, because they assume a particular ordering of the data.
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so I am not sure whether this is an issue with 2.1 as well.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org