[jira] [Comment Edited] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data

Li Jin (JIRA) Fri, 31 Mar 2017 07:16:00 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950979#comment-15950979
 ]


Li Jin edited comment on SPARK-20144 at 3/31/17 2:14 PM:
---------------------------------------------------------

Thanks for getting back to me.

Sorting in this case would just add extra cost to our workflow, which is 
exactly what we are trying to avoid in the first place.

Because a DataFrame presents data in a tabular format, it is very surprising 
that the ordering of rows in the table changes after a round trip through HDFS. 
In every other tabular format that I know of, ordering of rows is a property of 
the data, so it is unexpected that writing and reading back changes it. This is 
also a bit scary: if ordering is not a property of a DataFrame, could things 
like cache or select("col") change the ordering of rows in the future? 



was (Author: icexelloss):
Thanks for getting back to me.

Sorting in this case will just add extra cost to our workflow and we are 
trying to avoid it in the first place.

Because DataFrame presents the data in a tabular format, it is very surprising 
that the table changes after going through hdfs. In any other tabular format 
that I know of, ordering of rows is a property of the data and it is surprising 
that reading/writing changes properties of the data. This is also a bit scary 
because if ordering were not a property of a DataFrame, can things like cache 
or select("col") change ordering of rows in the future? 


> spark.read.parquet no longer maintains ordering of the data
> -----------------------------------------------------------
>
>                 Key: SPARK-20144
>                 URL: https://issues.apache.org/jira/browse/SPARK-20144
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Li Jin
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> that when we read parquet files in 2.0.2, the ordering of rows in the 
> resulting dataframe is not the same as the ordering of rows in the dataframe 
> that the parquet files were written from. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reorders them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is 
> also an issue in 2.1.
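
To illustrate the behaviour described in the issue, here is a minimal sketch in 
plain Python of how packing files into fewer partitions can change row order. 
It is a simplified model and not Spark's actual code: the function pack_files, 
the largest-first greedy packing, the file sizes and the 50-byte limit are all 
assumptions made up for this example.

```python
from typing import List, Tuple

# Hypothetical model of a file: (name, size in bytes, rows in written order).
File = Tuple[str, int, List[int]]


def pack_files(files: List[File], max_partition_bytes: int) -> List[List[File]]:
    """Greedily pack files into partitions, largest files first.

    Assumption for illustration only: a new partition is started whenever
    adding the next file would exceed max_partition_bytes.
    """
    partitions: List[List[File]] = []
    current: List[File] = []
    current_size = 0
    for f in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + f[1] > max_partition_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f[1]
    if current:
        partitions.append(current)
    return partitions


def read_rows(partitions: List[List[File]]) -> List[int]:
    """Concatenate rows partition by partition, file by file."""
    return [row for part in partitions for f in part for row in f[2]]


# Four files, in the order they were written.
files: List[File] = [
    ("part-00000", 10, [1, 2]),
    ("part-00001", 40, [3, 4, 5, 6]),
    ("part-00002", 20, [7, 8]),
    ("part-00003", 30, [9, 10, 11]),
]

partitions = pack_files(files, max_partition_bytes=50)
written_order = [row for f in files for row in f[2]]
read_order = read_rows(partitions)

print(len(partitions))  # 3 partitions for 4 files
print(written_order)    # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(read_order)       # [3, 4, 5, 6, 9, 10, 11, 7, 8, 1, 2]
```

In this model every row is still read exactly once, but the rows no longer come 
back in the order they were written, which is the symptom reported here.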



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
