[ 
https://issues.apache.org/jira/browse/SPARK-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nazarii Balkovskyi updated SPARK-13299:
---------------------------------------
    Attachment: SparkLimitIssue.png

> DataFrame limit operation is not consistent
> -------------------------------------------
>
>                 Key: SPARK-13299
>                 URL: https://issues.apache.org/jira/browse/SPARK-13299
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.3.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>            Reporter: Nazarii Balkovskyi
>              Labels: SparkSQL, dataframe
>         Attachments: SparkLimitIssue.png
>
>
> I ran into a problem using the limit method of the DataFrame API. 
> I am trying to get the first 999 records from an Avro source that contains 
> about 3.5K records. 
> {code:java}
> DataFrame df = sqlContext.load(inputSource, "com.databricks.spark.avro");
> df = df.limit(999);
> {code}
> After the save operation, the rows are not in the same order as in the input 
> data set. Sometimes the order is correct, but usually it is not. 
> {code:java}
> df.save(filepathToSave, "com.databricks.spark.avro", SaveMode.ErrorIfExists);
> {code}
> Here is the Spark plan (it may help to identify the cause of the 
> issue):
> == Parsed Logical Plan ==
> Limit 999
>  Relation[color#0,id#1,type#2,rand#3,junk#4] AvroRelation(hdfs://<server_name>:8020/tmp/hdfs.2016-02-12--10-18-55-171-488/hdfs.2016-02-12--10-19-05-109-895.avro,None,0)
> == Analyzed Logical Plan ==
> color: string, id: int, type: string, rand: int, junk: string
> Limit 999
>  Relation[color#0,id#1,type#2,rand#3,junk#4] AvroRelation(hdfs://<server_name>:8020/tmp/hdfs.2016-02-12--10-18-55-171-488/hdfs.2016-02-12--10-19-05-109-895.avro,None,0)
> == Optimized Logical Plan ==
> InMemoryRelation [color#0,id#1,type#2,rand#3,junk#4], true, 10000, StorageLevel(true, true, false, true, 1), (Limit 999), None
> == Physical Plan ==
> InMemoryColumnarTableScan [color#0,id#1,type#2,rand#3,junk#4], (InMemoryRelation [color#0,id#1,type#2,rand#3,junk#4], true, 10000, StorageLevel(true, true, false, true, 1), (Limit 999), None)
> Code Generation: true
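
A limit with no explicit sort does not, in general, promise which rows come back or in what order. The plain-Java sketch below is only a model of that effect and is not Spark code: the partition layout, the arrivalOrder parameter, and both method names are hypothetical and chosen for illustration. It shows that taking the first n rows depends on the order in which partitions are read, while sorting by a key first gives a stable result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LimitOrderDemo {

    // Hypothetical model of a limit with no ordering: rows are taken from
    // partitions in whatever order the partitions happen to be read.
    public static List<Integer> limitUnordered(List<List<Integer>> partitions,
                                               int[] arrivalOrder, int n) {
        List<Integer> out = new ArrayList<>();
        for (int p : arrivalOrder) {
            for (int row : partitions.get(p)) {
                if (out.size() == n) {
                    return out;
                }
                out.add(row);
            }
        }
        return out;
    }

    // Sort by the key first, then take n rows: the result no longer
    // depends on the order in which partitions are read.
    public static List<Integer> limitOrdered(List<List<Integer>> partitions, int n) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            all.addAll(partition);
        }
        Collections.sort(all);
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }

    public static void main(String[] args) {
        // Two partitions of row keys (assumed data, not from the report).
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3),
                Arrays.asList(4, 5, 6));

        // Same data, same limit, different partition read order.
        System.out.println(limitUnordered(partitions, new int[] {0, 1}, 4));
        System.out.println(limitUnordered(partitions, new int[] {1, 0}, 4));

        // Sorting before limiting gives the same rows every time.
        System.out.println(limitOrdered(partitions, 4));
    }
}
```

In DataFrame terms, the analogous approach would be something like df.orderBy("id").limit(999), assuming the id column from the plan above reflects the intended order; that assumption is not confirmed by the report.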



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

