[ https://issues.apache.org/jira/browse/SPARK-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nazarii Balkovskyi updated SPARK-13299:
---------------------------------------
    Description:
I ran into a problem using the limit method of the DataFrame API.
I am trying to get the first 999 records from an Avro source that contains about 3.5K records.
{code:java}
DataFrame df = sqlContext.load(inputSource, "com.databricks.spark.avro");
df = df.limit(999);
{code}
After the save operation, the rows are not in the same order as in the input data set. Sometimes the order is correct, but usually it is not.
{code:java}
df.save(filepathToSave, "com.databricks.spark.avro", SaveMode.ErrorIfExists);
{code}

Here is the Spark plan (it may help to identify the cause of the issue):
{code}
== Parsed Logical Plan ==
Limit 999
 Filter (1 = 1)
  Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Analyzed Logical Plan ==
mobileNumber: bigint, tariff: string, debit: float
Limit 999
 Filter (1 = 1)
  Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Optimized Logical Plan ==
Limit 999
 Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Physical Plan ==
Limit 999
 Scan AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)[mobileNumber#0L,tariff#1,debit#2]

Code Generation: true
{code}

  was:
I ran into a problem using the limit method of the DataFrame API.
I am trying to get the first 999 records from an Avro source that contains about 3.5K records.
{code:java}
DataFrame df = sqlContext.load(inputSource, "com.databricks.spark.avro");
df = df.limit(999);
{code}
After the save operation, the rows are not in the same order as in the input data set. Sometimes the order is correct, but usually it is not.
{code:java}
df.save(filepathToSave, "com.databricks.spark.avro", SaveMode.ErrorIfExists);
{code}

Here is the Spark plan (it may help to identify the cause of the issue):
{code}
== Parsed Logical Plan ==
Limit 999
 Filter (1 = 1)
  Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Analyzed Logical Plan ==
mobileNumber: bigint, tariff: string, debit: float
Limit 999
 Filter (1 = 1)
  Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Optimized Logical Plan ==
Limit 999
 Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Physical Plan ==
Limit 999
 Scan AvroRelation(hdfs://<server_name>:8020/user/hdfs/clientsENG10M.avro,None,0)[mobileNumber#0L,tariff#1,debit#2]

Code Generation: true
{code}


> DataFrame limit operation is not consistent
> -------------------------------------------
>
>                 Key: SPARK-13299
>                 URL: https://issues.apache.org/jira/browse/SPARK-13299
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.3.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>            Reporter: Nazarii Balkovskyi
>              Labels: SparkSQL, dataframe
>         Attachments: SparkLimitIssue.png
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)