[jira] [Updated] (SPARK-13299) DataFrame limit operation is not consistent

Nazarii Balkovskyi (JIRA) Fri, 12 Feb 2016 04:35:33 -0800

     [ https://issues.apache.org/jira/browse/SPARK-13299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Nazarii Balkovskyi updated SPARK-13299:
---------------------------------------
    Description: 
I ran into a problem using the limit method of the DataFrame API. 
I am trying to get the first 999 records from an Avro source that contains 
about 3.5K records. 

{code:java}
DataFrame df = sqlContext.load(inputSource, "com.databricks.spark.avro");

df = df.limit(999);
{code}

After the save operation, the rows are not in the same order as in the input 
data set. Sometimes the order is correct, but usually it is not. 

{code:java}
df.save(filepathToSave, "com.databricks.spark.avro", SaveMode.ErrorIfExists);
{code}

Here is the Spark plan (it may help to figure out the cause of the issue):
{code}
== Parsed Logical Plan ==
Limit 999
 Filter (1 = 1)
  Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Analyzed Logical Plan ==
mobileNumber: bigint, tariff: string, debit: float
Limit 999
 Filter (1 = 1)
  Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Optimized Logical Plan ==
Limit 999
 Relation[mobileNumber#0L,tariff#1,debit#2] AvroRelation(hdfs://<server_name>:8020/user/hdfs/dataset.avro,None,0)

== Physical Plan ==
Limit 999
 Scan AvroRelation(hdfs://lssparkmaster.edvantis.com:8020/user/hdfs/clientsENG10M.avro,None,0)[mobileNumber#0L,tariff#1,debit#2]

Code Generation: true
{code}
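
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the physical plan above is a Limit directly over a Scan, with no Sort step, so nothing in the plan fixes which rows come first when the source is read as several partitions. The plain-Java sketch below is only an analogy. It uses no Spark classes, and `LimitOrderSketch`, `limitOnly`, and `sortThenLimit` are hypothetical names. It shows that "take the first N rows" depends on the order in which partitions arrive unless the rows are sorted by a key first.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Analogy only: each inner list stands in for one partition of row keys.
// This is NOT how Spark is implemented; it only illustrates the ordering issue.
class LimitOrderSketch {

    // Takes the first n keys in whatever order the partitions are supplied.
    static List<Long> limitOnly(List<List<Long>> partitions, int n) {
        return partitions.stream()
                .flatMap(List::stream)
                .limit(n)
                .collect(Collectors.toList());
    }

    // Sorts all keys first, then takes n, so partition order no longer matters.
    static List<Long> sortThenLimit(List<List<Long>> partitions, int n) {
        return partitions.stream()
                .flatMap(List::stream)
                .sorted()
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Long>> fileOrder = Arrays.asList(
                Arrays.asList(1L, 2L, 3L), Arrays.asList(4L, 5L, 6L));
        // Same partitions, but arriving in the opposite order.
        List<List<Long>> swapped = Arrays.asList(
                Arrays.asList(4L, 5L, 6L), Arrays.asList(1L, 2L, 3L));

        System.out.println(limitOnly(fileOrder, 4));
        System.out.println(limitOnly(swapped, 4));
        System.out.println(sortThenLimit(fileOrder, 4));
        System.out.println(sortThenLimit(swapped, 4));
    }
}
```

In Spark itself the analogous step would presumably be an explicit `df.orderBy("mobileNumber")` before `limit(999)`. Note that this yields key order, which is not necessarily the original file order.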

  was:
I ran into a problem using the limit method of the DataFrame API. 
I am trying to get the first 999 records from an Avro source that contains 
about 3.5K records. 

{code:java}
DataFrame df = sqlContext.load(inputSource, "com.databricks.spark.avro");

df = df.limit(999);
{code}

After the save operation, the rows are not in the same order as in the input 
data set. Sometimes the order is correct, but usually it is not. 

{code:java}
df.save(filepathToSave, "com.databricks.spark.avro", SaveMode.ErrorIfExists);
{code}

Here is the Spark plan (it may help to figure out the cause of the issue):

== Parsed Logical Plan ==
Limit 999
 Relation[color#0,id#1,type#2,rand#3,junk#4] AvroRelation(hdfs://<server_name>:8020/tmp/hdfs.2016-02-12--10-18-55-171-488/hdfs.2016-02-12--10-19-05-109-895.avro,None,0)

== Analyzed Logical Plan ==
color: string, id: int, type: string, rand: int, junk: string
Limit 999
 Relation[color#0,id#1,type#2,rand#3,junk#4] AvroRelation(hdfs://<server_name>:8020/tmp/hdfs.2016-02-12--10-18-55-171-488/hdfs.2016-02-12--10-19-05-109-895.avro,None,0)

== Optimized Logical Plan ==
InMemoryRelation [color#0,id#1,type#2,rand#3,junk#4], true, 10000, StorageLevel(true, true, false, true, 1), (Limit 999), None

== Physical Plan ==
InMemoryColumnarTableScan [color#0,id#1,type#2,rand#3,junk#4], (InMemoryRelation [color#0,id#1,type#2,rand#3,junk#4], true, 10000, StorageLevel(true, true, false, true, 1), (Limit 999), None)

Code Generation: true




> DataFrame limit operation is not consistent
> -------------------------------------------
>
>                 Key: SPARK-13299
>                 URL: https://issues.apache.org/jira/browse/SPARK-13299
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.3.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>            Reporter: Nazarii Balkovskyi
>              Labels: SparkSQL, dataframe
>         Attachments: SparkLimitIssue.png
>



