[ 
https://issues.apache.org/jira/browse/SPARK-11009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saif Addin Ellafi updated SPARK-11009:
--------------------------------------
    Description: 
This issue happens when submitting the job to a standalone cluster. I have not 
tried YARN or Mesos. Repartitioning the DataFrame into a single partition, or 
setting default parallelism to 1, does not fix the issue. I also tried having 
only one node in the cluster, with the same result. Other shuffle configuration 
changes do not alter the results either.

The issue does NOT happen with --master local[*].

        import org.apache.spark.sql.expressions.Window
        import org.apache.spark.sql.functions.rowNumber

        val ws = Window.
            partitionBy("client_id").
            orderBy("date")

        val nm = "repeatMe"
        val withRn = df.select(df.col("*"), rowNumber().over(ws).as(nm))

        withRn.filter(withRn(nm).isNotNull).orderBy(nm).take(50).foreach(println)
 
--->
 
Schema: client_id: Long, date: DateType, repeatMe: Int
[219483904822,2006-06-01,-1863462909]
[219483904822,2006-09-01,-1863462909]
[219483904822,2007-01-01,-1863462909]
[219483904822,2007-08-01,-1863462909]
[219483904822,2007-07-01,-1863462909]
[192489238423,2007-07-01,-1863462774]
[192489238423,2007-02-01,-1863462774]
[192489238423,2006-11-01,-1863462774]
[192489238423,2006-08-01,-1863462774]
[192489238423,2007-08-01,-1863462774]
[192489238423,2006-09-01,-1863462774]
[192489238423,2007-03-01,-1863462774]
[192489238423,2006-10-01,-1863462774]
[192489238423,2007-05-01,-1863462774]
[192489238423,2006-06-01,-1863462774]
[192489238423,2006-12-01,-1863462774]
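
For reference, rowNumber() over this window should assign 1-based sequential indices within each client_id partition ordered by date, never negative values. A plain-Scala sketch of the expected semantics (hypothetical sample data, not the Spark implementation):

```scala
// Plain-Scala model of rowNumber().over(Window.partitionBy("client_id").orderBy("date")).
// Sample data is hypothetical; this only illustrates the expected numbering.
case class Rec(clientId: Long, date: String)

def expectedRowNumbers(rows: Seq[Rec]): Seq[(Rec, Int)] =
  rows.groupBy(_.clientId).values.toSeq.flatMap { part =>
    // Within each partition, sort by date (ISO strings sort chronologically)
    // and number rows starting at 1.
    part.sortBy(_.date).zipWithIndex.map { case (r, i) => (r, i + 1) }
  }

val sample = Seq(
  Rec(1L, "2006-09-01"), Rec(1L, "2006-06-01"),
  Rec(2L, "2007-01-01")
)
// Rec(1,2006-06-01) gets 1, Rec(1,2006-09-01) gets 2, Rec(2,2007-01-01) gets 1.
expectedRowNumbers(sample).foreach(println)
```

Every assigned index is a positive Int, which is what makes the large negative values in the cluster-mode output above anomalous.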


  was:
This issue happens when submitting the job to a standalone cluster. I have not 
tried YARN or Mesos. Repartitioning the DataFrame into a single partition, or 
setting default parallelism to 1, does not fix the issue. I also tried having 
only one node in the cluster, with the same result. Other shuffle configuration 
changes do not alter the results either.

The issue does NOT happen with --master local[*].

        import org.apache.spark.sql.expressions.Window
        import org.apache.spark.sql.functions.rowNumber

        val ws = Window.
            partitionBy("client_id").
            orderBy("date")

        val nm = "repeatMe"
        val withRn = df.select(df.col("*"), rowNumber().over(ws).as(nm))

        withRn.filter(withRn(nm).isNotNull).orderBy(nm).take(50).foreach(println)
 
--->
 
Schema: client_id: Long, date: DateType, repeatMe: Int
[200000000003,2006-06-01,-1863462909]
[200000000003,2006-09-01,-1863462909]
[200000000003,2007-01-01,-1863462909]
[200000000003,2007-08-01,-1863462909]
[200000000003,2007-07-01,-1863462909]
[200000000138,2007-07-01,-1863462774]
[200000000138,2007-02-01,-1863462774]
[200000000138,2006-11-01,-1863462774]
[200000000138,2006-08-01,-1863462774]
[200000000138,2007-08-01,-1863462774]
[200000000138,2006-09-01,-1863462774]
[200000000138,2007-03-01,-1863462774]
[200000000138,2006-10-01,-1863462774]
[200000000138,2007-05-01,-1863462774]
[200000000138,2006-06-01,-1863462774]
[200000000138,2006-12-01,-1863462774]



> RowNumber in HiveContext returns negative values in cluster mode
> ----------------------------------------------------------------
>
>                 Key: SPARK-11009
>                 URL: https://issues.apache.org/jira/browse/SPARK-11009
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.1
>         Environment: Standalone cluster mode
> No Hadoop/Hive is present in the environment (no hive-site.xml); only 
> HiveContext is used. Spark was built with Hadoop 2.6.0.
> Default Spark configuration variables.
> The cluster has 4 nodes, but the issue happens with any number of nodes.
>            Reporter: Saif Addin Ellafi
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
