RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Hi all, would this be a bug??

val ws = Window
  .partitionBy("clrty_id")
  .orderBy("filemonth_dtt")

val nm = "repeatMe"
// assign the result; it is referenced below as stacked_data
val stacked_data = df.select(df.col("*"), rowNumber().over(ws).cast("int").as(nm))

stacked_data.filter(stacked_data(nm).isNotNull)
  .orderBy(nm)
  .take(50)
  .foreach(println)

--->

Column types: Long, DateType, Int
[2003,2006-06-01,-1863462909]
[2003,2006-09-01,-1863462909]
[2003,2007-01-01,-1863462909]
[2003,2007-08-01,-1863462909]
[2003,2007-07-01,-1863462909]
[2138,2007-07-01,-1863462774]
[2138,2007-02-01,-1863462774]
[2138,2006-11-01,-1863462774]
[2138,2006-08-01,-1863462774]
[2138,2007-08-01,-1863462774]
[2138,2006-09-01,-1863462774]
[2138,2007-03-01,-1863462774]
[2138,2006-10-01,-1863462774]
[2138,2007-05-01,-1863462774]
[2138,2006-06-01,-1863462774]
[2138,2006-12-01,-1863462774]


Thanks,
Saif
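[Editor's note, not from the thread: a minimal pure-Scala sketch, with made-up sample rows, of what rowNumber() over partitionBy("clrty_id").orderBy("filemonth_dtt") is supposed to produce: consecutive 1-based positive integers within each partition, never negative as in the output above.]

```scala
// Pure Scala (no Spark): simulate row_number() over a window.
// Sample rows are invented for illustration.
case class Rec(clrtyId: Long, fileMonth: String)

val rows = Seq(
  Rec(2003L, "2006-09-01"), Rec(2003L, "2006-06-01"),
  Rec(2138L, "2007-07-01"), Rec(2138L, "2006-11-01")
)

// Group by the partition key, order by date within each group,
// and number the rows starting from 1.
val numbered: Seq[(Rec, Int)] =
  rows.groupBy(_.clrtyId).toSeq.flatMap { case (_, part) =>
    part.sortBy(_.fileMonth).zipWithIndex.map { case (r, i) => (r, i + 1) }
  }

numbered.foreach(println)  // each row number is a positive Int
```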



Re: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Michael Armbrust
Which version of Spark?



RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
Hi, thanks for looking into it. v1.5.1. I am really worried.
I don't have Hive/Hadoop for real in the environment.

Saif




RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
It turns out this does not happen in local[32] mode. It only happens when
submitting to a standalone cluster. I don't have YARN/Mesos to compare against.

Will keep diagnosing.

Saif
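[Editor's note, not from the thread: a sketch of the two submission modes being compared. The application jar, main class, and master URL are placeholders.]

```shell
# Local mode with 32 threads: row numbers come out correct
spark-submit --master "local[32]" \
  --class com.example.RowNumberRepro repro.jar

# Standalone cluster mode: row numbers come out null/negative
# (spark://master-host:7077 is a placeholder master URL)
spark-submit --master spark://master-host:7077 \
  --class com.example.RowNumberRepro repro.jar
```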





RE: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Saif.A.Ellafi
With repartitioning and default parallelism set to 1, cluster mode is still broken.

So the problem is not the parallelism but cluster mode itself. Something is
wrong with HiveContext in cluster mode.

Saif





Re: RowNumber in HiveContext returns null or negative values

2015-10-08 Thread Michael Armbrust
Can you open a JIRA?
