Re: Spark groupby and agg inconsistent and missing data

2015-12-10 Thread Kapil Raaj
Hi Folks,

I am also seeing a similar issue:

(df.groupBy("email").agg(last("user_id") as "user_id").select("user_id").count,
 df.groupBy("email").agg(last("user_id") as "user_id").select("user_id").distinct.count)

When run on one machine it gives: (15123144, 15123144)

When run on the cluster it gives: (15123144, 24)

The first result is expected and looks correct, but the second one is horribly
wrong. One more observation: even if I change the data so that the total count is
more or less than 15123144, I still get distinct = 24 on the cluster. Any clue, a JIRA
ticket, or a fix I can apply for now?
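
One thing I plan to try on my side: last() inside agg carries no ordering
guarantee, so the user_id it picks per email can depend on partitioning, which
differs between a single machine and the cluster. A rough way to rule that out
is to swap in a deterministic aggregate such as max (just a diagnostic, not a
real fix, since max is not "the last" user_id):

import org.apache.spark.sql.functions.max

// max is deterministic, unlike last, so any remaining local/cluster
// difference cannot be blamed on aggregation ordering.
val byMax = df.groupBy("email").agg(max("user_id") as "user_id").select("user_id")

(byMax.count, byMax.distinct.count)

If the cluster still reports distinct = 24 with a deterministic aggregate, the
problem is probably upstream of the groupBy rather than in last() itself.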

On Thu, Oct 22, 2015 at 9:59 PM,  wrote:

> Never mind my last email. res2 is filtered, so my test does not make sense.
> The issue is not reproduced there. I have the problem somewhere else.

-- 
-Kapil Rajak <http://cse.iitkgp.ac.in/~kdkr/>


RE: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Never mind my last email. res2 is filtered, so my test does not make sense. The
issue is not reproduced there. I have the problem somewhere else.
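
While I keep digging, one thing I will try for the original reload
inconsistency is persisting the source DataFrame right after reading it and
forcing it to materialize, so that every later action runs against the same
cached input. A rough sketch only (the read path is just a placeholder):

import org.apache.spark.sql.functions.count
import org.apache.spark.storage.StorageLevel

// Placeholder path; the real source is the split JSON files.
val data = sqlContext.read.json("/path/to/json/*")
  .persist(StorageLevel.MEMORY_AND_DISK)

data.count()  // force materialization before any groupBy/agg

val grouped = data
  .groupBy("product", "band", "aget", "vine", "time", "mm")
  .agg(count($"account_id").as("N"))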



RE: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Saif.A.Ellafi
Thanks. Sorry, I cannot share the data, and I am not sure how significant it would
be for you anyway.
I am reproducing the issue on a smaller piece of the content to see whether I can
find a reason for the inconsistency.

val res2 = data.filter($"closed" === $"ever_closed")
  .groupBy("product", "band", "aget", "vine", "time", "mm")
  .agg(count($"account_id").as("N"),
       sum($"balance").as("balance"),
       sum($"spend").as("spend"),
       sum($"payment").as("payment"))
  .persist()

Then I collect the distinct values of “vine” (which is StringType) from both data
and res2, and res2 is missing a lot of values:

val t1 = res2.select("vine").distinct.collect
scala> t1.size
res10: Int = 617

val t_real = data.select("vine").distinct.collect
scala> t_real.size
res9: Int = 639
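
To check whether the closed === ever_closed filter alone explains the missing
vine values (rather than the groupBy/agg itself), I can compare the distinct
vines before and after filtering; a rough sketch:

// If every vine value missing from res2 comes from rows dropped by the
// filter, then the groupBy/agg is not losing anything.
val filtered = data.filter($"closed" === $"ever_closed")

val allVines      = data.select("vine").distinct.collect.map(_.getString(0)).toSet
val filteredVines = filtered.select("vine").distinct.collect.map(_.getString(0)).toSet

val droppedByFilter = allVines -- filteredVines
println(droppedByFilter.size)  // 22 (= 639 - 617) would fully explain the gap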



Re: Spark groupby and agg inconsistent and missing data

2015-10-22 Thread Xiao Li
Hi, Saif,

Could you post your code here? It might help others reproduce the errors
and give you a correct answer.

Thanks,

Xiao Li

2015-10-22 8:27 GMT-07:00 :

> Hello everyone,
>
> I am doing some analytics experiments on a 4-server standalone cluster
> in a Spark shell, mostly involving a huge database with groupBy and
> aggregations.
>
> I am picking 6 groupBy columns and returning various aggregated results in
> a DataFrame. The groupBy fields are of two types: most of them are StringType
> and the rest are LongType.
>
> The data source is a DataFrame read from split JSON files. Once the data is
> persisted, the result is consistent, but if I unload the memory and reload
> the data, the groupBy action returns different results, with missing
> data.
>
> Could I be missing something? This is rather serious for my analytics, and I am
> not sure how to properly diagnose this situation.
>
> Thanks,
> Saif
>
>