Re: Spark groupby and agg inconsistent and missing data
Hi Folks,

I am also seeing a similar issue:

(df.groupBy("email").agg(last("user_id") as "user_id").select("user_id").count,
 df.groupBy("email").agg(last("user_id") as "user_id").select("user_id").distinct.count)

When run on a single machine it gives:

(15123144,15123144)

When run on the cluster it gives:

(15123144,24)

The first result is expected and looks correct, but the second is badly wrong. One more observation: even if I change the data so that the total count is more or less than 15123144, I still get distinct = 24 on the cluster.

Any clue? Is there a Jira ticket, or a workaround for now?

On Thu, Oct 22, 2015 at 9:59 PM, wrote:

> nevermind my last email. res2 is filtered so my test does not make sense.
> The issue is not reproduced there. I have the problem somewhere else.
>
> *From:* Ellafi, Saif A.
> *Sent:* Thursday, October 22, 2015 12:57 PM
> *To:* 'Xiao Li'
> *Cc:* user
> *Subject:* RE: Spark groupby and agg inconsistent and missing data
>
> Thanks, sorry I cannot share the data and not sure how much significant it
> will be for you.
>
> I am reproducing the issue on a smaller piece of the content and see
> whether I find a reason on the inconsistency.
>
> val res2 = data.filter($"closed" === $"ever_closed").groupBy("product",
> "band ", "aget", "vine", "time",
> "mm").agg(count($"account_id").as("N"), sum($"balance").as("balance"),
> sum($"spend").as("spend"), sum($"payment").as("payment")).persist()
>
> then I collect distinct values of “vine” (which is StringType) both from
> data and res2, and res2 is missing a lot of values:
>
> val t1 = res2.select("vine").distinct.collect
> scala> t1.size
> res10: Int = 617
>
> val t_real = data.select("vine").distinct.collect
> scala> t_real.size
> res9: Int = 639
>
> *From:* Xiao Li [mailto:gatorsm...@gmail.com]
> *Sent:* Thursday, October 22, 2015 12:45 PM
> *To:* Ellafi, Saif A.
> *Cc:* user
> *Subject:* Re: Spark groupby and agg inconsistent and missing data
>
> Hi, Saif,
>
> Could you post your code here? It might help others reproduce the errors
> and give you a correct answer.
>
> Thanks,
>
> Xiao Li
>
> 2015-10-22 8:27 GMT-07:00 :
>
> Hello everyone,
>
> I am doing some analytics experiments under a 4 server stand-alone cluster
> in a spark shell, mostly involving a huge database with groupBy and
> aggregations.
>
> I am picking 6 groupBy columns and returning various aggregated results in
> a dataframe. GroupBy fields are of two types, most of them are StringType
> and the rest are LongType.
>
> The data source is a splitted json file dataframe, once the data is
> persisted, the result is consistent. But if I unload the memory and reload
> the data, the groupBy action returns different content results, missing
> data.
>
> Could I be missing something? this is rather serious for my analytics, and
> not sure how to properly diagnose this situation.
>
> Thanks,
>
> Saif

--
-Kapil Rajak <http://cse.iitkgp.ac.in/~kdkr/>
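One thing worth ruling out here, offered as an assumption rather than a confirmed diagnosis: last("user_id") has no ordering of its own, so after a shuffle it can return whichever row happens to arrive last in each group. The sketch below models that with plain Scala collections instead of a DataFrame, so it runs without Spark. Row, lastPerGroup and maxPerGroup are hypothetical names for illustration, not Spark API. It shows that a "last" pick changes with input order while a max pick does not. It does not by itself explain the value 24, and the thread does not identify a cause.

```scala
object LastVsMax {
  // Hypothetical stand-in for one (email, user_id) row of df.
  final case class Row(email: String, userId: Long)

  // Models agg(last("user_id")): keeps whichever row comes last in input order.
  def lastPerGroup(rows: Seq[Row]): Map[String, Long] =
    rows.groupBy(_.email).map { case (email, rs) => email -> rs.last.userId }

  // Models agg(max("user_id")): the result does not depend on input order.
  def maxPerGroup(rows: Seq[Row]): Map[String, Long] =
    rows.groupBy(_.email).map { case (email, rs) => email -> rs.map(_.userId).max }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows; the real data in the thread was not shared.
    val rows = Seq(Row("a@x.com", 1L), Row("a@x.com", 2L), Row("b@x.com", 3L))
    val reordered = rows.reverse // same rows, different arrival order

    println(lastPerGroup(rows) == lastPerGroup(reordered)) // false
    println(maxPerGroup(rows) == maxPerGroup(reordered))   // true
  }
}
```

If the two environments disagree only on which user_id is picked per email, swapping last for an order-independent aggregate such as max in the original query is one way to check whether row order is involved.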
RE: Spark groupby and agg inconsistent and missing data
Never mind my last email. res2 is filtered, so my test does not make sense. The issue is not reproduced there; I have the problem somewhere else.
RE: Spark groupby and agg inconsistent and missing data
Thanks. Sorry, I cannot share the data, and I am not sure how significant it would be for you.

I am reproducing the issue on a smaller piece of the content to see whether I can find a reason for the inconsistency.

val res2 = data.filter($"closed" === $"ever_closed").groupBy(
    "product", "band ", "aget", "vine", "time", "mm").agg(
    count($"account_id").as("N"), sum($"balance").as("balance"),
    sum($"spend").as("spend"), sum($"payment").as("payment")).persist()

Then I collect the distinct values of “vine” (which is StringType) from both data and res2, and res2 is missing a lot of values:

val t1 = res2.select("vine").distinct.collect
scala> t1.size
res10: Int = 617

val t_real = data.select("vine").distinct.collect
scala> t_real.size
res9: Int = 639
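A like-for-like version of this check applies the same filter to both sides before comparing distinct values. Otherwise, values that occur only in filtered-out rows will look "missing" from res2. In the thread's terms, that means comparing res2.select("vine").distinct against data.filter($"closed" === $"ever_closed").select("vine").distinct. The minimal sketch below uses plain Scala collections in place of DataFrames so it runs without a cluster. Account, keep, distinctVines, groupedVines and the sample rows are made up for illustration.

```scala
object FilteredDistinctCheck {
  // Hypothetical stand-in for one row of `data`; only the columns needed here.
  final case class Account(vine: String, closed: Boolean, everClosed: Boolean)

  // Same predicate as filter($"closed" === $"ever_closed") in the thread.
  def keep(a: Account): Boolean = a.closed == a.everClosed

  // Distinct "vine" values, like select("vine").distinct.
  def distinctVines(rows: Seq[Account]): Set[String] = rows.map(_.vine).toSet

  // Group keys that survive filter + groupBy, like the "vine" column of res2.
  def groupedVines(rows: Seq[Account]): Set[String] =
    rows.filter(keep).groupBy(_.vine).keySet

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Account("v1", closed = true, everClosed = true),
      Account("v2", closed = false, everClosed = true), // dropped by the filter
      Account("v3", closed = false, everClosed = false)
    )

    // Unfiltered comparison: "v2" looks missing although nothing is wrong.
    println(distinctVines(data) -- groupedVines(data))
    // Like-for-like comparison: nothing is missing.
    println(distinctVines(data.filter(keep)) -- groupedVines(data))
  }
}
```

Only values that are still absent after the second comparison would point to a real problem in the aggregation.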
Re: Spark groupby and agg inconsistent and missing data
Hi, Saif,

Could you post your code here? It might help others reproduce the errors and give you a correct answer.

Thanks,

Xiao Li

2015-10-22 8:27 GMT-07:00 :

> Hello everyone,
>
> I am doing some analytics experiments under a 4 server stand-alone cluster
> in a spark shell, mostly involving a huge database with groupBy and
> aggregations.
>
> I am picking 6 groupBy columns and returning various aggregated results in
> a dataframe. GroupBy fields are of two types, most of them are StringType
> and the rest are LongType.
>
> The data source is a splitted json file dataframe, once the data is
> persisted, the result is consistent. But if I unload the memory and reload
> the data, the groupBy action returns different content results, missing
> data.
>
> Could I be missing something? this is rather serious for my analytics, and
> not sure how to properly diagnose this situation.
>
> Thanks,
> Saif