[jira] [Comment Edited] (SPARK-12076) countDistinct behaves inconsistently

2017-01-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15816938#comment-15816938
 ] 

Hyukjin Kwon edited comment on SPARK-12076 at 1/11/17 2:43 AM:
---

Would you be able to try this on the current master or in Spark 2.1?

It is painful to imagine and generate data that reproduces this issue with 
such a complex query, and even if someone like me manages to reproduce it, I 
can't say it is reproduced correctly, because strictly speaking it is unknown 
whether the original data was correct. The SQL component has also changed 
rapidly since then and might now produce different plans. Worse, someone like 
me can't be sure what exactly is incorrect in the output.

So, I suggest we narrow the problem down to something that someone can verify 
is a problem.

If you are not able to try this on the current master, we should resolve this 
either as {{Cannot Reproduce}}, since I suspect no one can reproduce and 
verify it, or as {{Not A Problem}}, since that resolution "applies to issues 
or components that have changed radically since it was opened".
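To help narrow the problem down as suggested, the expected semantics can be pinned down without Spark at all: a distinct count per group must not change when unrelated aggregates (min/max/count) are computed alongside it. The sketch below checks this invariant on plain Scala collections over hypothetical (session id, slice number) rows; it is a reference for the correct behavior, not Spark's implementation.

```scala
// Reference semantics check (plain Scala, no Spark): the per-group distinct
// count must be identical whether or not other aggregates are requested.
object DistinctCountCheck {
  // Hypothetical miniature of the reported data: (cdnt_session_id, cdnt_slice_number)
  val rows: Seq[(String, Long)] = Seq(("s1", 1L), ("s1", 1L), ("s1", 2L), ("s2", 3L))

  // Aggregation 1: only the distinct count, as in the first query.
  def distinctOnly: Map[String, Int] =
    rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).distinct.size }

  // Aggregation 2: distinct count alongside count/min/max, as in the second query.
  def withExtras: Map[String, (Int, Int, Long, Long)] =
    rows.groupBy(_._1).map { case (k, vs) =>
      val nums = vs.map(_._2)
      k -> ((nums.distinct.size, nums.size, nums.min, nums.max))
    }

  def main(args: Array[String]): Unit = {
    // The extra aggregates must not perturb the distinct count.
    val fromExtras = withExtras.map { case (k, (d, _, _, _)) => k -> d }
    assert(distinctOnly == fromExtras, "distinct counts diverged")
  }
}
```

Running either query over such a tiny, hand-built dataset in the affected Spark version would make any divergence directly verifiable.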



was (Author: hyukjin.kwon):
Would you be able to try this in Spark 2.1?

It is painful to imagine and generate data that reproduces this issue with 
such a complex query, and even if someone like me manages to reproduce it, I 
can't say it is fixed somewhere, because strictly speaking it is unknown 
whether the original data was correct. The SQL component has also changed 
rapidly since then and might now produce different plans. Worse, someone like 
me can't be sure what exactly is incorrect in the output.

So, I suggest we narrow the problem down to something that someone can verify 
is a problem.

If you are not able to try this on the current master, we should resolve this 
either as {{Cannot Reproduce}}, since I suspect no one can reproduce and 
verify it, or as {{Not A Problem}}, since that resolution "applies to issues 
or components that have changed radically since it was opened".


> countDistinct behaves inconsistently
> 
>
> Key: SPARK-12076
> URL: https://issues.apache.org/jira/browse/SPARK-12076
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Paul Zaczkieiwcz
>Priority: Minor
>
> Assume:
> {code:java}
> // default initialization with _ requires var, not val, in Scala
> var slicePlayed: DataFrame = _
> var joinKeys: DataFrame = _
> {code}
> Also assume that all columns beginning with "cdnt_" are from {{slicePlayed}} 
> and all columns beginning with "join_" are from {{joinKeys}}.  The following 
> queries can return different values for slice_count_distinct:
> {code:java}
> slicePlayed.join(
>   joinKeys,
>   ( 
> $"join_session_id" === $"cdnt_session_id" &&
> $"join_asset_id" === $"cdnt_asset_id" &&
> $"join_euid" === $"cdnt_euid"
>   ),
>   "inner"
> ).groupBy(
>   $"cdnt_session_id".as("slice_played_session_id"),
>   $"cdnt_asset_id".as("slice_played_asset_id"),
>   $"cdnt_euid".as("slice_played_euid")
> ).agg(
>   countDistinct($"cdnt_slice_number").as("slice_count_distinct"),
>   count($"cdnt_slice_number").as("slice_count_total"),
>   min($"cdnt_slice_number").as("min_slice_number"),
>   max($"cdnt_slice_number").as("max_slice_number")
> ).show(false)
> {code}
> {code:java}
> slicePlayed.join(
>   joinKeys,
>   ( 
> $"join_session_id" === $"cdnt_session_id" &&
> $"join_asset_id" === $"cdnt_asset_id" &&
> $"join_euid" === $"cdnt_euid"
>   ),
>   "inner"
> ).groupBy(
>   $"cdnt_session_id".as("slice_played_session_id"),
>   $"cdnt_asset_id".as("slice_played_asset_id"),
>   $"cdnt_euid".as("slice_played_euid")
> ).agg(
>   min($"cdnt_event_time").as("slice_start_time"),
>   min($"cdnt_playing_owner_id").as("slice_played_playing_owner_id"),
>   min($"cdnt_user_ip").as("slice_played_user_ip"),
>   min($"cdnt_user_agent").as("slice_played_user_agent"),
>   min($"cdnt_referer").as("slice_played_referer"),
>   max($"cdnt_event_time").as("slice_end_time"),
>   countDistinct($"cdnt_slice_number").as("slice_count_distinct"),
>   count($"cdnt_slice_number").as("slice_count_total"),
>   min($"cdnt_slice_number").as("min_slice_number"),
>   max($"cdnt_slice_number").as("max_slice_number"),
>   min($"cdnt_is_live").as("is_live")
> ).show(false)
> {code}
> The +only+ difference between the two queries is that the second one adds 
> more columns to the {{agg}} method.
> I can't reproduce this by manually building a DataFrame from parallelized 
> data ({{SparkContext.parallelize}}). The original sources of the DataFrames 
> are Parquet files.
> The explain plans for the two queries are slightly different.
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13], 
> functions=[(count(cdnt_slice_number#24L),mode=Final,isDistinct=false),(min(cdnt_slice_number#24L),mode=Final,isDistinct=false),(max(cdnt_slice_number#24L),mode=Final,isDistinct=false),(count(cdnt_slice_number#2

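The truncated physical plans above differ in how the distinct aggregate is scheduled around the other aggregates. One general way a plan change can alter a distinct count in a distributed aggregate is an incorrect merge of partial per-partition results: distinct values must be merged as sets, not as summed counts. The plain-Scala sketch below illustrates that failure mode over hypothetical partitioned slice numbers; it is illustrative only and is not a claim about Spark's actual implementation.

```scala
// Why countDistinct is sensitive to the aggregation plan: when partial results
// are computed per partition, the final merge must union distinct sets. Summing
// per-partition distinct counts instead gives a partitioning-dependent answer.
object PartialAggSketch {
  // Two hypothetical partitions of cdnt_slice_number; slice 2 spans both.
  val partitions: Seq[Seq[Long]] = Seq(Seq(1L, 2L, 2L), Seq(2L, 3L))

  // Correct merge: union the per-partition distinct sets, count once at the end.
  def mergedAsSets: Int = partitions.map(_.toSet).reduce(_ union _).size

  // Incorrect merge: summing per-partition distinct counts double-counts
  // values that appear in more than one partition.
  def mergedAsCounts: Int = partitions.map(_.distinct.size).sum

  def main(args: Array[String]): Unit = {
    // The two strategies disagree exactly because slice 2 spans partitions.
    assert(mergedAsSets != mergedAsCounts)
  }
}
```

A count that depends on partitioning (and hence on the surrounding plan) would match the reported symptom of the distinct count changing when unrelated aggregates are added.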