[jira] [Comment Edited] (SPARK-12076) countDistinct behaves inconsistently
[ https://issues.apache.org/jira/browse/SPARK-12076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15816938#comment-15816938 ]

Hyukjin Kwon edited comment on SPARK-12076 at 1/11/17 2:44 AM:
---
Would you be able to try this in the current master or Spark 2.1? It is painful to imagine and generate the data needed to reproduce this issue with such a complex query, and even if someone like me manages to set it up, I can't say it is correctly reproduced, because strictly speaking it is unknown whether the data was correct. The SQL component has also changed rapidly, and it might now produce different plans. Worse, someone like me can't be sure what is incorrect in the output. So I suggest we narrow the problem down so that someone can verify it. If you are not able to try this on the current master, we should resolve this either as {{Cannot Reproduce}}, because I suspect no one can reproduce and verify it, or as {{Not A Problem}}, because that resolution "applies to issues or components that have changed radically since it was opened".
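One way to narrow this down, as the comment suggests, is to state the invariant the two queries must share, independently of the missing Parquet data. The sketch below simulates the grouped aggregation in plain Python (hypothetical session/asset/euid keys and slice numbers standing in for the real rows): for any group, slice_count_distinct can never exceed slice_count_total, and it must not depend on which other aggregates are requested.

```python
from collections import defaultdict

# Hypothetical rows standing in for slicePlayed after the inner join:
# (session_id, asset_id, euid, slice_number)
rows = [
    ("s1", "a1", "e1", 0),
    ("s1", "a1", "e1", 1),
    ("s1", "a1", "e1", 1),  # duplicate slice number within the group
    ("s2", "a1", "e1", 0),
]

# Group by the same three keys as the reported groupBy.
groups = defaultdict(list)
for session_id, asset_id, euid, slice_number in rows:
    groups[(session_id, asset_id, euid)].append(slice_number)

# Reference semantics for the aggregation in the bug report.
agg = {
    key: {
        "slice_count_distinct": len(set(nums)),
        "slice_count_total": len(nums),
        "min_slice_number": min(nums),
        "max_slice_number": max(nums),
    }
    for key, nums in groups.items()
}

for key, a in agg.items():
    # The invariant both Spark queries must satisfy on any input.
    assert a["slice_count_distinct"] <= a["slice_count_total"]
```

A small deterministic dataset like this, fed through both of the reporter's queries, would make the discrepancy verifiable without the original files.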
> countDistinct behaves inconsistently
>
> Key: SPARK-12076
> URL: https://issues.apache.org/jira/browse/SPARK-12076
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Paul Zaczkieiwcz
> Priority: Minor
>
> Assume:
> {code:java}
> val slicePlayed: DataFrame = _
> val joinKeys: DataFrame = _
> {code}
> Also assume that all columns beginning with "cdnt_" are from {{slicePlayed}} and all columns beginning with "join_" are from {{joinKeys}}. The following queries can return different values for slice_count_distinct:
> {code:java}
> slicePlayed.join(
>   joinKeys,
>   (
>     $"join_session_id" === $"cdnt_session_id" &&
>     $"join_asset_id" === $"cdnt_asset_id" &&
>     $"join_euid" === $"cdnt_euid"
>   ),
>   "inner"
> ).groupBy(
>   $"cdnt_session_id".as("slice_played_session_id"),
>   $"cdnt_asset_id".as("slice_played_asset_id"),
>   $"cdnt_euid".as("slice_played_euid")
> ).agg(
>   countDistinct($"cdnt_slice_number").as("slice_count_distinct"),
>   count($"cdnt_slice_number").as("slice_count_total"),
>   min($"cdnt_slice_number").as("min_slice_number"),
>   max($"cdnt_slice_number").as("max_slice_number")
> ).show(false)
> {code}
> {code:java}
> slicePlayed.join(
>   joinKeys,
>   (
>     $"join_session_id" === $"cdnt_session_id" &&
>     $"join_asset_id" === $"cdnt_asset_id" &&
>     $"join_euid" === $"cdnt_euid"
>   ),
>   "inner"
> ).groupBy(
>   $"cdnt_session_id".as("slice_played_session_id"),
>   $"cdnt_asset_id".as("slice_played_asset_id"),
>   $"cdnt_euid".as("slice_played_euid")
> ).agg(
>   min($"cdnt_event_time").as("slice_start_time"),
>   min($"cdnt_playing_owner_id").as("slice_played_playing_owner_id"),
>   min($"cdnt_user_ip").as("slice_played_user_ip"),
>   min($"cdnt_user_agent").as("slice_played_user_agent"),
>   min($"cdnt_referer").as("slice_played_referer"),
>   max($"cdnt_event_time").as("slice_end_time"),
>   countDistinct($"cdnt_slice_number").as("slice_count_distinct"),
>   count($"cdnt_slice_number").as("slice_count_total"),
>   min($"cdnt_slice_number").as("min_slice_number"),
>   max($"cdnt_slice_number").as("max_slice_number"),
>   min($"cdnt_is_live").as("is_live")
> ).show(false)
> {code}
> The +only+ difference between the two queries is that I'm adding more columns to the {{agg}} method.
> I can't reproduce by manually creating a DataFrame from {{DataFrame.parallelize}}. The original sources of the DataFrames are Parquet files.
> The explain plans for the two queries are slightly different.
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[cdnt_session_id#23,cdnt_asset_id#5,cdnt_euid#13], functions=[(count(cdnt_slice_number#24L),mode=Final,isDistinct=false),(min(cdnt_slice_number#24L),mode=Final,isDistinct=false),(max(cdnt_slice_number#24L),mode=Final,isDistinct=false),(count(cdnt_slice_number#2
> {code}
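The reported symptom is that widening the {{agg}} list changes slice_count_distinct. In reference semantics the extra min/max aggregates are computed independently per group, so they cannot affect the distinct count. A plain-Python sketch of that expectation (hypothetical data; the event-time column stands in for cdnt_event_time):

```python
# Hypothetical per-group rows after the join: (slice_number, event_time)
group_rows = [(0, 100), (1, 200), (1, 300)]

def narrow_agg(rows):
    """Mirrors the first query: distinct/total counts only."""
    nums = [n for n, _ in rows]
    return {"slice_count_distinct": len(set(nums)),
            "slice_count_total": len(nums)}

def wide_agg(rows):
    """Mirrors the second query: same counts plus extra min/max columns."""
    nums = [n for n, _ in rows]
    times = [t for _, t in rows]
    return {"slice_count_distinct": len(set(nums)),
            "slice_count_total": len(nums),
            "slice_start_time": min(times),
            "slice_end_time": max(times)}

# Adding more aggregate columns must not change the distinct count.
assert (narrow_agg(group_rows)["slice_count_distinct"]
        == wide_agg(group_rows)["slice_count_distinct"])
```

If the two Spark queries disagree on this value for the same input, the difference in the physical plans (distinct aggregation being planned alongside the extra non-distinct aggregates) is the place to look.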