[ https://issues.apache.org/jira/browse/SPARK-24426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krzysztof Skulski updated SPARK-24426:
--
Description:
I get unexpected results when I cache a DataFrame and then run more than one
grouping on it. The new DataFrames derived from the cached DataFrame with
groupBy look correct on their own, but when I join them to another DataFrame,
the second join seems to add its new column filled with data copied from the
first joined DataFrame. In the example below, userAgentType is correct,
userChannelType is correct, and userOrigin is not.
When I remove the cache from the aggregated DataFrame, it works correctly.
{code:scala}
val aggregated = dataFrame.cache()
val userAgentType = aggregated.groupBy("id", "agentType").count()
  .orderBy(asc("id"), desc("count"))
  .groupBy("id").agg(first("agentType").as("agentType"))
val userChannelType = aggregated.groupBy("id", "channelType").count()
  .orderBy(asc("id"), desc("count"))
  .groupBy("id").agg(first("channelType").as("channelType"))
val userOrigin = userInfo
  .join(userAgentType, Seq("id"), "left")
  .join(userChannelType, Seq("id"), "left")
{code}
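The description compares the cached and uncached variants of the same pipeline, so a side-by-side check could look roughly like the sketch below. The report contains no data, so the sample rows, the `CacheJoinCheck` object, and the `mostFrequent` and `build` helpers are all hypothetical, and there is no guarantee that this small input triggers the problem. Note also that `first` after `orderBy` is not guaranteed to respect the ordering, so the sample keeps a clear most frequent value per id.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{asc, desc, first}

// Hypothetical helper object; names and data are assumptions, not from the report.
object CacheJoinCheck {
  // Picks one value of `column` per id, built the same way as in the report.
  def mostFrequent(df: DataFrame, column: String): DataFrame =
    df.groupBy("id", column).count()
      .orderBy(asc("id"), desc("count"))
      .groupBy("id")
      .agg(first(column).as(column))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-24426-check")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the reporter's dataFrame and userInfo.
    val dataFrame = Seq(
      (1, "browser", "web"),
      (1, "browser", "web"),
      (1, "bot", "api"),
      (2, "bot", "api")
    ).toDF("id", "agentType", "channelType")
    val userInfo = Seq(1, 2).toDF("id")

    // Same two left joins as userOrigin in the report.
    def build(source: DataFrame): DataFrame =
      userInfo
        .join(mostFrequent(source, "agentType"), Seq("id"), "left")
        .join(mostFrequent(source, "channelType"), Seq("id"), "left")

    // Collect the uncached result first: once cache() is called, later plans
    // over the same DataFrame are also served from the cache.
    val uncached = build(dataFrame).orderBy("id").collect().toSeq
    val cached = build(dataFrame.cache()).orderBy("id").collect().toSeq

    println(s"cached and uncached results match: ${uncached == cached}")
    cached.foreach(println)

    // Print the physical plan of the cached variant for inspection.
    build(dataFrame).explain()

    spark.stop()
  }
}
```

If the two results differ, attaching the printed rows and the plan to the issue would make the report easier to reproduce.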
> Unexpected combination of cache and join on DataFrame
> -----------------------------------------------------
>
>                 Key: SPARK-24426
>                 URL: https://issues.apache.org/jira/browse/SPARK-24426
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Krzysztof Skulski
>            Priority: Major