Hi Jiří,

Thanks for your mail. Could you create a JIRA ticket for this?
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

Kind regards,

Herman van Hövell

2016-02-26 15:11 GMT+01:00 Jiří Syrový <syrovy.j...@gmail.com>:

> Hi,
>
> I've recently noticed a bug in Spark (branch 1.6) that appears if you do
> the following.
>
> Let's have some DataFrame called df.
>
> 1) Aggregate multiple columns of the DataFrame df and store the result as
>    result_agg_1.
> 2) Do another aggregation of multiple columns, but with one fewer grouping
>    column, and store the result as result_agg_2.
> 3) Align the result of the second aggregation by adding the missing
>    grouping column with the empty value lit("").
> 4) Union result_agg_1 and result_agg_2.
> 5) Do the projection from "sum(count_column)" to "count_column" for all
>    aggregated columns.
>
> The result is a structurally inconsistent DataFrame that has all the data
> coming from result_agg_1 shifted.
>
> An example of stripped-down code and an example result can be seen here:
>
> https://gist.github.com/xjrk58/e0c7171287ee9bdc8df8
> https://gist.github.com/xjrk58/7a297a42ebb94f300d96
>
> Best,
> Jiri Syrovy