[ https://issues.apache.org/jira/browse/BEAM-11393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Beam JIRA Bot updated BEAM-11393:
---------------------------------
    Labels:   (was: stale-P2)

> Support grouping by a Series
> ----------------------------
>
>                 Key: BEAM-11393
>                 URL: https://issues.apache.org/jira/browse/BEAM-11393
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Brian Hulette
>            Priority: P3
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Grouping by a Series (e.g. {{df.groupby(df.column)}},
> {{series.groupby(other_series)}}) does not work. The previous implementation
> relied on aligning the index between the two deferred frames, but one or both
> frames may have duplicate values in their index, which leads to the following
> error at execution time:
> {code}
> Traceback (most recent call last):
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 237, in fix
>     computed = self.compute(to_compute)
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 195, in compute_using_session
>     return {
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/doctests.py", line 196, in <dictcomp>
>     name: frame._expr.evaluate_at(session)
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in evaluate_at
>     return self._func(*(session.evaluate(arg) for arg in self._args))
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in <genexpr>
>     return self._func(*(session.evaluate(arg) for arg in self._args))
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 144, in evaluate
>     result = evaluate_with(input_partitioning)
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 114, in evaluate_with
>     results.append(session.evaluate(expr))
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 42, in evaluate
>     self._bindings[expr] = expr.evaluate_at(self)
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/expressions.py", line 329, in evaluate_at
>     return self._func(*(session.evaluate(arg) for arg in self._args))
>   File "/usr/local/google/home/bhulette/working_dir/beam/sdks/python/apache_beam/dataframe/frames.py", line 149, in set_index
>     df, by = df.align(by, axis=0, join='inner')
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/frame.py", line 3962, in align
>     return super().align(
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/generic.py", line 8559, in align
>     return self._align_series(
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/generic.py", line 8681, in _align_series
>     fdata = fdata.reindex_indexer(join_index, lidx, axis=1)
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1276, in reindex_indexer
>     self.axes[axis]._can_reindex(indexer)
>   File "/usr/local/google/home/bhulette/.pyenv/versions/beam/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3289, in _can_reindex
>     raise ValueError("cannot reindex from a duplicate axis")
> ValueError: cannot reindex from a duplicate axis
> {code}
> Discovered in https://github.com/apache/beam/pull/13401, GHA run:
> https://github.com/apache/beam/runs/1445605501

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
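
The failure condition described in the issue can be illustrated in plain pandas, outside of Beam. The sketch below is not Beam's implementation: the frame `df`, the key Series `by`, and their values are hypothetical, chosen only so that both indexes contain duplicate labels in a different order. It makes the same `align` call that appears in the traceback and records whether pandas accepts it. The exact error text depends on the installed pandas version, so the code reports the outcome rather than assuming one.

```python
import pandas as pd

# Hypothetical data: both objects carry duplicate index labels,
# and the labels appear in a different order in each.
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=[0, 0, 1, 1])
by = pd.Series(["a", "b", "a", "b"], index=[1, 1, 0, 0])

# True when neither index can identify a row by label alone.
has_duplicates = (not df.index.is_unique) and (not by.index.is_unique)

# Same call as in the traceback (frames.py, set_index).
try:
    df.align(by, axis=0, join="inner")
    outcome = "aligned without error"
except ValueError as exc:
    # The message wording differs between pandas versions.
    outcome = f"ValueError: {exc}"

print("duplicate labels in both indexes:", has_duplicates)
print("align outcome:", outcome)
```

With duplicate labels, a label such as `0` matches more than one row on each side, so label-based alignment cannot tell which key belongs to which row. That ambiguity is what the issue attributes the failure to.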