Thanks for the feedback and the bug report Xinyu! I really appreciate it.

Brian

On Thu, Mar 25, 2021 at 6:04 PM Xinyu Liu <[email protected]> wrote:

> Np, thanks for quickly identifying the fix.
>
> Btw, I am very happy about Beam Python supporting the same Pandas
> dataframe api. It's super user-friendly to both devs and data scientists.
> Really cool work!
>
> Thanks,
> Xinyu
>
> On Thu, Mar 25, 2021 at 4:53 PM Robert Bradshaw <[email protected]> wrote:
>
>> Thanks, Xinyu, for finding this!
>>
>> On Thu, Mar 25, 2021 at 4:48 PM Kenneth Knowles <[email protected]> wrote:
>>
>>> Cloned to https://issues.apache.org/jira/browse/BEAM-12056
>>>
>>> On Thu, Mar 25, 2021 at 4:46 PM Brian Hulette <[email protected]> wrote:
>>>
>>>> Yes this looks like https://issues.apache.org/jira/browse/BEAM-11929,
>>>> I removed it from the release blockers since there is a workaround (use a
>>>> NamedTuple type), but it's probably worth cherrypicking the fix.
>>>>
>>>> On Thu, Mar 25, 2021 at 4:44 PM Robert Bradshaw <[email protected]> wrote:
>>>>
>>>>> This could be https://issues.apache.org/jira/browse/BEAM-11929
>>>>>
>>>>> On Thu, Mar 25, 2021 at 4:26 PM Robert Bradshaw <[email protected]> wrote:
>>>>>
>>>>>> This is definitely wrong. Looking into what's going on here, but this
>>>>>> seems severe enough to be a blocker for the next release.
>>>>>>
>>>>>> On Thu, Mar 25, 2021 at 3:39 PM Xinyu Liu <[email protected]> wrote:
>>>>>>
>>>>>>> Hi, folks,
>>>>>>>
>>>>>>> I am playing around with the Python Dataframe API, and seemingly got a
>>>>>>> schema issue when converting pcollection to dataframe. I wrote the
>>>>>>> following code for a simple test:
>>>>>>>
>>>>>>> import apache_beam as beam
>>>>>>> from apache_beam.dataframe.convert import to_dataframe
>>>>>>> from apache_beam.dataframe.convert import to_pcollection
>>>>>>>
>>>>>>> p = beam.Pipeline()
>>>>>>> data = p | beam.Create([('a', '1111'), ('b', '2222')]) | beam.Map(
>>>>>>>     lambda x : beam.Row(word=x[0], val=x[1]))
>>>>>>> _ = data | beam.Map(print)
>>>>>>> p.run()
>>>>>>>
>>>>>>> This shows the following:
>>>>>>> Row(val='1111', word='a')
>>>>>>> Row(val='2222', word='b')
>>>>>>>
>>>>>>> But if I use to_dataframe() to convert it into a df, seems the
>>>>>>> schema was reversed:
>>>>>>>
>>>>>>> df = to_dataframe(data)
>>>>>>> dataCopy = to_pcollection(df)
>>>>>>> _ = dataCopy | beam.Map(print)
>>>>>>> p.run()
>>>>>>>
>>>>>>> I got:
>>>>>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='1111', val='a')
>>>>>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='2222', val='b')
>>>>>>>
>>>>>>> Seems now the columns 'word' and 'val' are swapped. The problem seems
>>>>>>> to happen during to_dataframe(). If I print out df['word'], I got '1111'
>>>>>>> and '2222'. I am not sure whether I am doing something wrong or there
>>>>>>> is an issue in the schema conversion. Could someone help me take a look?
>>>>>>>
>>>>>>> Thanks, Xinyu
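
For anyone hitting the same issue, the workaround mentioned above (use a NamedTuple type instead of beam.Row) might look roughly like the sketch below. The names WordVal, to_word_val and run_roundtrip are made up for illustration and do not come from the thread or from Beam. The explicit RowCoder registration is an assumption based on the Beam schema documentation and may not be needed on newer releases. This has not been verified against the affected Beam version.

```python
import typing


class WordVal(typing.NamedTuple):
    """Hypothetical record type with an explicit field order: word, then val."""
    word: str
    val: str


def to_word_val(pair):
    """Convert a (word, val) tuple into the named, typed record."""
    return WordVal(word=pair[0], val=pair[1])


def run_roundtrip():
    """Round-trip a PCollection through the DataFrame API (needs apache_beam)."""
    # Imported lazily so the definitions above stay usable without Beam installed.
    import apache_beam as beam
    from apache_beam.dataframe.convert import to_dataframe, to_pcollection

    # Assumption: register a RowCoder so Beam treats WordVal as a schema type;
    # this may be unnecessary on newer Beam releases.
    beam.coders.registry.register_coder(WordVal, beam.coders.RowCoder)

    with beam.Pipeline() as p:
        data = (
            p
            | beam.Create([('a', '1111'), ('b', '2222')])
            | beam.Map(to_word_val).with_output_types(WordVal))
        df = to_dataframe(data)
        # Print the elements after the PCollection -> DataFrame -> PCollection trip.
        _ = to_pcollection(df) | beam.Map(print)
```

Calling run_roundtrip() runs the same round trip as the original test, but with the field order declared once in the NamedTuple rather than inferred from beam.Row keyword arguments.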
