This is definitely wrong. Looking into what's going on here, but this seems severe enough to be a blocker for the next release.
On Thu, Mar 25, 2021 at 3:39 PM Xinyu Liu <[email protected]> wrote: > Hi, folks, > > I am playing around with the Python Dataframe API, and seemly got an > schema issue when converting pcollection to dataframe. I wrote the > following code for a simple test: > > import apache_beam as beam > from apache_beam.dataframe.convert import to_dataframe > from apache_beam.dataframe.convert import to_pcollection > > p = beam.Pipeline() > data = p | beam.Create([('a', '1111'), ('b', '2222')]) | beam.Map(lambda > x : beam.Row(word=x[0], val=x[1])) > _ = data | beam.Map(print) > p.run() > > This shows the following: > Row(val='1111', word='a') Row(val='2222', word='b') > > But if I use to_dataframe() to convert it into a df, seems the schema was > reversed: > > df = to_dataframe(data) > dataCopy = to_pcollection(df) > _ = dataCopy | beam.Map(print) > p.run() > > I got: > BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='1111', val='a') > BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='2222', val='b') > > Seems now the column 'word' and 'val' is swapped. The problem seems to > happen during to_dataframe(). If I print out df['word'], I got '1111' and > '2222'. I am not sure whether I am doing something wrong or there is an > issue in the schema conversion. Could someone help me take a look? > > Thanks, Xinyu >
