Python Dataframe API issue

Xinyu Liu Thu, 25 Mar 2021 15:39:04 -0700

Hi, folks,

I am playing around with the Python Dataframe API, and seemly got an schema
issue when converting pcollection to dataframe. I wrote the following code
for a simple test:


import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe
from apache_beam.dataframe.convert import to_pcollection

p = beam.Pipeline()
data = p | beam.Create([('a', '1111'), ('b', '2222')]) | beam.Map(lambda x
: beam.Row(word=x[0], val=x[1]))
_ = data | beam.Map(print)
p.run()

This shows the following:
Row(val='1111', word='a') Row(val='2222', word='b')

But if I use to_dataframe() to convert it into a df, seems the schema was
reversed:

df = to_dataframe(data)
dataCopy = to_pcollection(df)
_ = dataCopy | beam.Map(print)
p.run()

I got:
BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='1111', val='a')
BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='2222', val='b')

Seems now the column 'word' and 'val' is swapped. The problem seems to
happen during to_dataframe(). If I print out df['word'], I got '1111' and
'2222'. I am not sure whether I am doing something wrong or there is an
issue in the schema conversion. Could someone help me take a look?

Thanks, Xinyu

Python Dataframe API issue

Reply via email to