This is definitely wrong. Looking into what's going on here, but this seems
severe enough to be a blocker for the next release.

On Thu, Mar 25, 2021 at 3:39 PM Xinyu Liu <[email protected]> wrote:

> Hi, folks,
>
> I am playing around with the Python Dataframe API, and seemly got an
> schema issue when converting pcollection to dataframe. I wrote the
> following code for a simple test:
>
> import apache_beam as beam
> from apache_beam.dataframe.convert import to_dataframe
> from apache_beam.dataframe.convert import to_pcollection
>
> p = beam.Pipeline()
> data = p | beam.Create([('a', '1111'), ('b', '2222')]) | beam.Map(lambda
> x : beam.Row(word=x[0], val=x[1]))
> _ = data | beam.Map(print)
> p.run()
>
> This shows the following:
> Row(val='1111', word='a') Row(val='2222', word='b')
>
> But if I use to_dataframe() to convert it into a df, seems the schema was
> reversed:
>
> df = to_dataframe(data)
> dataCopy = to_pcollection(df)
> _ = dataCopy | beam.Map(print)
> p.run()
>
> I got:
> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='1111', val='a')
> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='2222', val='b')
>
> Seems now the column 'word' and 'val' is swapped. The problem seems to
> happen during to_dataframe(). If I print out df['word'], I got '1111' and
> '2222'. I am not sure whether I am doing something wrong or there is an
> issue in the schema conversion. Could someone help me take a look?
>
> Thanks, Xinyu
>

Reply via email to