This could be https://issues.apache.org/jira/browse/BEAM-11929

On Thu, Mar 25, 2021 at 4:26 PM Robert Bradshaw <[email protected]> wrote:

> This is definitely wrong. Looking into what's going on here, but this
> seems severe enough to be a blocker for the next release.
>
> On Thu, Mar 25, 2021 at 3:39 PM Xinyu Liu <[email protected]> wrote:
>
>> Hi, folks,
>>
>> I am playing around with the Python Dataframe API, and seemly got an
>> schema issue when converting pcollection to dataframe. I wrote the
>> following code for a simple test:
>>
>> import apache_beam as beam
>> from apache_beam.dataframe.convert import to_dataframe
>> from apache_beam.dataframe.convert import to_pcollection
>>
>> p = beam.Pipeline()
>> data = p | beam.Create([('a', '1111'), ('b', '2222')]) | beam.Map(lambda
>> x : beam.Row(word=x[0], val=x[1]))
>> _ = data | beam.Map(print)
>> p.run()
>>
>> This shows the following:
>> Row(val='1111', word='a') Row(val='2222', word='b')
>>
>> But if I use to_dataframe() to convert it into a df, seems the schema was
>> reversed:
>>
>> df = to_dataframe(data)
>> dataCopy = to_pcollection(df)
>> _ = dataCopy | beam.Map(print)
>> p.run()
>>
>> I got:
>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='1111', val='a')
>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='2222', val='b')
>>
>> Seems now the column 'word' and 'val' is swapped. The problem seems to
>> happen during to_dataframe(). If I print out df['word'], I got '1111' and
>> '2222'. I am not sure whether I am doing something wrong or there is an
>> issue in the schema conversion. Could someone help me take a look?
>>
>> Thanks, Xinyu
>>
>

Reply via email to