Re: Python Dataframe API issue

Brian Hulette Thu, 25 Mar 2021 16:46:46 -0700

Yes this looks like https://issues.apache.org/jira/browse/BEAM-11929, I
removed it from the release blockers since there is a workaround (use a
NamedTuple type), but it's probably worth cherrypicking the fix.


On Thu, Mar 25, 2021 at 4:44 PM Robert Bradshaw <[email protected]> wrote:

> This could be https://issues.apache.org/jira/browse/BEAM-11929
>
> On Thu, Mar 25, 2021 at 4:26 PM Robert Bradshaw <[email protected]>
> wrote:
>
>> This is definitely wrong. Looking into what's going on here, but this
>> seems severe enough to be a blocker for the next release.
>>
>> On Thu, Mar 25, 2021 at 3:39 PM Xinyu Liu <[email protected]> wrote:
>>
>>> Hi, folks,
>>>
>>> I am playing around with the Python Dataframe API, and seemly got an
>>> schema issue when converting pcollection to dataframe. I wrote the
>>> following code for a simple test:
>>>
>>> import apache_beam as beam
>>> from apache_beam.dataframe.convert import to_dataframe
>>> from apache_beam.dataframe.convert import to_pcollection
>>>
>>> p = beam.Pipeline()
>>> data = p | beam.Create([('a', '1111'), ('b', '2222')]) | beam.Map(lambda
>>> x : beam.Row(word=x[0], val=x[1]))
>>> _ = data | beam.Map(print)
>>> p.run()
>>>
>>> This shows the following:
>>> Row(val='1111', word='a') Row(val='2222', word='b')
>>>
>>> But if I use to_dataframe() to convert it into a df, seems the schema
>>> was reversed:
>>>
>>> df = to_dataframe(data)
>>> dataCopy = to_pcollection(df)
>>> _ = dataCopy | beam.Map(print)
>>> p.run()
>>>
>>> I got:
>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='1111', val='a')
>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='2222', val='b')
>>>
>>> Seems now the column 'word' and 'val' is swapped. The problem seems to
>>> happen during to_dataframe(). If I print out df['word'], I got '1111' and
>>> '2222'. I am not sure whether I am doing something wrong or there is an
>>> issue in the schema conversion. Could someone help me take a look?
>>>
>>> Thanks, Xinyu
>>>
>>

Re: Python Dataframe API issue

Reply via email to