Re: Python Dataframe API issue

Brian Hulette Mon, 29 Mar 2021 08:40:45 -0700

Thanks for the feedback and the bug report, Xinyu! I really appreciate it.

Brian


On Thu, Mar 25, 2021 at 6:04 PM Xinyu Liu <[email protected]> wrote:

> Np, thanks for quickly identifying the fix.
>
> Btw, I am very happy that Beam Python supports the same pandas
> DataFrame API. It's super user-friendly for both devs and data scientists.
> Really cool work!
>
> Thanks,
> Xinyu
>
> On Thu, Mar 25, 2021 at 4:53 PM Robert Bradshaw <[email protected]>
> wrote:
>
>> Thanks, Xinyu, for finding this!
>>
>> On Thu, Mar 25, 2021 at 4:48 PM Kenneth Knowles <[email protected]> wrote:
>>
>>> Cloned to https://issues.apache.org/jira/browse/BEAM-12056
>>>
>>> On Thu, Mar 25, 2021 at 4:46 PM Brian Hulette <[email protected]>
>>> wrote:
>>>
>>>> Yes, this looks like https://issues.apache.org/jira/browse/BEAM-11929.
>>>> I removed it from the release blockers since there is a workaround (use a
>>>> NamedTuple type), but it's probably worth cherry-picking the fix.
>>>>
>>>> On Thu, Mar 25, 2021 at 4:44 PM Robert Bradshaw <[email protected]>
>>>> wrote:
>>>>
>>>>> This could be https://issues.apache.org/jira/browse/BEAM-11929
>>>>>
>>>>> On Thu, Mar 25, 2021 at 4:26 PM Robert Bradshaw <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> This is definitely wrong. Looking into what's going on here, but this
>>>>>> seems severe enough to be a blocker for the next release.
>>>>>>
>>>>>> On Thu, Mar 25, 2021 at 3:39 PM Xinyu Liu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, folks,
>>>>>>>
>>>>>>> I am playing around with the Python DataFrame API, and I seem to have hit a
>>>>>>> schema issue when converting a PCollection to a DataFrame. I wrote the
>>>>>>> following code for a simple test:
>>>>>>>
>>>>>>> import apache_beam as beam
>>>>>>> from apache_beam.dataframe.convert import to_dataframe
>>>>>>> from apache_beam.dataframe.convert import to_pcollection
>>>>>>>
>>>>>>> p = beam.Pipeline()
>>>>>>> data = p | beam.Create([('a', '1111'), ('b', '2222')]) | beam.Map(
>>>>>>>     lambda x: beam.Row(word=x[0], val=x[1]))
>>>>>>> _ = data | beam.Map(print)
>>>>>>> p.run()
>>>>>>>
>>>>>>> This shows the following:
>>>>>>> Row(val='1111', word='a')
>>>>>>> Row(val='2222', word='b')
>>>>>>>
>>>>>>> But if I use to_dataframe() to convert it into a df, the schema
>>>>>>> seems to be reversed:
>>>>>>>
>>>>>>> df = to_dataframe(data)
>>>>>>> dataCopy = to_pcollection(df)
>>>>>>> _ = dataCopy | beam.Map(print)
>>>>>>> p.run()
>>>>>>>
>>>>>>> I got:
>>>>>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='1111', val='a')
>>>>>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='2222', val='b')
>>>>>>>
>>>>>>> It seems the values of columns 'word' and 'val' are now swapped. The problem
>>>>>>> seems to happen during to_dataframe(): if I print out df['word'], I get
>>>>>>> '1111' and '2222'. I am not sure whether I am doing something wrong or there
>>>>>>> is an issue in the schema conversion. Could someone help me take a look?
>>>>>>>
>>>>>>> Thanks, Xinyu
>>>>>>>
>>>>>>
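
For reference, the workaround mentioned in the thread (use a NamedTuple type instead of beam.Row) might look roughly like the sketch below. The names WordVal, to_word_val and build_pipeline are made up for illustration and do not come from the thread, and whether this avoids the swap on a given Beam release is an assumption based on the comment about BEAM-11929, not something verified here.

```python
import typing


class WordVal(typing.NamedTuple):
    """Hypothetical schema type; field order is fixed by the declaration."""
    word: str
    val: str


def to_word_val(pair):
    # Build the record by keyword so each value is tied to its field name.
    return WordVal(word=pair[0], val=pair[1])


def build_pipeline():
    # Beam is imported lazily so the definitions above can be used without it.
    import apache_beam as beam
    from apache_beam.dataframe.convert import to_dataframe
    from apache_beam.dataframe.convert import to_pcollection

    p = beam.Pipeline()
    data = (
        p
        | beam.Create([('a', '1111'), ('b', '2222')])
        # Declare the NamedTuple as the output type so Beam can infer a schema.
        | beam.Map(to_word_val).with_output_types(WordVal))
    df = to_dataframe(data)
    data_copy = to_pcollection(df)
    _ = data_copy | beam.Map(print)
    return p
```

Calling build_pipeline().run().wait_until_finish() would execute the round trip through to_dataframe() and to_pcollection(), so the printed rows can be compared against the original pairs.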
