[ 
https://issues.apache.org/jira/browse/ARROW-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rob updated ARROW-2814:
-----------------------
    Description: 
pa.Table.from_pandas() fails on a DataFrame read from a parquet file that has 
a JSON string in it.  The file that triggers the problem is attached to this 
ticket, and the code below reproduces the error.
h2. Reproducible code

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pd.options.display.max_colwidth = 10000

pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")

panda_table = pq_table.to_pandas()
original_count = len(panda_table)
h2. Fails

table_output = pa.Table.from_pandas(panda_table)

del panda_table['payload']
h3. Works

table_output = pa.Table.from_pandas(panda_table)
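
Deleting columns one at a time is one way to find the culprit. As a rough alternative, the sketch below scans column values for Python dict objects, which is the type named in the error message. The helper name has_dict_values and the sample data are hypothetical; they are not part of pandas or pyarrow, and the real data is only in the attached file.

```python
def has_dict_values(values):
    """Return True if any item in `values` is a Python dict."""
    return any(isinstance(v, dict) for v in values)


# Hypothetical stand-in for the DataFrame columns (not the attached file's data).
sample_columns = {
    "id": [1, 2, 3],
    "payload": [{"a": 1}, None, {"b": [2, 3]}],
}

# Collect the names of columns that contain at least one dict value.
suspect_columns = [
    name for name, values in sample_columns.items() if has_dict_values(values)
]
print(suspect_columns)  # ['payload']
```

With the DataFrame from the reproduction, the same check could presumably be written as [c for c in panda_table.columns if has_dict_values(panda_table[c])], assuming the payload column really holds dict objects.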
h3. payload is the faulty column. Print out data

pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")

panda_table = pq_table.to_pandas()
original_count = len(panda_table)

table_output = pa.Table.from_pandas(panda_table[['payload']])

panda_table[['payload']]
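
If the payload values are Python dicts, as the error message suggests, one possible workaround (an assumption, not a confirmed fix) is to serialize them to JSON strings before calling pa.Table.from_pandas(), since string is among the types the error lists as supported. The helper to_json_if_dict and the sample values below are hypothetical.

```python
import json


def to_json_if_dict(value):
    """Serialize dict values to a JSON string; leave everything else unchanged."""
    if isinstance(value, dict):
        return json.dumps(value, sort_keys=True)
    return value


# Hypothetical sample values standing in for the 'payload' column.
payload = [{"b": 2, "a": 1}, None, "already a string"]
converted = [to_json_if_dict(v) for v in payload]
print(converted)  # ['{"a": 1, "b": 2}', None, 'already a string']
```

Applied to the DataFrame, this might look like panda_table['payload'] = panda_table['payload'].map(to_json_if_dict) ahead of the from_pandas call. Whether that is acceptable depends on whether downstream readers expect a string or a nested type.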

  was:
pa.Table.from_pandas() fails on a DataFrame read from a parquet file that has 
a JSON string in it.  The file that triggers the problem is attached to this 
ticket, and the code below reproduces the error.
h2. Reproducible code

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pd.options.display.max_colwidth = 10000

pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")

panda_table = pq_table.to_pandas()
original_count = len(panda_table)
h2. h3. Fails

table_output = pa.Table.from_pandas(panda_table)

del panda_table['payload']
h3. h3. Works

table_output = pa.Table.from_pandas(panda_table)
h3. h3. payload is the faulty column. Print out data

pq_table = pq.read_table("part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet")

panda_table = pq_table.to_pandas()
original_count = len(panda_table)

table_output = pa.Table.from_pandas(panda_table[['payload']])

panda_table[['payload']]


> Error inferring Arrow type for Python object array. Got Python object of type 
> dict but can only handle these types: string, bool, float, int, date, time, 
> decimal, list, array
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-2814
>                 URL: https://issues.apache.org/jira/browse/ARROW-2814
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: rob
>            Priority: Blocker
>         Attachments: 
> part-00000-8f03690f-736d-43a9-9287-6db9e228d59c.c000.gz.parquet
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
