[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

2017-07-29 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106168#comment-16106168
 ] 

Li Jin commented on ARROW-1291:
---

I think it's OK not to maintain an exact round-trip conversion between Arrow 
and other data representations. It's inevitable that other representations 
will have some exotic features that Arrow cannot support, and in my opinion 
it's a little too strict to error out in all such cases. Just to provide 
another data point (not saying this is correct, just for reference): the 
Spark/pandas conversion also casts int column names to strings.

> [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric 
> column names
> --
>
> Key: ARROW-1291
> URL: https://issues.apache.org/jira/browse/ARROW-1291
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.5.0
> Reporter: Li Jin
> Assignee: Wes McKinney
> Priority: Minor
> Fix For: 0.6.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame([1])
> pa.RecordBatch.from_pandas(df)
> {code}
> Exception:
> {code}
> TypeError Traceback (most recent call last)
>  in ()
>   3 
>   4 df = pd.DataFrame([1])
> --> 5 pa.RecordBatch.from_pandas(df)
> table.pxi in pyarrow.lib.RecordBatch.from_pandas()
> table.pxi in pyarrow.lib._dataframe_to_arrays()
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
>  in construct_metadata(df, index_levels, preserve_index, types)
> 187 arrow_type=arrow_type
> 188 )
> --> 189 for name, arrow_type in zip(df.columns, df_types)
> 190 ] + (
> 191 [
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
>  in (.0)
> 187 arrow_type=arrow_type
> 188 )
> --> 189 for name, arrow_type in zip(df.columns, df_types)
> 190 ] + (
> 191 [
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py
>  in get_column_metadata(column, name, arrow_type)
> 125 raise TypeError(
> 126 'Column name must be a string. Got column {} of type {}'.format(
> --> 127 name, type(name).__name__
> 128 )
> 129 )
> TypeError: Column name must be a string. Got column 0 of type int64
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

2017-07-29 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106155#comment-16106155
 ] 

Wes McKinney commented on ARROW-1291:
-

PR: https://github.com/apache/arrow/pull/911



[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

2017-07-29 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106151#comment-16106151
 ] 

Wes McKinney commented on ARROW-1291:
-

I'm more in favor of #2, mostly because renaming the columns on a DataFrame 
without destroying the original object will generally double memory use. You 
can assign to {{df.columns}} to avoid this, but {{df.rename(columns=str)}} 
will double memory.
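
As a rough illustration of the two approaches mentioned above, here is a minimal pandas sketch. The sample frame is made up for the example, and whether {{rename}} physically copies the data may depend on the pandas version and its copy settings, so treat the memory remarks in the comments as the claim from this comment rather than a measured result.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])  # default integer labels 0 and 1
original_labels = list(df.columns)

# Approach 1: rename returns a new DataFrame and leaves `df` untouched.
# Per the comment above, this is the variant that can double memory use.
renamed = df.rename(columns=str)

# Approach 2: assign new labels on the existing object.
# This changes `df` itself, so the integer labels are gone afterwards.
df.columns = [str(label) for label in df.columns]
```

After either step the labels are the strings "0" and "1"; the trade-off is between keeping the caller's object unchanged and avoiding a second copy.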



[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

2017-07-28 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105728#comment-16105728
 ] 

Phillip Cloud commented on ARROW-1291:
--

That could work, but then the round trip conversion is no longer exact.

It seems the question is "where should the surprise be?" or maybe "what's 
least surprising to users?", and there are three options:

# Leave the behavior as is; users of Arrow must convert their own column names 
before sending DataFrames to Arrow. This is the current behavior.
# Add casting to strings in one direction, when the input is a DataFrame with 
numeric column names. IMO this is more surprising than an error: when you call 
{{.to_pandas()}} you get back something different from what you put in. It's 
also not easy to tell that it's different by looking at the DataFrame, because 
of the way DataFrames repr.
# Add enough metadata to preserve the current round-trip behavior.

I favor #1 the most, and #3 if we decide it really is necessary to allow 
numeric column names. With #3 we still lose some compatibility with other 
systems that want to read and write data that came from DataFrames, unless 
those systems are willing to handle integer column names.

I think #2 isn't a great option because it results in behavior in the public 
API that isn't obvious unless you know something about how both arrow and 
pandas work.

Additionally, we can't just call {{str}} on every column name and be done; we 
have to make further decisions, such as whether to allow mixed string and 
integer column names. Maybe that's a red herring and we can just say 
"{{Int64Index}} only", but we still have to make that decision as well.
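
One possible reason the mixed-label case needs a decision (this is an illustration added here, not something spelled out in the thread) is that two distinct labels can map to the same string. A minimal sketch in plain Python, with a hypothetical helper name:

```python
from collections import Counter


def stringified_collisions(labels):
    """Return the string names that more than one original label maps to."""
    counts = Counter(str(label) for label in labels)
    return sorted(name for name, count in counts.items() if count > 1)


# The integer 0 and the string "0" are distinct labels, but both become "0".
mixed = [0, "0", 1, "a"]
collisions = stringified_collisions(mixed)
```

A purely integer set of labels produces no collisions under this check, which is the case the "{{Int64Index}} only" restriction would cover.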



[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

2017-07-28 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105716#comment-16105716
 ] 

Li Jin commented on ARROW-1291:
---

+1



[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

2017-07-28 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105640#comment-16105640
 ] 

Wes McKinney commented on ARROW-1291:
-

How about we convert non-string column labels to strings for now and wait and 
see whether there is a real need to preserve the original labels on the way 
back? I think efforts beyond that may fall into the YAGNI category for the 
moment.



[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

2017-07-28 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105491#comment-16105491
 ] 

Li Jin commented on ARROW-1291:
---

The use case I have is that I am passing a user-provided pandas DataFrame to 
Spark using Arrow. In my particular case, I don't care about the column names 
in the pandas DataFrame because the column names are defined in the Spark 
schema, so it's weird to ask people to write out their column names in pandas 
just to throw them away later...

I think it's friendlier to cast numeric column names to strings than to throw 
this exception. My use case is a bit special in that I don't care about the 
column names, so I could do the casting in my code. But I think other users 
might also find the current behavior surprising.

I agree it's probably not worth it for arrow to preserve the numeric column 
names.



[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

2017-07-28 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105308#comment-16105308
 ] 

Phillip Cloud commented on ARROW-1291:
--

I'm -1 on allowing numeric column names since it adds an IMO unnecessary 
coupling to pandas semantics. With such a change, any tool that wants to read 
data out of an arrow array must now consider the origin of the data's column 
names, and cannot simply assume that the columns in the schema are always a 
simple list of strings. I don't think it's easy to make this behavior 
transparent to tools that use arrow, while OTOH a list of strings is easy to 
deal with in pretty much any system that arrow is a part of or will be a part 
of.

Since this is really only useful when doing pandas -> arrow -> pandas, and 
users of pandas can already refer to columns by positional index with 
{{.iloc}}, I'm not convinced we should allow this.
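
For reference, a small pandas sketch of the positional access mentioned above. The frame and its column names are invented for the example; the point is only that string labels do not prevent selecting a column by position.

```python
import pandas as pd

# Column names are plain strings, as an Arrow schema requires.
df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])

# Select all rows of the first column by position rather than by label.
first_column = df.iloc[:, 0]
```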

I think adding metadata for indexes has less far-reaching effects because it's 
an optional feature of pandas that isn't a core part of arrow, while column 
names are non-negotiable.

I don't think it's too much to ask people to explicitly write out their column 
names as strings.

I *am* willing to be convinced though :)



[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

2017-07-28 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105288#comment-16105288
 ] 

Li Jin commented on ARROW-1291:
---

I think stringifying non-string columns is fine. Having metadata containing the 
original column labels sounds good, but I feel it will likely get lost 
somewhere because other systems, for instance Spark SQL, do not support 
non-string column labels.




[jira] [Commented] (ARROW-1291) [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names

2017-07-28 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105153#comment-16105153
 ] 

Wes McKinney commented on ARROW-1291:
-

This is a known limitation because all field names in an Arrow schema must be 
strings. We might consider a default casting behavior (like stringifying 
non-string columns), since it's better than failing. We can always choose to 
persist the original column labels (pickled, if necessary) in the schema 
metadata.
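
To make the idea concrete, here is a minimal sketch of stringifying labels while recording the originals alongside them. It is not pyarrow code: the schema metadata is modeled as a plain dict of bytes keys and values, the key name and helper functions are hypothetical, and JSON is swapped in for pickling, so it assumes the labels are plain Python ints or strings.

```python
import json

# Hypothetical metadata key; not a name defined by Arrow or pandas.
LABELS_KEY = b"original_column_labels"


def stringify_labels(labels):
    """Return string field names plus metadata recording the original labels."""
    names = [str(label) for label in labels]
    # Schema metadata is modeled here as a bytes-to-bytes mapping.
    metadata = {LABELS_KEY: json.dumps(list(labels)).encode("utf-8")}
    return names, metadata


def restore_labels(names, metadata):
    """Return the original labels if they were recorded, else the string names."""
    if LABELS_KEY not in metadata:
        return list(names)
    return json.loads(metadata[LABELS_KEY].decode("utf-8"))


names, metadata = stringify_labels([0, 1, "a"])
restored = restore_labels(names, metadata)
```

A reader that ignores the metadata still sees an ordinary list of string names, and a reader that understands the key can recover the integer labels.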

cc [~cpcloud]
