[jira] [Created] (ARROW-16893) Add quoting style support for pyarrow.csv.WriteOptions
David Lee created ARROW-16893: - Summary: Add quoting style support for pyarrow.csv.WriteOptions Key: ARROW-16893 URL: https://issues.apache.org/jira/browse/ARROW-16893 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 8.0.0 Reporter: David Lee https://issues.apache.org/jira/browse/ARROW-14905 The quoting style option was added for C++, but it is not supported in Python. The pyarrow.csv writer module currently produces a CSV file in which all strings are double quoted, with no option to leave strings unquoted. The C++ default for quoting style is "needed".
"portfolioID","marketValue","notionalMarketValue","weight","notionalWeight"
"ABCXYZ12345",26260.74,0.039716113109573174,26260.74,0.039716113109573174
-- This message was sent by Atlassian Jira (v8.20.7#820007)
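The distinction the ticket asks pyarrow to expose can be illustrated with the stdlib csv module (this is not pyarrow's API, just a sketch of the two behaviors): QUOTE_NONNUMERIC mirrors the current always-quote-strings output, while QUOTE_MINIMAL approximates the C++ "needed" style.

```python
import csv
import io

rows = [{"portfolioID": "ABCXYZ12345", "marketValue": 26260.74}]

def write_csv(quoting):
    # Render the rows to a string with the given quoting policy.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["portfolioID", "marketValue"], quoting=quoting)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Quote every string field (what pyarrow.csv currently produces).
all_strings_quoted = write_csv(csv.QUOTE_NONNUMERIC)
# Quote only when a delimiter/quote/newline requires it (the C++ "needed" default).
only_when_needed = write_csv(csv.QUOTE_MINIMAL)
```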
[jira] [Created] (ARROW-16629) Apache Arrow Flight transport speed improvement for list structures
David Lee created ARROW-16629: - Summary: Apache Arrow Flight transport speed improvement for list structures Key: ARROW-16629 URL: https://issues.apache.org/jira/browse/ARROW-16629 Project: Apache Arrow Issue Type: Improvement Components: FlightRPC Affects Versions: 8.0.0 Reporter: David Lee I just started testing using Arrow Flight to send results from a GraphQL server with FlightServer() running on it. GraphQL defines a schema for your data output which can be mapped to an Arrow schema, so I thought it would make sense to try using Arrow Flight to transport results instead of REST-style JSON records. Arrow Flight was 66% faster in all cases, but it didn't scale as the number of child records increased. I suspect that serializing structs or lists needs some improvement. Here is the discussion I opened, including links to test scripts: [https://github.com/mirumee/ariadne/discussions/867]
10 records: 0.049 seconds faster (80% faster)
1 record: 0.109 seconds faster (66% faster)
10 million records: 54 seconds faster (66% faster)
Also, here is the data structure that is sent across the wire:
pyarrow.Table
data: struct<test_lists: struct<float_list: list<item: double>, int_list: list<item: int64>, length: int64, string_list: list<item: string>, time_spent: double>>
  child 0, test_lists: struct<float_list: list<item: double>, int_list: list<item: int64>, length: int64, string_list: list<item: string>, time_spent: double>
    child 0, float_list: list<item: double>
      child 0, item: double
    child 1, int_list: list<item: int64>
      child 0, item: int64
    child 2, length: int64
    child 3, string_list: list<item: string>
      child 0, item: string
    child 4, time_spent: double
data: [
  -- is_valid: all not null
  -- child 0 type: struct<float_list: list<item: double>, int_list: list<item: int64>, length: int64, string_list: list<item: string>, time_spent: double>
    -- is_valid: all not null
    -- child 0 type: list<item: double>
[[13.500371672273381,17.747395152140353,28.973205439157457,1.361443415643098,19.029191125636135,14.62284718057391,18.44333922481529,7.906278860251386,14.402464768126993,5.826040531772251]]
    -- child 1 type: list<item: int64>
[[23,3,21,15,20,4,10,16,23,25]]
    -- child 2 type: int64
[10]
    -- child 3 type: list<item: string>
[["qypsupwtxy","vrxptpspyt","qpvruwsuqq","ywwpyxrvrt","wswutpxxqv","tsyypstxvv","ytprpqsxsx","wtwsxvprvu","suwtrvqvwp","wtsrwywwty"]]
    -- child 4 type: double
[0]]
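A rough, hypothetical illustration of one reason a columnar wire format helps as record counts grow (plain JSON here, not Arrow Flight itself): a row-oriented payload repeats every field name per record, while a columnar layout sends each field name once.

```python
import json

# Synthetic list-heavy records, loosely shaped like the test data above.
records = [{"int_list": [i, i + 1], "length": 2} for i in range(1000)]

# Columnar pivot of the same data: field names appear once, values are batched.
columnar = {
    "int_list": [r["int_list"] for r in records],
    "length": [r["length"] for r in records],
}

row_bytes = len(json.dumps(records).encode("utf-8"))
columnar_bytes = len(json.dumps(columnar).encode("utf-8"))
```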
[jira] [Comment Edited] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979362#comment-16979362 ] David Lee edited comment on ARROW-1644 at 11/21/19 3:38 PM: The format is valid. [http://jsonlines.org|http://jsonlines.org/] Line delimited json is a better format for data since you can leverage threads to speed up read operations. You also added a comma and bracket incorrectly, which turned valid jsonl into invalid json. They should be outside the curly braces.
{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}
was (Author: davlee1...@yahoo.com): The format is valid. [http://jsonlines.org|http://jsonlines.org/] Line delimited json is a better format for data since you can leverage threads to speed up read operations. You also added a comma and bracket incorrectly which turned valid jsonl to invalid json. They should be outside the curly braces.
> [C++][Parquet] Read and write nested Parquet data with a mix of struct and
> list nesting levels
> --
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Affects Versions: 0.8.0
> Reporter: DB Tsai
> Assignee: Micah Kornfield
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
> We have many nested parquet files generated from Apache Spark for ranking
> problems, and we would like to load them in python for other programs to
> consume.
> The schema looks like
> {code:java}
> root
> |-- profile_id: long (nullable = true)
> |-- country_iso_code: string (nullable = true)
> |-- items: array (nullable = false)
> ||-- element: struct (containsNull = false)
> |||-- show_title_id: integer (nullable = true)
> |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57)
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
> File "", line 1, in
> File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
> File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
> nthreads=nthreads)
> File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
> File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be able to load the nested parquet in pyarrow.
> Any insight about this?
> Thanks.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
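The comment's point about line-delimited JSON can be sketched with the stdlib: each line is a complete, independent JSON document, so a file can be split on newlines first and the lines parsed concurrently.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Two independent jsonl records (the corrected example from the comment above).
jsonl = (
    '{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}\n'
    '{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}\n'
)

# Because lines don't depend on each other, parsing parallelizes trivially.
with ThreadPoolExecutor(max_workers=4) as pool:
    records = list(pool.map(json.loads, jsonl.splitlines()))
```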
[jira] [Comment Edited] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979362#comment-16979362 ] David Lee edited comment on ARROW-1644 at 11/21/19 3:33 PM: The format is valid. [http://jsonlines.org|http://jsonlines.org/] Line delimited json is a better format for data since you can leverage threads to speed up read operations. You also added a comma and bracket incorrectly which turned valid jsonl to invalid json. They should be outside the curly braces. was (Author: davlee1...@yahoo.com): The format is valid. [http://jsonlines.org|http://jsonlines.org/] Line delimited json is a better format for data since you can leverage threads to speed up read operations. You also added a comma incorrectly above which turned valid jsonl to invalid json.
[jira] [Comment Edited] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979362#comment-16979362 ] David Lee edited comment on ARROW-1644 at 11/21/19 3:30 PM: The format is valid. [http://jsonlines.org|http://jsonlines.org/] Line delimited json is a better format for data since you can leverage threads to speed up read operations. You also added a comma incorrectly above which turned valid jsonl to invalid json. was (Author: davlee1...@yahoo.com): The format is valid. [http://jsonlines.org|http://jsonlines.org/] Line delimited json is a better format for data since you can leverage threads to speed up read operations.
[jira] [Comment Edited] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979362#comment-16979362 ] David Lee edited comment on ARROW-1644 at 11/21/19 3:27 PM: The format is valid. [http://jsonlines.org|http://jsonlines.org/] Line delimited json is a better format for data since you can leverage threads to speed up read operations. was (Author: davlee1...@yahoo.com): The format is valid. http://jsonlines.org Line delimited json is a better format for data since you can leverage threads to speed up read operation.
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979362#comment-16979362 ] David Lee commented on ARROW-1644: -- The format is valid. http://jsonlines.org Line delimited json is a better format for data since you can leverage threads to speed up read operations.
[jira] [Commented] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()
[ https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891152#comment-16891152 ] David Lee commented on ARROW-6001: -- Table.from_pydict() in 0.14.1 looks fine. The code I originally reviewed iterated through the ordered dictionary keys instead of the schema field names. Here are some testing samples for to_pylist() and from_pylist():
{code:python}
test_schema = pa.schema([
    pa.field('id', pa.int16()),
    pa.field('struct_test', pa.list_(pa.struct([pa.field("child_id", pa.int16()), pa.field("child_name", pa.string())]))),
    pa.field('list_test', pa.list_(pa.int16()))
])

test_data = [
    {'id': 1, 'struct_test': [{'child_id': 11, 'child_name': '_11'}, {'child_id': 12, 'child_name': '_12'}], 'list_test': [1, 2, 3]},
    {'id': 2, 'struct_test': [{'child_id': 21, 'child_name': '_21'}], 'list_test': [4, 5]}
]

test_tbl = from_pylist(test_data, schema=test_schema)
test_list = to_pylist(test_tbl)
test_tbl
test_list
{code}
> Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve
> pandas.to_dict()
>
> Key: ARROW-6001
> URL: https://issues.apache.org/jira/browse/ARROW-6001
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: David Lee
> Priority: Minor
>
> I noticed that pyarrow.Table.to_pydict() exists, but pyarrow.Table.from_pydict() doesn't exist. There is a proposed ticket to create one, but it doesn't take into account potential mismatches between column order and number of columns.
> I'm including some code I've written which I've been using to handle arrow conversions to ordered dictionaries and lists of dictionaries. I've also included an example where this can be used to speed up pandas.to_dict() by a factor of 6x.
[jira] [Updated] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()
[ https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-6001: - Description: I noticed that pyarrow.Table.to_pydict() exists, but pyarrow.Table.from_pydict() doesn't exist. There is a proposed ticket to create one, but it doesn't take into account potential mismatches between column order and number of columns. I'm including some code I've written which I've been using to handle arrow conversions to ordered dictionaries and lists of dictionaries. I've also included an example where this can be used to speed up pandas.to_dict() by a factor of 6x.
{code:python}
def from_pylist(pylist, names=None, schema=None, safe=True):
    """
    Converts a python list of dictionaries to a pyarrow table
    :param pylist: pylist list of dictionaries
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
                                          safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table, index_columns=None):
    """
    Converts a pyarrow table to a python list of dictionaries
    :param arrow_table: arrow table
    :param index_columns: columns to index
    :return: python list of dictionaries
    """
    pydict = arrow_table.to_pydict()
    if index_columns:
        columns = arrow_table.schema.names
        columns.append("_index")
        pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns])
                   if column == '_index' else pydict[column][row]
                   for column in columns} for row in range(arrow_table.num_rows)]
    else:
        pylist = [{column: pydict[column][row] for column in arrow_table.schema.names}
                  for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    """
    Converts a python ordered dictionary to a pyarrow table
    :param pydict: ordered dictionary
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(pydict[column], safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_indexed_values(arrow_table, index_columns):
    """
    Returns back a set of unique values for a list of columns.
    :param arrow_table: arrow_table
    :param index_columns: list of column names
    :return: set of tuples
    """
    pydict = arrow_table.to_pydict()
    index_set = set([tuple([pydict[index_column][row] for index_column in index_columns])
                     for row in range(arrow_table.num_rows)])
    return index_set
{code}
Here are my benchmarks using pandas to arrow to python vs pandas.to_dict():
{code:python}
# benchmark panda conversion to python objects
print('**benchmark 1 million rows**')
start_time = time.time()
python_df1 = panda_df1.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python: " + str(total_time))

start_time = time.time()
arrow_df1 = pa.Table.from_pandas(panda_df1)
pydict = arrow_df1.to_pydict()
python_df1 = [{column: pydict[column][row] for column in arrow_df1.schema.names}
              for row in range(arrow_df1.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python: " + str(total_time))

print('**benchmark 4 million rows**')
start_time = time.time()
python_df4 = panda_df4.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python: " + str(total_time))

start_time = time.time()
arrow_df4 = pa.Table.from_pandas(panda_df4)
pydict = arrow_df4.to_pydict()
python_df4 = [{column: pydict[column][row] for column in arrow_df4
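The column pivot at the heart of the from_pylist code above can be sketched without pyarrow: each column name becomes a list of values, with None filled in for records that are missing the key (the column-order/missing-column mismatch the ticket describes).

```python
def columns_from_pylist(pylist, names):
    # Pivot a list of row dictionaries into per-column value lists;
    # records missing a key contribute None to that column.
    return {name: [record.get(name) for record in pylist] for name in names}

test_data = [
    {"id": 1, "list_test": [1, 2, 3]},
    {"id": 2},  # missing "list_test" -> None in that column
]
columns = columns_from_pylist(test_data, ["id", "list_test"])
```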
[jira] [Updated] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()
[ https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-6001: - Description: I noticed that pyarrow.Table.to_pydict() exists, but pyarrow.Table.from_pydict() doesn't exist. There is a proposed ticket to create one, but it doesn't take into account potential mismatches between column order and number of columns. I'm including some code I've written which I've been using to handle arrow conversions to ordered dictionaries and lists of dictionaries. I've also included an example where this can be used to speed up pandas.to_dict() by a factor of 6x.
{code:python}
def from_pylist(pylist, names=None, schema=None, safe=True):
    """
    Converts a python list of dictionaries to a pyarrow table
    :param pylist: pylist list of dictionaries
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
                                          safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table, index_columns=None):
    """
    Converts a pyarrow table to a python list of dictionaries
    :param arrow_table: arrow table
    :param index_columns: columns to index
    :return: python list of dictionaries
    """
    pydict = arrow_table.to_pydict()
    if index_columns:
        columns = arrow_table.schema.names
        columns.append("_index")
        pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns])
                   if column == '_index' else pydict[column][row]
                   for column in columns} for row in range(arrow_table.num_rows)]
    else:
        pylist = [{column: pydict[column][row] for column in arrow_table.schema.names}
                  for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    """
    Converts a python ordered dictionary to a pyarrow table
    :param pydict: ordered dictionary
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(pydict[column], safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_indexed_values(arrow_table, index_columns):
    """
    Returns back a set of unique values for a list of columns.
    :param arrow_table: arrow_table
    :param index_columns: list of column names
    :return: set of tuples
    """
    pydict = arrow_table.to_pydict()
    index_set = set([tuple([pydict[index_column][row] for index_column in index_columns])
                     for row in range(arrow_table.num_rows)])
    return index_set
{code}
Here are my benchmarks using pandas to arrow to python vs pandas.to_dict():
{code:python}
# benchmark panda conversion to python objects.
start_time = time.time()
python_df1 = panda_df1.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 1 million rows - " + str(total_time))

start_time = time.time()
python_df4 = panda_df4.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to arrow to python - 1 million rows - " + str(total_time))

start_time = time.time()
arrow_df1 = pa.Table.from_pandas(panda_df1)
pydict = arrow_df1.to_pydict()
python_df1 = [{column: pydict[column][row] for column in arrow_df1.schema.names}
              for row in range(arrow_df1.num_rows)]
total_time = time.time() - start_time
print("pandas to python - 4 million rows - " + str(total_time))

start_time = time.time()
arrow_df4 = pa.Table.from_pandas(panda_df4)
pydict = arrow_df4.to_pydict()
python_df4 = [{column: pydict[column][row] for column in arrow_df4.schema.names} for row
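The inverse pivot used by the to_pylist code above can likewise be sketched without pyarrow: turn a column dictionary (the shape Table.to_pydict() returns) back into row dictionaries.

```python
def rows_from_columns(pydict, names):
    # Rebuild row dictionaries from per-column value lists; assumes all
    # columns have equal length, as a table's columns do.
    num_rows = len(pydict[names[0]])
    return [{name: pydict[name][row] for name in names} for row in range(num_rows)]

pydict = {"id": [1, 2], "weight": [0.25, 0.75]}
rows = rows_from_columns(pydict, ["id", "weight"])
```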
[jira] [Updated] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()
[ https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-6001: - Description: I noticed that pyarrow.Table.to_pydict() exists, but pyarrow.Table.from_pydict() doesn't. There is a proposed ticket to create one, but it doesn't account for potential mismatches between column order and number of columns. I'm including some code I've been using to handle Arrow conversions to ordered dictionaries and lists of dictionaries. I've also included an example where this can be used to speed up pandas.to_dict() by a factor of 6x.

{code:python}
def from_pylist(pylist, names=None, schema=None, safe=True):
    """
    Converts a python list of dictionaries to a pyarrow table
    :param pylist: list of dictionaries
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
                                          safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table


def to_pylist(arrow_table, index_columns=None):
    """
    Converts a pyarrow table to a python list of dictionaries
    :param arrow_table: arrow table
    :param index_columns: columns to index
    :return: python list of dictionaries
    """
    pydict = arrow_table.to_pydict()
    if index_columns:
        columns = arrow_table.schema.names
        columns.append("_index")
        pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns])
                   if column == '_index' else pydict[column][row]
                   for column in columns} for row in range(arrow_table.num_rows)]
    else:
        pylist = [{column: pydict[column][row] for column in arrow_table.schema.names}
                  for row in range(arrow_table.num_rows)]
    return pylist


def from_pydict(pydict, names=None, schema=None, safe=True):
    """
    Converts a python ordered dictionary to a pyarrow table
    :param pydict: ordered dictionary
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(pydict[column], safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe,
                                              type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table


def get_indexed_values(arrow_table, index_columns):
    """
    Returns a set of unique values for a list of columns.
    :param arrow_table: arrow_table
    :param index_columns: list of column names
    :return: set of tuples
    """
    pydict = arrow_table.to_pydict()
    index_set = set([tuple([pydict[index_column][row] for index_column in index_columns])
                     for row in range(arrow_table.num_rows)])
    return index_set
{code}

Here are my benchmarks comparing pandas-to-arrow-to-python against pandas.to_dict():

{code:python}
# benchmark pandas conversion to python objects.
start_time = time.time()
python_df1 = panda_df1.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 1 million rows - " + str(total_time))

start_time = time.time()
python_df4 = panda_df4.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python - 4 million rows - " + str(total_time))

start_time = time.time()
arrow_df1 = pa.Table.from_pandas(panda_df1)
pydict = arrow_df1.to_pydict()
python_df1 = [{column: pydict[column][row] for column in arrow_df1.schema.names} for row in range(arrow_df1.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 1 million rows - " + str(total_time))

start_time = time.time()
arrow_df4 = pa.Table.from_pandas(panda_df4)
pydict = arrow_df4.to_pydict()
python_df4 = [{column: pydict[column][row] for column in arrow_df4.schema.names} for row in range(arrow_df4.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python - 4 million rows - " + str(total_time))
{code}
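The core of the helpers above is a pivot between row-oriented data (a list of dicts) and column-oriented data (a dict of lists), with missing keys filled as None. A minimal pure-Python sketch of that pivot, independent of pyarrow (function names here are illustrative, not part of the proposal):

```python
# Pivot between row-oriented and column-oriented layouts, mirroring the
# from_pylist / to_pylist helpers above (pure Python, no pyarrow needed).

def rows_to_columns(pylist, names):
    # Missing keys become None, matching the helpers' `v[column] if column in v else None`.
    return {name: [row.get(name) for row in pylist] for name in names}

def columns_to_rows(pydict):
    names = list(pydict)
    num_rows = len(pydict[names[0]]) if names else 0
    return [{name: pydict[name][i] for name in names} for i in range(num_rows)]

records = [{"a": 1, "b": "x"}, {"a": 2}]   # second record is missing "b"
cols = rows_to_columns(records, ["a", "b"])
print(cols)                    # {'a': [1, 2], 'b': ['x', None]}
print(columns_to_rows(cols))   # [{'a': 1, 'b': 'x'}, {'a': 2, 'b': None}]
```

The speedup in the benchmarks comes from doing this pivot once over already-materialized columns instead of per-row object construction inside pandas.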
[jira] [Commented] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()
[ https://issues.apache.org/jira/browse/ARROW-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890345#comment-16890345 ] David Lee commented on ARROW-6001: -- Current implementation

> Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()
>
> Key: ARROW-6001
> URL: https://issues.apache.org/jira/browse/ARROW-6001
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: David Lee
> Priority: Minor
[jira] [Created] (ARROW-6001) Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict()
David Lee created ARROW-6001: Summary: Add from_pydict(), from_pylist() and to_pylist() to pyarrow.Table + improve pandas.to_dict() Key: ARROW-6001 URL: https://issues.apache.org/jira/browse/ARROW-6001 Project: Apache Arrow Issue Type: Improvement Reporter: David Lee

I noticed that pyarrow.Table.to_pydict() exists, but there is no pyarrow.Table.from_pydict(). There is a proposed ticket to create one, but it doesn't account for potential mismatches between column order and number of columns. I've attached the helper code I've been using (from_pylist, to_pylist, from_pydict and get_indexed_values, shown in full in the update above), along with an example where this can be used to speed up pandas.to_dict() by a factor of 20x, together with benchmarks comparing pandas-to-arrow-to-python against pandas.to_dict().
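The `get_indexed_values` helper described above reduces to building a set of row tuples over the chosen columns of a column-oriented dict. A pure-Python sketch of that step (no pyarrow; the sample column names are illustrative only):

```python
# Build the set of unique (col1, col2, ...) tuples from a dict of columns,
# mirroring get_indexed_values above (pure Python, no pyarrow needed).

def unique_index_values(pydict, index_columns):
    num_rows = len(pydict[index_columns[0]])
    return {tuple(pydict[col][row] for col in index_columns) for row in range(num_rows)}

# Hypothetical sample data: three rows, two of which share the same index pair.
pydict = {"portfolio": ["A", "A", "B"], "date": ["d1", "d1", "d2"], "mv": [1, 2, 3]}
idx = unique_index_values(pydict, ["portfolio", "date"])
assert idx == {("A", "d1"), ("B", "d2")}
```

Because the tuples are hashable, the resulting set can be used directly for membership tests when deduplicating or merging tables.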
[jira] [Commented] (ARROW-4814) [Python] Exception when writing nested columns that are tuples to parquet
[ https://issues.apache.org/jira/browse/ARROW-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16789782#comment-16789782 ] David Lee commented on ARROW-4814: -- Same issue as https://issues.apache.org/jira/browse/ARROW-1644. I also don't think you can just write a tuple to parquet without defined names for each tuple element. A tuple doesn't really convert into the JSON schema model. [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] > [Python] Exception when writing nested columns that are tuples to parquet > - > > Key: ARROW-4814 > URL: https://issues.apache.org/jira/browse/ARROW-4814 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.1 > Environment: 4.20.8-100.fc28.x86_64 >Reporter: Suvayu Ali >Priority: Major > Labels: pandas, parquet > Attachments: df_to_parquet_fail.py, test.csv > > > I get an exception when I try to write a {{pandas.DataFrame}} to a parquet > file where one of the columns has tuples in them. I use tuples here because > it allows for easier querying in pandas (see ARROW-3806 for a more detailed > description). 
> {code}
> Traceback (most recent call last):
>   File "df_to_parquet_fail.py", line 5, in <module>
>     df.to_parquet("test.parquet")  # crashes
>   File "/home/user/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2203, in to_parquet
>     partition_cols=partition_cols, **kwargs)
>   File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 252, in to_parquet
>     partition_cols=partition_cols, **kwargs)
>   File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 113, in write
>     table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
>   File "pyarrow/table.pxi", line 1141, in pyarrow.lib.Table.from_pandas
>   File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 431, in dataframe_to_arrays
>     convert_types)]
>   File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 430, in <listcomp>
>     for c, t in zip(columns_to_convert,
>   File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 426, in convert_column
>     raise e
>   File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 420, in convert_column
>     return pa.array(col, type=ty, from_pandas=True, safe=safe)
>   File "pyarrow/array.pxi", line 176, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ("Could not convert ('G',) with type tuple: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column ALTS with type object')
> {code}
> The issue may be replicated with the attached script and csv file.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
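Consistent with the comment above, a tuple carries no field names, so one workaround is to give each element an explicit name before handing the column to Arrow. A minimal pure-Python pre-processing sketch; the generated field names (`alt_1`, `alt_2`, ...) and the helper name are hypothetical, not part of any pyarrow API:

```python
# Convert a column of tuples into dicts with explicit field names, since
# Parquet's logical model requires named fields (no anonymous tuples).
# The alt_1/alt_2 naming scheme below is a hypothetical placeholder.

def name_tuple_column(values, prefix="alt"):
    return [{f"{prefix}_{i}": v for i, v in enumerate(t, start=1)} for t in values]

alts = [("G",), ("A", "T")]
print(name_tuple_column(alts))
# [{'alt_1': 'G'}, {'alt_1': 'A', 'alt_2': 'T'}]
```

The named dicts can then be built into a struct (or list) column with an explicit schema instead of relying on type inference, which is what failed in the traceback.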
[jira] [Comment Edited] (ARROW-1644) [Python] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784864#comment-16784864 ] David Lee edited comment on ARROW-1644 at 3/6/19 1:21 AM: -- I've been able to write parquet columns which are lists, but I haven't been able to write a column which is a list of struct(s).

This works:

{code:python}
schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('a', pa.list_(pa.string())),
    pa.field('b', pa.list_(pa.int32()))
])
{code}

This structure isn't supported yet:

{code:python}
schema = pa.schema([
    pa.field('test_id', pa.string()),
    pa.field('testlist', pa.list_(pa.struct([('a', pa.string()), ('b', pa.int32())])))
])

new_records = list()
new_records.append({'test_id': '123', 'testlist': [{'a': 'xyz', 'b': 22}]})
new_records.append({'test_id': '789', 'testlist': [{'a': 'aaa', 'b': 33}]})

arrow_columns = list()
for column in schema.names:
    arrow_columns.append(pa.array([v[column] for v in new_records],
                                  type=schema.types[schema.get_field_index(column)]))

arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)

>>> pq.write_table(arrow_table, "test.parquet")
Traceback (most recent call last):
  File "/proj/pag/python/current/lib/python3.6/site-packages/pyarrow/parquet.py", line 1160, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/proj/pag/python/current/lib/python3.6/site-packages/pyarrow/parquet.py", line 405, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 924, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children
{code}

Supporting structs is the missing piece to being able to save structured JSON as columnar parquet, which would make JSON searchable.
> [Python] Read and write nested Parquet data with a mix of struct and list nesting levels
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Affects Versions: 0.8.0
> Reporter: DB Tsai
> Assignee: Joshua Storck
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
> We have many nested parquet files generated from Apache Spark for ranking problems, and we would like to load them in python for other programs to consume.
> The schema looks like
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- show_title_id: integer (nullable = true)
>  |    |    |-- duration: double (nullable = true)
> {code}
> And when I tried to load it with a nightly build of pyarrow on Oct 4, 2017, I got the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57)
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be able to load the nested parquet in pyarrow.
> Any insight about this?
> Thanks.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
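The error above comes from the nested layout: before a list-of-structs column can be written columnar, it has to be decomposed into flat child arrays plus offsets, which is the representation Arrow uses internally. A hedged, pure-Python sketch of that decomposition, using the two records from the comment (the helper name `shred_list_of_structs` is made up for illustration and is not pyarrow API):

```python
# Pure-Python sketch (no pyarrow): shred a list-of-structs column into
# Arrow-style flat child arrays plus offsets. Illustrative only.

def shred_list_of_structs(records, column, fields):
    """Flatten records[i][column] (a list of dicts) into offsets + child arrays."""
    offsets = [0]  # offsets[i]..offsets[i+1] delimit row i's sublist in the children
    children = {f: [] for f in fields}
    for rec in records:
        for item in rec[column]:
            for f in fields:
                children[f].append(item.get(f))
        offsets.append(offsets[-1] + len(rec[column]))
    return offsets, children

new_records = [
    {'test_id': '123', 'testlist': [{'a': 'xyz', 'b': 22}]},
    {'test_id': '789', 'testlist': [{'a': 'aaa', 'b': 33}]},
]
offsets, children = shred_list_of_structs(new_records, 'testlist', ['a', 'b'])
print(offsets)        # [0, 1, 2]
print(children['a'])  # ['xyz', 'aaa']
print(children['b'])  # [22, 33]
```

Each row's sublist is recoverable by slicing the child arrays with `offsets[i]:offsets[i+1]`, which is why a writer must track both levels of nesting at once.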
[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738614#comment-16738614 ] David Lee commented on ARROW-4032:
--
Tests, with and without safe=False:
{code:python}
my_list = [
    {'a': 'one', 'b': 1},
    {'a': 'two', 'b': 2},
    {'a': 'three', 'b': 3},
    {'a': 'missing', 'b': None}
]

schema = pa.schema([
    pa.field('a', pa.string()),
    pa.field('b', pa.int16())
])

arrow_table = from_pylist(my_list, schema=schema)
arrow_table2 = pa.Table.from_pandas(pd.DataFrame(my_list), preserve_index=False)
arrow_table3 = pa.Table.from_pandas(pd.DataFrame(my_list), schema=schema, preserve_index=False, safe=False)

>>> arrow_table.schema
a: string
b: int16

>>> arrow_table2.schema
a: string
b: double
metadata
OrderedDict([(b'pandas',
              b'{"index_columns": [], "column_indexes": [], "columns": [{"na'
              b'me": "a", "field_name": "a", "pandas_type": "unicode", "nump'
              b'y_type": "object", "metadata": null}, {"name": "b", "field_n'
              b'ame": "b", "pandas_type": "float64", "numpy_type": "float64"'
              b', "metadata": null}], "pandas_version": "0.23.4"}')])

>>> arrow_table3.schema
a: string
b: int16
metadata
OrderedDict([(b'pandas',
              b'{"index_columns": [], "column_indexes": [], "columns": [{"na'
              b'me": "a", "field_name": "a", "pandas_type": "unicode", "nump'
              b'y_type": "object", "metadata": null}, {"name": "b", "field_n'
              b'ame": "b", "pandas_type": "int16", "numpy_type": "float64", '
              b'"metadata": null}], "pandas_version": "0.23.4"}')])
{code}

> [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()
> ----------------------------------------------------------------------------------
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
> Issue Type: Task
> Components: Python
> Reporter: David Lee
> Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists, and there are inherent problems using Pandas with NULL support for Int(s) and Boolean(s):
> http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions
> Sample python code on how this would work:
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000)
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pylist(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
>
> test = from_pylist(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy'])
>
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
>
> test2 = from_pylist(test_list, schema=test_schema)
> {code}
[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738590#comment-16738590 ] David Lee edited comment on ARROW-4032 at 1/9/19 7:42 PM:
--
I've been testing this internally and haven't seen any problems or performance issues. I removed all my pyarrow <-> pandas code, so I no longer have to deal with the numpy problems around types and NULL support.
{code:python}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table


def to_pylist(arrow_table):
    pydict = arrow_table.to_pydict()
    pylist = [{column: pydict[column][row] for column in arrow_table.schema.names} for row in range(arrow_table.num_rows)]
    return pylist


def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(pydict[column], safe=safe, type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table


def get_table_keys(arrow_table, key_columns):
    pydict = arrow_table.to_pydict()
    keys_set = set([tuple([pydict[key_column][row] for key_column in key_columns]) for row in range(arrow_table.num_rows)])
    return keys_set
{code}
[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721986#comment-16721986 ] David Lee edited comment on ARROW-4032 at 12/17/18 7:44 PM:
--
Ended up just writing from_pylist() and to_pylist(). They run much faster than going through pandas.
{code:python}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table


def to_pylist(arrow_table):
    pylist = list()
    for row in range(arrow_table.num_rows):
        pylist.append({arrow_table.schema.names[i]: arrow_table[i][row] for i in range(arrow_table.num_columns)})
    return pylist


def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(pydict[column], safe=safe, type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table
{code}
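The to_pylist() above is the inverse pivot of from_pylist(): column-oriented data back to a list of row dicts. A pure-Python sketch under the assumption that the input is a plain dict of equal-length column lists, as to_pydict() would produce (`to_rows` is an illustrative name, not pyarrow API):

```python
# Pure-Python sketch of the column-to-row pivot behind to_pylist():
# one dict per row, keyed by column name, values read positionally.

def to_rows(pydict, names, num_rows):
    return [{c: pydict[c][row] for c in names} for row in range(num_rows)]

pydict = {'name': ['Tom', 'Mark'], 'age': [10, 5]}
rows = to_rows(pydict, ['name', 'age'], 2)
print(rows[0])  # {'name': 'Tom', 'age': 10}
```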
[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723259#comment-16723259 ] David Lee commented on ARROW-4032:
--
I'll see if I can do a git pull and submit a change.
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist()
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032:
--
Summary: [Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist() (was: [Python] New pyarrow.Table.from_pylist() function)
[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721986#comment-16721986 ] David Lee edited comment on ARROW-4032 at 12/17/18 6:28 PM:

Ended up just writing from_pylist() and to_pylist(). They run much faster than going through pandas.

{code:java}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
                                          safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table):
    pylist = list()
    for row in range(arrow_table.num_rows):
        pylist.append({arrow_table.schema.names[i]: arrow_table[i][row] for i in range(arrow_table.num_columns)})
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_names = list()
    arrow_columns = list()
    for column, values in pydict.items():
        arrow_names.append(column)
        arrow_columns.append(pa.array(values))
    arrow_table = pa.Table.from_arrays(arrow_columns, arrow_names)
    return arrow_table
{code}

was (Author: davlee1...@yahoo.com):

Ended up just writing from_pylist() and to_pylist(). They run much faster than going through pandas.

{code:java}
def from_pylist(pylist, names=None, schema=None, safe=True):
    arrow_columns = list()
    if schema:
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
                                          safe=safe, type = schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table):
    pylist = list()
    for row in range(arrow_table.num_rows):
        pylist.append({arrow_table.schema.names[i]: arrow_table[i][row] for i in range(arrow_table.num_columns)})
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    arrow_names = list()
    arrow_columns = list()
    for column, values in pydict.items():
        arrow_names.append(column)
        arrow_columns.append(pa.array(values))
    arrow_table = pa.Table.from_arrays(arrow_columns, arrow_names)
    return arrow_table
{code}

> [Python] New pyarrow.Table.from_pylist() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
> Issue Type: Task
> Components: Python
> Reporter: David Lee
> Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour,
>                  today.minute, today.second, today.microsecond - today.microsecond % 1000)
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pylist(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
> test = from_pylist(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy'])
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
> test2 = from_pylist(test_list, schema=test_schema)
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
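The sample's microsecond arithmetic (`today.microsecond - today.microsecond % 1000`) is easy to misread: it just rounds the microsecond field down to a whole millisecond so the value fits a `timestamp('ms')` column. A standalone sketch of that truncation, stdlib only (truncate_to_millis is a hypothetical helper name):

```python
from datetime import datetime

def truncate_to_millis(dt: datetime) -> datetime:
    # Drop the sub-millisecond remainder: e.g. 123456 us -> 123000 us.
    return dt.replace(microsecond=dt.microsecond - dt.microsecond % 1000)

dt = datetime(2018, 12, 17, 18, 28, 0, 123456)
print(truncate_to_millis(dt).microsecond)  # 123000
```

Using `datetime.replace()` is tidier than rebuilding the datetime field by field as the quoted sample does, with identical results.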
[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721986#comment-16721986 ] David Lee edited comment on ARROW-4032 at 12/17/18 6:28 PM: Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. {code:java} def from_pylist(pylist, names=None, schema=None, safe=True): arrow_columns = list() if schema: for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type = schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, schema.names) else: for column in names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe)) arrow_table = pa.Table.from_arrays(arrow_columns, names) return arrow_table def to_pylist(arrow_table): pylist = list() for row in range(arrow_table.num_rows): pylist.append({arrow_table.schema.names[i]: arrow_table[i][row] for i in range(arrow_table.num_columns)}) return pylist def from_pydict(pydict, names=None, schema=None, safe=True): arrow_names = list() arrow_columns = list() for column, values in pydict.items(): arrow_names.append(column) arrow_columns.append(pa.array(values)) arrow_table = pa.Table.from_arrays(arrow_columns, arrow_names) return arrow_table{code} was (Author: davlee1...@yahoo.com): Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. 
{code:java} def from_pylist(pylist, schema, safe=True): arrow_columns = list() for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, schema.names) return arrow_table def to_pylist(arrow_table): pylist = list() columns = arrow_table.schema.names rows = len(arrow_table[columns[0]]) for row in range(rows): pylist.append({key: arrow_table[key][row] for key in columns}) return pylist {code}
[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721986#comment-16721986 ] David Lee edited comment on ARROW-4032 at 12/17/18 3:53 PM: Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. {code:java} def from_pylist(pylist, schema, safe=True): arrow_columns = list() for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, schema.names) return arrow_table def to_pylist(arrow_table): pylist = list() columns = arrow_table.schema.names rows = len(arrow_table[columns[0]]) for row in range(rows): pylist.append({key: arrow_table[key][row] for key in columns}) return pylist {code} was (Author: davlee1...@yahoo.com): Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. {code:java} def from_pylist(pylist, schema, safe=True): arrow_columns = list() for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, schema.names) return arrow_table def to_pylist(arrow_table): od = pyarrow.Table.to_pydict(arrow_table) pylist = list() columns = arrow_table.schema.names rows = len(arrow_table[columns[0]]) for row in range(rows): pylist.append({key: arrow_table[key][row] for key in columns}) return pylist {code}
[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721986#comment-16721986 ] David Lee edited comment on ARROW-4032 at 12/15/18 3:58 AM: Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. {code:java} def from_pylist(pylist, schema, safe=True): arrow_columns = list() for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, schema.names) return arrow_table def to_pylist(arrow_table): od = pyarrow.Table.to_pydict(arrow_table) pylist = list() columns = arrow_table.schema.names rows = len(arrow_table[columns[0]]) for row in range(rows): pylist.append({key: arrow_table[key][row] for key in columns}) return pylist {code} was (Author: davlee1...@yahoo.com): Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. {code:java} def from_pylist(pylist, schema, safe=True): arrow_columns = list() for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, schema.names) return arrow_table def to_pylist(arrow_table): od = pyarrow.Table.to_pydict(arrow_table) pylist = list() columns = list(arrow_table.keys()) rows = len(arrow_table[columns[0]]) for row in range(rows): pylist.append({key: arrow_table[key][row] for key in columns}) return pylist {code}
[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721986#comment-16721986 ] David Lee commented on ARROW-4032:
--

Ended up just writing from_pylist() and to_pylist(). They run much faster than going through pandas.

{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist],
                                      safe=safe, type=schema.types[schema.get_field_index(column)]))
    # The original posting passed an undefined name (columns) here; schema.names is what was meant.
    arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    return arrow_table

def to_pylist(arrow_table):
    # pyarrow Tables have no .keys(); column names come from the schema.
    pylist = list()
    columns = arrow_table.schema.names
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}
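to_pylist() in the comment above is the inverse pivot: columns back into row dicts. That indexing logic can be checked against a plain dict of lists before involving Arrow tables (rows_from_columns is a hypothetical name; against a real pyarrow Table each indexed value is an Arrow scalar, so an implementation there would also call .as_py()):

```python
def rows_from_columns(columns):
    # Pivot a dict of equal-length column lists into a list of row dicts.
    names = list(columns)
    num_rows = len(columns[names[0]]) if names else 0
    return [{name: columns[name][i] for name in names} for i in range(num_rows)]

data = {"name": ["Tom", "Mark"], "age": [10, 5]}
print(rows_from_columns(data))  # [{'name': 'Tom', 'age': 10}, {'name': 'Mark', 'age': 5}]
```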
[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721986#comment-16721986 ] David Lee edited comment on ARROW-4032 at 12/15/18 3:53 AM: Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. {code:java} def from_pylist(pylist, schema, safe=True): arrow_columns = list() for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, schema.names) return arrow_table def to_pylist(arrow_table): od = pyarrow.Table.to_pydict(arrow_table) pylist = list() columns = list(arrow_table.keys()) rows = len(arrow_table[columns[0]]) for row in range(rows): pylist.append({key: arrow_table[key][row] for key in columns}) return pylist {code} was (Author: davlee1...@yahoo.com): Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. {code:java} def from_pylist(pylist, schema, safe=True): arrow_columns = list() for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) return arrow_table def to_pylist(arrow_table): od = pyarrow.Table.to_pydict(arrow_table) pylist = list() columns = list(arrow_table.keys()) rows = len(arrow_table[columns[0]]) for row in range(rows): pylist.append({key: arrow_table[key][row] for key in columns}) return pylist {code}
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Summary: [Python] New pyarrow.Table.from_pylist() function (was: [Python] New pyarrow.Table.from_pydict() function)
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Description: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exists and there are inherent problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) test_list = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pylist(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe)) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pylist(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pylist(test_list, schema=test_schema) {code} was: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exists and there are inherent problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) test_list = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe)) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(test_list, schema=test_schema) {code}
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Description: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exists and there are inherent problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) test_list = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe)) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(test_list, schema=test_schema) {code} was: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exists and there are inherent problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe)) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(pylist, schema=test_schema) {code}
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Description: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe)) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(pylist, schema=test_schema) {code} was: Here's a proposal to create a pyarrow.Table.from_pydict() function. 
Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(pylist, schema=test_schema) {code} > [Python] New pyarrow.Table.from_pydict() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. 
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Description: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(pylist, schema=test_schema) {code} was: Here's a proposal to create a pyarrow.Table.from_pydict() function. 
Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(pylist, schema=test_schema) {code} > [Python] New pyarrow.Table.from_pydict() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. 
[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721745#comment-16721745 ] David Lee commented on ARROW-4032: -- Updated the sample code to include schema and safe options. Passing in a schema allows conversions from microseconds to milliseconds. > [Python] New pyarrow.Table.from_pydict() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python > Reporter: David Lee > Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. > Right now only pyarrow.Table.from_pandas() exists and there are inherent > problems using Pandas with NULL support for ints and booleans > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, integer {{NA}} values and {{NA}} type promotions: > Sample Python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > # convert microseconds to milliseconds. More support for MS in parquet. 
> today = datetime.now() > today = datetime(today.year, today.month, today.day, today.hour, > today.minute, today.second, today.microsecond - today.microsecond % 1000) > pylist = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": today} > ] > def from_pydict(pylist, schema=None, columns=None, safe=True): > arrow_columns = list() > if schema: > columns = schema.names > if not columns: > return > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for v in > pylist])) > arrow_table = pa.Table.from_arrays(arrow_columns, columns) > if schema: > arrow_table = arrow_table.cast(schema, safe=safe) > return arrow_table > test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', > 'dummy']) > test_schema = pa.schema([ > pa.field('name', pa.string()), > pa.field('age', pa.int16()), > pa.field('city', pa.string()), > pa.field('birthday', pa.timestamp('ms')) > ]) > test2 = from_pydict(pylist, schema=test_schema) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
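The heart of the proposed from_pydict() is a row-to-column pivot that fills missing keys with None. That step can be illustrated without pyarrow; a pure-Python sketch (the `pa.array` calls are omitted so it runs standalone):

```python
def pivot_rows(rows, columns):
    """Turn a list of row dicts into a dict of equal-length column lists,
    substituting None where a row lacks the key."""
    return {col: [row.get(col) for row in rows] for col in columns}

rows = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
]
cols = pivot_rows(rows, ["name", "age", "city"])
print(cols["city"])  # [None, 'San Francisco']
```

Because the None fill happens before any Arrow array is built, integer columns stay integers with nulls, avoiding the NaN/float promotion that pandas would introduce.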
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Description: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(pylist, schema=test_schema) {code} was: Here's a proposal to create a pyarrow.Table.from_pydict() function. 
Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime test_list = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": datetime.now()} ] def from_pydict(pylist, columns): arrow_columns = list() for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) return arrow_table test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy']) {code} Additional work would be needed to pass in a schema object if you want to refine data types further. I think the existing code from from_pandas() to do that would work. > [Python] New pyarrow.Table.from_pydict() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. > Right now only pyarrow.Table.from_pandas() exist and there are inherit > problems using Pandas with NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > # convert microseconds to milliseconds. More support for MS in parquet. 
> today = datetime.now() > today = datetime(today.year, today.month, today.day, today.hour, > today.minute, today.second, today.microsecond - today.microsecond % 1000) > pylist = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": today} > ] > def from_pydict(pylist, schema=None, columns=None, safe=True): > arrow_columns = list() > if schema: > columns = schema.names > if not columns: > return > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for v in > pylist])) > arrow_table = pa.Table.from_arrays(arrow_columns, columns) > if schema: > arrow_table = arrow_table.cast(schema, safe=safe) > return arrow_table > test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', > 'dummy']) > test_schema = pa.schema([ > pa.field('name', pa.string()), > pa.field('age', pa.int16()), > pa.field('city', pa.string()), > pa.field('birthday', pa.timestamp('ms')) > ]) > test2 = from_pydict(pylist, schema=test_schema) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Description: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime test_list = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": datetime.now()} ] def from_pydict(pylist, columns): arrow_columns = list() for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) return arrow_table test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy']) {code} was: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. 
{code:java} import pyarrow as pa from datetime import datetime pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": datetime.now()} ] def from_pydict(pylist, columns): arrow_columns = list() for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) return arrow_table test = from_pydict(pylist, ['name' , 'age', 'city', 'birthday', 'dummy']) {code} > [Python] New pyarrow.Table.from_pydict() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. > Right now only pyarrow.Table.from_pandas() exist and there are inherit > problems using Pandas with NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > test_list = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": datetime.now()} > ] > def from_pydict(pylist, columns): > arrow_columns = list() > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for > v in pylist])) > arrow_table = pa.Table.from_arrays(arrow_columns, columns) > return arrow_table > test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy']) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Description: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime test_list = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": datetime.now()} ] def from_pydict(pylist, columns): arrow_columns = list() for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) return arrow_table test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy']) {code} Additional work would be needed to pass in a schema object if you want to refine data types further. I think the existing code from from_pandas() to do that would work. was: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. 
{code:java} import pyarrow as pa from datetime import datetime test_list = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": datetime.now()} ] def from_pydict(pylist, columns): arrow_columns = list() for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) return arrow_table test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy']) {code} > [Python] New pyarrow.Table.from_pydict() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. > Right now only pyarrow.Table.from_pandas() exist and there are inherit > problems using Pandas with NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > test_list = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": datetime.now()} > ] > def from_pydict(pylist, columns): > arrow_columns = list() > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for > v in pylist])) > arrow_table = pa.Table.from_arrays(arrow_columns, columns) > return arrow_table > test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy']) > {code} > Additional work would be needed to pass in a schema object if you want to > refine data types further. I think the existing code from from_pandas() to do > that would work. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
David Lee created ARROW-4032: Summary: [Python] New pyarrow.Table.from_pydict() function Key: ARROW-4032 URL: https://issues.apache.org/jira/browse/ARROW-4032 Project: Apache Arrow Issue Type: Task Components: Python Reporter: David Lee Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exists, and there are inherent problems using Pandas with NULL support for ints and booleans [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]: {{NaN}}, integer {{NA}} values and {{NA}} type promotions. Sample Python code showing how this would work:
{code:java}
import pyarrow as pa
from datetime import datetime

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(pylist, ['name', 'age', 'city', 'birthday', 'dummy'])
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3907) [Python] from_pandas errors when schemas are used with lower resolution timestamps
[ https://issues.apache.org/jira/browse/ARROW-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720638#comment-16720638 ] David Lee commented on ARROW-3907: -- Yeah, I'm trying to figure out the best way to preserve ints when converting JSON to Parquet. The problem is more or less summarized here: [https://pandas.pydata.org/pandas-docs/stable/gotchas.html] There are a lot of gotchas with each step. json.loads() works fine. pandas.DataFrame() is a problem if every record doesn't contain the same columns. Using pandas.DataFrame.reindex() to add missing columns adds a bunch of NaN values, and adding NaN values force-changes a column's dtype from INT64 to FLOAT64. NaNs are a problem to begin with, because if you convert to Parquet you end up with zeros instead of nulls. Running pandas.DataFrame.reindex(fill_value=None) doesn't work because passing in None is equivalent to calling pandas.DataFrame.reindex() without any params. The only way to replace NaNs with None is with pandas.DataFrame.where(). After replacing the NaNs, you can then change the dtype of the column from FLOAT64 back to INT64. It's basically a lot of hoops to jump through to preserve your original JSON int as a Parquet int. Maybe the best solution is to create a pyarrow.Table.from_pydict() function that builds an Arrow table directly from a Python dictionary. We have this gap with pyarrow.Table.to_pydict(), pyarrow.Table.to_pandas() and pyarrow.Table.from_pandas(). > [Python] from_pandas errors when schemas are used with lower resolution > timestamps > -- > > Key: ARROW-3907 > URL: https://issues.apache.org/jira/browse/ARROW-3907 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Affects Versions: 0.11.1 > Reporter: David Lee > Priority: Major > Fix For: 0.11.1 > > > When passing in a schema object to from_pandas a resolution error occurs if > the schema uses a lower resolution timestamp. 
Do we need to also add > "coerce_timestamps" and "allow_truncated_timestamps" parameters found in > write_table() to from_pandas()? > Error: > pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would > lose data: 1532015191753713000', 'Conversion failed for column modified with > type datetime64[ns]') > Code: > > {code:java} > processed_schema = pa.schema([ > pa.field('Id', pa.string()), > pa.field('modified', pa.timestamp('ms')), > pa.field('records', pa.int32()) > ]) > pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
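The "would lose data" error above comes from a safety check: a nanosecond timestamp only casts cleanly to milliseconds when the value is an exact multiple of 10^6 ns. A sketch of that check in plain Python (the constant below is the value from the error message; this models Arrow's safe-cast behavior, not its actual implementation):

```python
NS_PER_MS = 1_000_000

def safe_ns_to_ms(ns: int) -> int:
    """Convert a nanosecond timestamp to milliseconds, raising (as Arrow's
    safe cast does) if sub-millisecond information would be discarded."""
    if ns % NS_PER_MS != 0:
        raise ValueError(f"Casting from timestamp[ns] to timestamp[ms] would lose data: {ns}")
    return ns // NS_PER_MS

# The value from the error message has 713000 trailing nanoseconds, so it fails:
try:
    safe_ns_to_ms(1532015191753713000)
except ValueError:
    print("unsafe")
print(safe_ns_to_ms(1532015191753000000))  # 1532015191753
```

An unsafe cast (safe=False) would instead truncate the trailing nanoseconds silently, which is why exposing a coercion/truncation option matters.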
[jira] [Commented] (ARROW-3992) pyarrow compile from source issues on RedHat 7.4
[ https://issues.apache.org/jira/browse/ARROW-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715888#comment-16715888 ] David Lee commented on ARROW-3992: -- Ok this worked. The instructions are missing one line after conda create: conda activate pyarrow-dev > pyarrow compile from source issues on RedHat 7.4 > > > Key: ARROW-3992 > URL: https://issues.apache.org/jira/browse/ARROW-3992 > Project: Apache Arrow > Issue Type: Bug >Reporter: David Lee >Priority: Minor > > Opening a ticket for: [https://github.com/apache/arrow/issues/2281] after > running into the same problems with RedHat 7.4. > [https://arrow.apache.org/docs/python/development.html#development] > Additional steps taken: > Added double-conversion, glog and hypothesis: > {code:java} > conda create -y -q -n pyarrow-dev \ > python=3.6 numpy six setuptools cython pandas pytest double-conversion \ > cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib glog hypothesis\ > gflags brotli jemalloc lz4-c zstd -c conda-forge > {code} > > Added export LD_LIBRARY_PATH to conda lib64 before running py.test pyarrow: > {code:java} > export LD_LIBRARY_PATH=/home/my_login/anaconda3/envs/pyarrow-dev/lib64 > py.test pyarrow > {code} > > Added extra symlinks with a period at the end to fix string concatenation > issues. Running setup.py for the first time didn't need this, but running > setup.py a second time would error out with: > {code:java} > CMake Error: File > /home/my_login/anaconda3/envs/pyarrow-dev/lib64/libarrow.so. does not exist. > {code} > > There is an extra period at the end of the *.so files so I had to make > symlinks with extra periods. > {code:java} > ln -s libparquet.so.12.0.0 libparquet.so. > ln -s libplasma.so.12.0.0 libplasma.so. > ln -s libarrow.so.12.0.0 libarrow.so. > ln -s libarrow_python.so.12.0.0 libarrow_python.so. 
> {code} > > Creating a wheel file using --with-plasma gives the following error: > {code:java} > error: [Errno 2] No such file or directory: 'release/plasma_store_server' > {code} > Had to create the wheel file without plasma, but it isn't packaged correctly. > The hacked symlinked shared libs are included instead of libarrow.so.12 > {code:java} > copying build/lib.linux-x86_64-3.6/pyarrow/libarrow.so. -> > build/bdist.linux-x86_64/wheel/pyarrow > copying build/lib.linux-x86_64-3.6/pyarrow/libarrow.so -> > build/bdist.linux-x86_64/wheel/pyarrow > copying build/lib.linux-x86_64-3.6/pyarrow/libarrow_python.so. -> > build/bdist.linux-x86_64/wheel/pyarrow > copying build/lib.linux-x86_64-3.6/pyarrow/libarrow_python.so -> > build/bdist.linux-x86_64/wheel/pyarrow > copying build/lib.linux-x86_64-3.6/pyarrow/libplasma.so. -> > build/bdist.linux-x86_64/wheel/pyarrow > copying build/lib.linux-x86_64-3.6/pyarrow/libplasma.so -> > build/bdist.linux-x86_64/wheel/pyarrow > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
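The four `ln -s` commands in the workaround above can be scripted as a loop; a sketch using a temporary directory and stand-in files (the real libraries live under the conda env's lib64, and the version number is assumed):

```shell
#!/bin/sh
# Recreate the trailing-dot symlink workaround for each Arrow shared library.
workdir="$(mktemp -d)"
cd "$workdir"
for lib in libparquet libplasma libarrow libarrow_python; do
    touch "${lib}.so.12.0.0"                 # stand-in for the real shared library
    ln -s "${lib}.so.12.0.0" "${lib}.so."    # symlink name keeps the extra trailing period
done
ls -l "$workdir"
```

This only demonstrates the naming pattern; on a real build you would run the loop inside `$CONDA_PREFIX/lib64` against the actual `.so` files.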
[jira] [Closed] (ARROW-3992) pyarrow compile from source issues on RedHat 7.4
[ https://issues.apache.org/jira/browse/ARROW-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee closed ARROW-3992. Updated instructions work.

> pyarrow compile from source issues on RedHat 7.4
>
> Key: ARROW-3992
> URL: https://issues.apache.org/jira/browse/ARROW-3992
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: David Lee
> Priority: Minor
>
> Opening a ticket for: [https://github.com/apache/arrow/issues/2281] after running into the same problems with RedHat 7.4.
> [https://arrow.apache.org/docs/python/development.html#development]
> Additional steps taken:
> Added double-conversion, glog and hypothesis:
> {code:java}
> conda create -y -q -n pyarrow-dev \
> python=3.6 numpy six setuptools cython pandas pytest double-conversion \
> cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib glog hypothesis \
> gflags brotli jemalloc lz4-c zstd -c conda-forge
> {code}
> Exported LD_LIBRARY_PATH to the conda env's lib64 before running py.test pyarrow:
> {code:java}
> export LD_LIBRARY_PATH=/home/my_login/anaconda3/envs/pyarrow-dev/lib64
> py.test pyarrow
> {code}
> Added extra symlinks with a period at the end to fix string concatenation issues. Running setup.py for the first time didn't need this, but running setup.py a second time would error out with:
> {code:java}
> CMake Error: File /home/my_login/anaconda3/envs/pyarrow-dev/lib64/libarrow.so. does not exist.
> {code}
> The build references *.so files with an extra period at the end, so I had to make symlinks with trailing periods:
> {code:java}
> ln -s libparquet.so.12.0.0 libparquet.so.
> ln -s libplasma.so.12.0.0 libplasma.so.
> ln -s libarrow.so.12.0.0 libarrow.so.
> ln -s libarrow_python.so.12.0.0 libarrow_python.so.
> {code}
> Creating a wheel file using --with-plasma gives the following error:
> {code:java}
> error: [Errno 2] No such file or directory: 'release/plasma_store_server'
> {code}
> Had to create the wheel file without plasma, but it isn't packaged correctly. The hacked symlinked shared libs are included instead of libarrow.so.12:
> {code:java}
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow.so. -> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow.so -> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow_python.so. -> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libarrow_python.so -> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libplasma.so. -> build/bdist.linux-x86_64/wheel/pyarrow
> copying build/lib.linux-x86_64-3.6/pyarrow/libplasma.so -> build/bdist.linux-x86_64/wheel/pyarrow
> {code}
>
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
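The trailing-period symlink step above can be scripted so it tracks whatever library version the build produced. This is a sketch, not part of the ticket; the function name and the example conda env path are illustrative.

```shell
# make_trailing_dot_links DIR: for every versioned shared library in DIR
# (e.g. libarrow.so.12.0.0), create the "libarrow.so." symlink that the
# second CMake run was looking for.
make_trailing_dot_links() {
    dir="$1"
    for target in "$dir"/lib*.so.*.*.*; do
        [ -e "$target" ] || continue
        base=$(basename "$target")
        link="${base%%.so.*}.so."          # libarrow.so.12.0.0 -> libarrow.so.
        ln -sf "$base" "$dir/$link"
    done
}

# Example (path is an assumption; point it at your conda env's lib64):
# make_trailing_dot_links "$HOME/anaconda3/envs/pyarrow-dev/lib64"
```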
[jira] [Resolved] (ARROW-3992) pyarrow compile from source issues on RedHat 7.4
[ https://issues.apache.org/jira/browse/ARROW-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee resolved ARROW-3992. -- Resolution: Not A Problem. Updated install instructions work.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3992) pyarrow compile from source issues on RedHat 7.4
[ https://issues.apache.org/jira/browse/ARROW-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-3992: - Description: edited to add the wheel-packaging details.
[jira] [Updated] (ARROW-3992) pyarrow compile from source issues on RedHat 7.4
[ https://issues.apache.org/jira/browse/ARROW-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-3992: - Description: edited.
[jira] [Created] (ARROW-3992) pyarrow compile from source issues on RedHat 7.4
David Lee created ARROW-3992: Summary: pyarrow compile from source issues on RedHat 7.4 Key: ARROW-3992 URL: https://issues.apache.org/jira/browse/ARROW-3992 Project: Apache Arrow Issue Type: Bug Reporter: David Lee
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3918) [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps
[ https://issues.apache.org/jira/browse/ARROW-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713374#comment-16713374 ] David Lee commented on ARROW-3918:
--
Fixed in master: https://github.com/apache/arrow/commit/10b204ec2532d8e30be157bcfd3af53d41f42ffb

> [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps
>
> Key: ARROW-3918
> URL: https://issues.apache.org/jira/browse/ARROW-3918
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.11.1
> Reporter: David Lee
> Priority: Major
>
> Error: Table schema does not match schema used to create file.
> The 0.11.1 release added these parameters to pyarrow.parquet.write_table(), but they are missing from pyarrow.parquet.ParquetWriter.write_table(). I'm seeing mismatches between the table schema and the file schema, but they are identical in the error message, with modified: timestamp[ms] column types in both schemas. The only thing which looks odd is the pandas metadata, which has a modified column with a pandas datatype of datetime and a numpy datatype of datetime64[ns].
>
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3956) [Python] ParquetWriter.write_table isn't working
David Lee created ARROW-3956: Summary: [Python] ParquetWriter.write_table isn't working Key: ARROW-3956 URL: https://issues.apache.org/jira/browse/ARROW-3956 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.11.1 Reporter: David Lee

ParquetWriter.write_table is erroring out with "table schema does not match schema used to create file", but the schemas do match.

Error:
{code:java}
>>> writer.write_table(arrow_table)
Traceback (most recent call last):
File "", line 1, in
File "../lib/python3.6/site-packages/pyarrow/parquet.py", line 374, in write_table
raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
col1: int64
col2: int64
metadata
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "col1", "field_name": "col1", "pandas_type": "int64", "numpy_ty'
b'pe": "int64", "metadata": null}, {"name": "col2", "field_name": '
b'"col2", "pandas_type": "int64", "numpy_type": "int64", "metadata'
b'": null}], "pandas_version": "0.23.4"}'}
vs.
file:
col1: int64
col2: int64
{code}

Test Script:
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
arrow_table = pa.Table.from_pandas(df, preserve_index=False)

pq.write_table(arrow_table, "test.parquet")

test_schema = pa.schema([
    pa.field('col1', pa.int64()),
    pa.field('col2', pa.int64())
])

writer = pq.ParquetWriter("test2.parquet", use_dictionary=True, schema=test_schema, compression='snappy')
writer.write_table(arrow_table)
writer.close()
{code}

write_table() works, but ParquetWriter.write_table does not. I think something is wrong with the schema object.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (ARROW-3918) [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps
[ https://issues.apache.org/jira/browse/ARROW-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee closed ARROW-3918. Resolution: Unresolved. Closing and re-opening a new ticket. Looks like write_table is broken.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (ARROW-3907) [Python] from_pandas errors when schemas are used with lower resolution timestamps
[ https://issues.apache.org/jira/browse/ARROW-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee closed ARROW-3907. Resolution: Not A Problem. Fix Version/s: 0.11.1

Closing for now. Not convinced safe=False is the best solution to address timestamp resolution. If a schema is used, it should be clear the intent is to convert pandas nanoseconds to a lower resolution. I think the same can be said for other types of conversions, like floats to int.

> [Python] from_pandas errors when schemas are used with lower resolution timestamps
>
> Key: ARROW-3907
> URL: https://issues.apache.org/jira/browse/ARROW-3907
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.11.1
> Reporter: David Lee
> Priority: Major
> Fix For: 0.11.1
>
> When passing in a schema object to from_pandas, a resolution error occurs if the schema uses a lower resolution timestamp. Do we need to also add the "coerce_timestamps" and "allow_truncated_timestamps" parameters found in write_table() to from_pandas()?
> Error:
> pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would lose data: 1532015191753713000', 'Conversion failed for column modified with type datetime64[ns]')
> Code:
> {code:java}
> processed_schema = pa.schema([
>     pa.field('Id', pa.string()),
>     pa.field('modified', pa.timestamp('ms')),
>     pa.field('records', pa.int32())
> ])
> pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False)
> {code}
>
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-3918) [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps
[ https://issues.apache.org/jira/browse/ARROW-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706664#comment-16706664 ] David Lee edited comment on ARROW-3918 at 12/3/18 5:17 AM:
---
Passed them into ParquetWriter and it still gives the same error:

File "../python3.6/site-packages/pyarrow/parquet.py", line 374, in write_table
raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
Id: string
modified: timestamp[ms]
converter: string
records: int32
metadata
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "Id", "field_name": "Id", "'
b'pandas_type": "unicode", "numpy_type": "object", "metadata": nul'
b'l}, {"name": "modified", "field_name": "modified", "pandas_type"'
b': "datetime", "numpy_type": "datetime64[ns]", "metadata": null},'
b' {"name": "converter", "field_name": "converter", "pandas_type":'
b' "unicode", "numpy_type": "object", "metadata": null}, {"name": '
b'"records", "field_name": "records", "pandas_type": "int32", "num'
b'py_type": "int64", "metadata": null}], "pandas_version": "0.23.4'
b'"}'}
vs.
file:
Id: string
modified: timestamp[ms]
converter: string
records: int32

Code:
{code:java}
processed_schema = pa.schema([
    pa.field('Id', pa.string()),
    pa.field('modified', pa.timestamp('ms')),
    pa.field('converter', pa.string()),
    pa.field('records', pa.int32())
])
...
arrow_tables.append(pa.Table.from_pandas(df, schema=processed_schema, preserve_index=False, safe=False))
...
if len(arrow_tables) > 0:
    writer = pq.ParquetWriter(os.path.join(self.conf['work_dir'], processed_file),
                              schema=processed_schema,
                              use_dictionary=True,
                              compression='snappy',
                              coerce_timestamps='ms',
                              allow_truncated_timestamps=True)
    for v in arrow_tables:
        writer.write_table(v)
    writer.close()
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3918) [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps
[ https://issues.apache.org/jira/browse/ARROW-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706664#comment-16706664 ] David Lee commented on ARROW-3918: -- Passed them into ParquetWriter and it still gives the same error (details in the edited comment above).
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3907) [Python] from_pandas errors when schemas are used with lower resolution timestamps
[ https://issues.apache.org/jira/browse/ARROW-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705210#comment-16705210 ] David Lee commented on ARROW-3907:
--
Passing in safe=False works, but it is pretty hacky. Another problem also pops up with ParquetWriter.write_table(); I'll open a separate ticket for that one. The conversion from pandas nanoseconds to whatever timestamp resolution is declared using pa.timestamp() in the schema object worked fine in 0.11.0. Having to pass in coerce_timestamps, allow_truncated_timestamps and safe is pretty messy.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3918) [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps
David Lee created ARROW-3918: Summary: [Python] ParquetWriter.write_table doesn't support coerce_timestamps or allow_truncated_timestamps Key: ARROW-3918 URL: https://issues.apache.org/jira/browse/ARROW-3918 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.11.1 Reporter: David Lee
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3907) [Python] from_pandas errors when schemas are used with lower resolution timestamps
David Lee created ARROW-3907: Summary: [Python] from_pandas errors when schemas are used with lower resolution timestamps Key: ARROW-3907 URL: https://issues.apache.org/jira/browse/ARROW-3907 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.11.1 Reporter: David Lee
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-3728) [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
[ https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703544#comment-16703544 ] David Lee edited comment on ARROW-3728 at 11/29/18 5:33 PM: I'm finding the same problem as well. This is similar to: https://jira.apache.org/jira/browse/ARROW-3065 I think the underlying pandas schema metadata has changed between pyarrow releases, so I can't merge old files with new files. On the topic of merging parquet files: this is something I do to create 128 MB parquet files that match the HDFS block size configured in Hadoop. It is not possible to predetermine the size of a parquet file when you mix in dictionary encoding + snappy compression, but you can work around that by merging smaller parquet files together as row groups. Save two million rows of data per parquet file; this ends up creating multiple parquet files around 10 MB each after encoding and compression. Figure out which files should be merged by adding their file sizes together until the sum comes in just under 128 MB, i.e. between 95% and 100% of 128 * 1024 * 1024 bytes. Read each parquet file in as an arrow table and write the arrow table to a new file as a row group. This is both fast and memory efficient since you only need to hold two million rows of data in memory at a time. On a separate topic, I should probably open a new issue / enhancement request. A. Would it be possible to read a row group out of a parquet file, modify it as a pandas DataFrame, and then write it back to the original parquet file? B. Would it be possible to add a hidden boolean status column to every parquet file? A status of True would mean the row is valid; a status of False would mean the row is deleted. Dremio uses an internal flag in Arrow data sets when doing SQL union operations. It is more efficient to flag a record as deleted instead of trying to delete it out of a columnar memory format. 
If we could introduce something for columnar parquet you could in theory update parquet files by flagging the old record as deleted and reinserting the replacement record at the end of the existing file without having to shuffle / re-write the entire file. > [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch > --- > > Key: ARROW-3728 > URL: https://issues.apache.org/jira/browse/ARROW-3728 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0, 0.11.0, 0.11.1 > Environment: Python 3.6.3 > OSX 10.14 >Reporter: Micah Williamson >Assignee: Krisztian Szucs >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.12.0 > > Time Spent: 1h > Remaining Estimate: 0h
[jira] [Commented] (ARROW-3728) [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
[ https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703544#comment-16703544 ] David Lee commented on ARROW-3728: -- I'm finding the same problem as well. This is similar to: https://jira.apache.org/jira/browse/ARROW-3065 > [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch > --- > > Key: ARROW-3728 > URL: https://issues.apache.org/jira/browse/ARROW-3728 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0, 0.11.0, 0.11.1 > Environment: Python 3.6.3 > OSX 10.14 >Reporter: Micah Williamson >Assignee: Krisztian Szucs >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.12.0 > > Time Spent: 1h > Remaining Estimate: 0h > > From: > https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch > > I am trying to merge multiple parquet files into one. Their schemas are > identical field-wise but my {{ParquetWriter}} is complaining that they are > not. After some investigation I found that the pandas meta in the schemas are > different, causing this error. > > Sample- > {code:python} > import pyarrow.parquet as pq > pq_tables=[] > for file_ in files: > pq_table = pq.read_table(f'{MESS_DIR}/{file_}') > pq_tables.append(pq_table) > if writer is None: > writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema, > use_deprecated_int96_timestamps=True) > writer.write_table(table=pq_table) > {code} > The error- > {code} > Traceback (most recent call last): > File "{PATH_TO}/main.py", line 68, in lambda_handler > writer.write_table(table=pq_table) > File > "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", > line 335, in write_table > raise ValueError(msg) > ValueError: Table schema does not match schema used to create file: > {code}
[jira] [Commented] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata
[ https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626470#comment-16626470 ] David Lee commented on ARROW-3065: -- In pyarrow 0.9.0 the pandas metadata still says float64, but it works: {code:python} >>> tbl1.schema col1: string col2: string metadata {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":' b' "col1", "field_name": "col1", "pandas_type": "unicode", "numpy_' b'type": "object", "metadata": null}, {"name": "col2", "field_name' b'": "col2", "pandas_type": "unicode", "numpy_type": "object", "me' b'tadata": null}], "pandas_version": "0.23.0"}'} >>> tbl2.schema col1: string col2: string metadata {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":' b' "col1", "field_name": "col1", "pandas_type": "unicode", "numpy_' b'type": "float64", "metadata": null}, {"name": "col2", "field_nam' b'e": "col2", "pandas_type": "unicode", "numpy_type": "object", "m' b'etadata": null}], "pandas_version": "0.23.0"}'} >>> tbl3.schema col1: string col2: string metadata {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":' b' "col1", "field_name": "col1", "pandas_type": "unicode", "numpy_' b'type": "object", "metadata": null}, {"name": "col2", "field_name' b'": "col2", "pandas_type": "unicode", "numpy_type": "object", "me' b'tadata": null}], "pandas_version": "0.23.0"}'} >>> tbl3[0] chunk 0: [ 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' ] chunk 1: [ '', '', '', '', '', '', '', '' ] {code} In the 0.10.0 example above that doesn't produce the error, tbl3[0] comes back with: {code:python} >>> tbl3[0] [ [ "a", "b", "c", "d", "e", "f", "g", "h" ], [ "", "", "", "", "", "", "", "" ] ] {code} > [Python] concat_tables() failing from bad Pandas Metadata > - > > Key: ARROW-3065 > URL: https://issues.apache.org/jira/browse/ARROW-3065 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 >Reporter: David 
Lee >Priority: Major > Fix For: 0.11.0 > > > Looks like the major bug from > https://issues.apache.org/jira/browse/ARROW-1941 is back... > After I downgraded from 0.10.0 to 0.9.0, the error disappeared.. > {code:python} > new_arrow_table = pa.concat_tables(my_arrow_tables) > File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables > File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Schema at index 2 was different: > {code} > In order to debug this I saved the first 4 arrow tables to 4 parquet files > and inspected the parquet files. The parquet schema is identical, but the > Pandas Metadata is different. > {code:python} > for i in range(5): > pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet") > {code} > It looks like a column which contains empty strings is getting typed as > float64. > {code:python} > >>> test1.schema > HoldingDetail_Id: string > metadata > > {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [ > {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, > >>> test1[0] > > [ > [ > "Z4", > "SF", > "J7", > "W6", > "L7", > "Q9", > "NE", > "F7", > >>> test2.schema > HoldingDetail_Id: string > metadata > > {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [ > {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": > "unicode", "numpy_type": "float64", "metadata": null}, > >>> test2[0] > > [ > [ > "", > "", > "", > "", > "", > "", > "", > "", > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata
[ https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626336#comment-16626336 ] David Lee edited comment on ARROW-3065 at 9/24/18 8:17 PM: --- This test fails. Tested against 0.10.0; it works in 0.9.0. In one table the column doesn't exist to start and is added using pandas.reindex(). The reasoning behind this is that the original file(s) being converted to parquet may or may not contain all 100+ columns. {quote} {code:python} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq schema = pa.schema([ pa.field('col1', pa.string()), pa.field('col2', pa.string()), ]) df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")]) df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")]) df1 = df1.reindex(columns=schema.names) df2 = df2.reindex(columns=schema.names) tbl1 = pa.Table.from_pandas(df1, schema=schema, preserve_index=False) tbl2 = pa.Table.from_pandas(df2, schema=schema, preserve_index=False) tbl3 = pa.concat_tables([tbl1, tbl2]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Schema at index 1 was different: {code} {quote}
[jira] [Comment Edited] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata
[ https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626336#comment-16626336 ] David Lee edited comment on ARROW-3065 at 9/24/18 8:17 PM: --- This test fails.. Tested against 0.10.0.. Works in 0.9.0. In one table the column doesn't exist to start and is added using pandas.reindex(). The reasoning behind this is the original file(s) being converted to parquet may or may not contain all 100+ columns. {quote} {code:java} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq schema = pa.schema([ pa.field('col1', pa.string()), pa.field('col2', pa.string()), ]) df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")]) df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")]) df1 = df1.reindex(columns=schema.names) df2 = df2.reindex(columns=schema.names) tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False) tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False) tbl3 = pa.concat_tables([tbl1, tbl2]) Traceback (most recent call last): {{ File "", line 1, in }} {{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}} {{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}} pyarrow.lib.ArrowInvalid: Schema at index 1 was different: {code} {quote} was (Author: davlee1...@yahoo.com): This test fails.. Tested against 0.10.0.. Works in 0.9.0. In one table the column doesn't exist to start and is added using pandas.reindex(). The reasoning behind this is the original file(s) being converted to parquet may or may not contain all 100+ columns. 
{quote} {code:java} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq schema = pa.schema([ pa.field('col1', pa.string()), pa.field('col2', pa.string()), ]) df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")]) df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")]) df1 = df1.reindex(columns=schema.names) df2 = df2.reindex(columns=schema.names) tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False) tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False) tbl3 = pa.concat_tables([tbl1, tbl2]) Traceback (most recent call last): {{ File "", line 1, in }} {{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}} {{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}} pyarrow.lib.ArrowInvalid: Schema at index 1 was different: {code} {quote} > [Python] concat_tables() failing from bad Pandas Metadata > - > > Key: ARROW-3065 > URL: https://issues.apache.org/jira/browse/ARROW-3065 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 >Reporter: David Lee >Priority: Major > Fix For: 0.12.0 > > > Looks like the major bug from > https://issues.apache.org/jira/browse/ARROW-1941 is back... > After I downgraded from 0.10.0 to 0.9.0, the error disappeared.. > {code:python} > new_arrow_table = pa.concat_tables(my_arrow_tables) > File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables > File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Schema at index 2 was different: > {code} > In order to debug this I saved the first 4 arrow tables to 4 parquet files > and inspected the parquet files. The parquet schema is identical, but the > Pandas Metadata is different. > {code:python} > for i in range(5): > pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet") > {code} > It looks like a column which contains empty strings is getting typed as > float64. 
> {code:python} > >>> test1.schema > HoldingDetail_Id: string > metadata > > {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [ > {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, > >>> test1[0] > > [ > [ > "Z4", > "SF", > "J7", > "W6", > "L7", > "Q9", > "NE", > "F7", > >>> test2.schema > HoldingDetail_Id: string > metadata > > {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [ > {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": > "unicode", "numpy_type": "float64", "metadata": null}, > >>> test2[0] > > [ > [ > "", > "", > "", > "", > "", > "", > "", > "", > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata
[ https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626336#comment-16626336 ] David Lee edited comment on ARROW-3065 at 9/24/18 8:16 PM: --- This test fails.. Tested against 0.10.0.. Works in 0.9.0. In one table the column doesn't exist to start and is added using pandas.reindex(). The reasoning behind this is the original file(s) being converted to parquet may or may not contain all 100+ columns. {quote} {code:java} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq schema = pa.schema([ pa.field('col1', pa.string()), pa.field('col2', pa.string()), ]) df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")]) df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")]) df1 = df1.reindex(columns=schema.names) df2 = df2.reindex(columns=schema.names) tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False) tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False) tbl3 = pa.concat_tables([tbl1, tbl2]) Traceback (most recent call last): {{ File "", line 1, in }} {{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}} {{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}} pyarrow.lib.ArrowInvalid: Schema at index 1 was different: {code} {quote} was (Author: davlee1...@yahoo.com): This test fails.. Tested against 0.10.0.. Works in 0.9.0. In one table the column doesn't exist to start and is added using pandas.reindex(). The reasoning behind this is the original file(s) being converted to parquet may or may not contain all 100+ columns. 
{quote}import pandas as pd import pyarrow as pa import pyarrow.parquet as pq schema = pa.schema([ pa.field('col1', pa.string()), pa.field('col2', pa.string()), ]) df1 = pd.DataFrame([\{"col1": v, "col2": v} for v in list("abcdefgh")]) df2 = pd.DataFrame([\{"col2": v} for v in list("abcdefgh")]) df1 = df1.reindex(columns=schema.names) df2 = df2.reindex(columns=schema.names) tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False) tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False) tbl3 = pa.concat_tables([tbl1, tbl2]) Traceback (most recent call last): {\{ File "", line 1, in }} \{{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}} \{{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}} pyarrow.lib.ArrowInvalid: Schema at index 1 was different: {quote} > [Python] concat_tables() failing from bad Pandas Metadata > - > > Key: ARROW-3065 > URL: https://issues.apache.org/jira/browse/ARROW-3065 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 >Reporter: David Lee >Priority: Major > Fix For: 0.12.0 > > > Looks like the major bug from > https://issues.apache.org/jira/browse/ARROW-1941 is back... > After I downgraded from 0.10.0 to 0.9.0, the error disappeared.. > {code:python} > new_arrow_table = pa.concat_tables(my_arrow_tables) > File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables > File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Schema at index 2 was different: > {code} > In order to debug this I saved the first 4 arrow tables to 4 parquet files > and inspected the parquet files. The parquet schema is identical, but the > Pandas Metadata is different. > {code:python} > for i in range(5): > pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet") > {code} > It looks like a column which contains empty strings is getting typed as > float64. 
> {code:python} > >>> test1.schema > HoldingDetail_Id: string > metadata > > {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [ > {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, > >>> test1[0] > > [ > [ > "Z4", > "SF", > "J7", > "W6", > "L7", > "Q9", > "NE", > "F7", > >>> test2.schema > HoldingDetail_Id: string > metadata > > {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [ > {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": > "unicode", "numpy_type": "float64", "metadata": null}, > >>> test2[0] > > [ > [ > "", > "", > "", > "", > "", > "", > "", > "", > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata
[ https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626336#comment-16626336 ] David Lee edited comment on ARROW-3065 at 9/24/18 8:16 PM: --- This test fails.. Tested against 0.10.0.. Works in 0.9.0. In one table the column doesn't exist to start and is added using pandas.reindex(). The reasoning behind this is the original file(s) being converted to parquet may or may not contain all 100+ columns. {quote}import pandas as pd import pyarrow as pa import pyarrow.parquet as pq schema = pa.schema([ pa.field('col1', pa.string()), pa.field('col2', pa.string()), ]) df1 = pd.DataFrame([\{"col1": v, "col2": v} for v in list("abcdefgh")]) df2 = pd.DataFrame([\{"col2": v} for v in list("abcdefgh")]) df1 = df1.reindex(columns=schema.names) df2 = df2.reindex(columns=schema.names) tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False) tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False) tbl3 = pa.concat_tables([tbl1, tbl2]) Traceback (most recent call last): {\{ File "", line 1, in }} \{{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}} \{{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}} pyarrow.lib.ArrowInvalid: Schema at index 1 was different: {quote} was (Author: davlee1...@yahoo.com): This test fails.. Tested against 0.10.0.. Works in 0.9.0. In one table the column doesn't exist to start and is added using pandas.reindex(). The reasoning behind this is the original file(s) being converted to parquet may or may not contain all 100+ columns. 
{quote}import pandas as pd import pyarrow as pa import pyarrow.parquet as pq schema = pa.schema([ pa.field('col1', pa.string()), pa.field('col2', pa.string()), ]) df1 = pd.DataFrame([\{"col1": v, "col2": v} for v in list("abcdefgh")]) df2 = pd.DataFrame([\{"col2": v} for v in list("abcdefgh")]) df1 = df1.reindex(columns=schema.names) df2 = df2.reindex(columns=schema.names) tbl1 = pa.Table.from_pandas(df1, schema = schema, preserve_index=False) tbl2 = pa.Table.from_pandas(df2, schema = schema, preserve_index=False) tbl3 = pa.concat_tables([tbl1, tbl2]) Traceback (most recent call last): {\{ File "", line 1, in }} \{{ File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables}} \{{ File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status}} pyarrow.lib.ArrowInvalid: Schema at index 1 was different: {quote} > [Python] concat_tables() failing from bad Pandas Metadata > - > > Key: ARROW-3065 > URL: https://issues.apache.org/jira/browse/ARROW-3065 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 >Reporter: David Lee >Priority: Major > Fix For: 0.12.0 > > > Looks like the major bug from > https://issues.apache.org/jira/browse/ARROW-1941 is back... > After I downgraded from 0.10.0 to 0.9.0, the error disappeared.. > {code:python} > new_arrow_table = pa.concat_tables(my_arrow_tables) > File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables > File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Schema at index 2 was different: > {code} > In order to debug this I saved the first 4 arrow tables to 4 parquet files > and inspected the parquet files. The parquet schema is identical, but the > Pandas Metadata is different. > {code:python} > for i in range(5): > pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet") > {code} > It looks like a column which contains empty strings is getting typed as > float64. 
> {code:python} > >>> test1.schema > HoldingDetail_Id: string > metadata > > {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [ > {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, > >>> test1[0] > > [ > [ > "Z4", > "SF", > "J7", > "W6", > "L7", > "Q9", > "NE", > "F7", > >>> test2.schema > HoldingDetail_Id: string > metadata > > {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [ > {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": > "unicode", "numpy_type": "float64", "metadata": null}, > >>> test2[0] > > [ > [ > "", > "", > "", > "", > "", > "", > "", > "", > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata
[ https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626336#comment-16626336 ]

David Lee edited comment on ARROW-3065 at 9/24/18 8:15 PM:
---
This test fails when run against 0.10.0; it works in 0.9.0. In one table the column doesn't exist to start and is added using pandas reindex(). The reasoning behind this is that the original file(s) being converted to Parquet may or may not contain all 100+ columns.

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field('col1', pa.string()),
    pa.field('col2', pa.string()),
])

df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")])
df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])

df1 = df1.reindex(columns=schema.names)
df2 = df2.reindex(columns=schema.names)

tbl1 = pa.Table.from_pandas(df1, schema=schema, preserve_index=False)
tbl2 = pa.Table.from_pandas(df2, schema=schema, preserve_index=False)

tbl3 = pa.concat_tables([tbl1, tbl2])
{code}

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
{code}

> [Python] concat_tables() failing from bad Pandas Metadata
> ---------------------------------------------------------
>
> Key: ARROW-3065
> URL: https://issues.apache.org/jira/browse/ARROW-3065
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.10.0
> Reporter: David Lee
> Priority: Major
> Fix For: 0.12.0
>
> Looks like the major bug from https://issues.apache.org/jira/browse/ARROW-1941 is back. After I downgraded from 0.10.0 to 0.9.0, the error disappeared.
>
> {code:python}
> new_arrow_table = pa.concat_tables(my_arrow_tables)
>   File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
>   File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
> {code}
>
> In order to debug this I saved the first four Arrow tables to Parquet files and inspected them. The Parquet schema is identical, but the pandas metadata is different.
>
> {code:python}
> for i in range(5):
>     pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet")
> {code}
>
> It looks like a column which contains empty strings is getting typed as float64.
>
> {code:python}
> >>> test1.schema
> HoldingDetail_Id: string
> metadata
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
> >>> test1[0]
> [
>   "Z4", "SF", "J7", "W6", "L7", "Q9", "NE", "F7",
> >>> test2.schema
> HoldingDetail_Id: string
> metadata
> {b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
> {"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": "unicode", "numpy_type": "float64", "metadata": null},
> >>> test2[0]
> [
>   "", "", "", "", "", "", "", "",
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3065) [Python] concat_tables() failing from bad Pandas Metadata
[ https://issues.apache.org/jira/browse/ARROW-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626336#comment-16626336 ]

David Lee commented on ARROW-3065:
---
This test fails when run against 0.10.0; it works in 0.9.0.

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field('col1', pa.string()),
    pa.field('col2', pa.string()),
])

df1 = pd.DataFrame([{"col1": v, "col2": v} for v in list("abcdefgh")])
df2 = pd.DataFrame([{"col2": v} for v in list("abcdefgh")])

df1 = df1.reindex(columns=schema.names)
df2 = df2.reindex(columns=schema.names)

tbl1 = pa.Table.from_pandas(df1, schema=schema, preserve_index=False)
tbl2 = pa.Table.from_pandas(df2, schema=schema, preserve_index=False)

tbl3 = pa.concat_tables([tbl1, tbl2])
{code}

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 1 was different:
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3065) concat_tables() failing from bad Pandas Metadata
David Lee created ARROW-3065:

Summary: concat_tables() failing from bad Pandas Metadata
Key: ARROW-3065
URL: https://issues.apache.org/jira/browse/ARROW-3065
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.10.0
Reporter: David Lee
Fix For: 0.9.0

Looks like the major bug from https://issues.apache.org/jira/browse/ARROW-1941 is back. After I downgraded from 0.10.0 to 0.9.0, the error disappeared.

{code:python}
new_arrow_table = pa.concat_tables(my_arrow_tables)
  File "pyarrow/table.pxi", line 1562, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
{code}

In order to debug this I saved the first four Arrow tables to Parquet files and inspected them. The Parquet schema is identical, but the pandas metadata is different.

{code:python}
for i in range(5):
    pq.write_table(my_arrow_tables[i], "test" + str(i) + ".parquet")
{code}

It looks like a column which contains empty strings is getting typed as float64.

{code:python}
>>> test1.schema
HoldingDetail_Id: string
metadata
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
{"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": "unicode", "numpy_type": "object", "metadata": null},
>>> test1[0]
[
  "Z4", "SF", "J7", "W6", "L7", "Q9", "NE", "F7",
>>> test2.schema
HoldingDetail_Id: string
metadata
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [
{"name": "HoldingDetail_Id", "field_name": "HoldingDetail_Id", "pandas_type": "unicode", "numpy_type": "float64", "metadata": null},
>>> test2[0]
[
  "", "", "", "", "", "", "", "",
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)