[ https://issues.apache.org/jira/browse/ARROW-7986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche closed ARROW-7986.
----------------------------------------

> [Python] pa.Array.from_pandas cannot convert pandas.Series containing
> pyspark.ml.linalg.SparseVector
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7986
>                 URL: https://issues.apache.org/jira/browse/ARROW-7986
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C
>    Affects Versions: 0.14.1, 0.16.0
>         Environment: macOS 10.15.3;
> setup following the contribution guidelines for koalas:
> https://koalas.readthedocs.io/en/latest/development/contributing.html
>            Reporter: Nikolay Petrov
>            Priority: Major
>
> The code
> {code:java}
> import pandas as pd
> from pyspark.ml.linalg import SparseVector
> import pyarrow as pa
> sparse_values = {0: 0.1, 1: 1.1}
> sparse_vector = SparseVector(len(sparse_values), sparse_values)
> pds = pd.Series(sparse_vector)
> pa.array(pds){code}
> results in:
> {noformat}
> pyarrow/array.pxi:191: in pyarrow.lib.array
>     ???
> pyarrow/array.pxi:78: in pyarrow.lib._ndarray_to_array
>     ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >   ???
> E   pyarrow.lib.ArrowInvalid: Could not convert (2,[0,1],[0.1,1.1]) with type
> SparseVector: did not recognize Python value type when inferring an Arrow
> data type
> pyarrow/error.pxi:85: ArrowInvalid
> {noformat}
>
>
> My initial intention was to test if databricks.koalas' functionality is
> implemented, which took me to an error coming from pyarrow:
> {code:java}
> import pandas as pd
> import databricks.koalas as ks
> from pyspark.ml.linalg import SparseVector
> sparse_values = {0: 0.1, 1: 1.1}
> sparse_vector = SparseVector(len(sparse_values), sparse_values)
> pds = pd.Series(sparse_vector)
> kss = ks.Series(sparse_vector)
> {code}
> While pd.Series on the SparseVector works fine, the last line fails with:
> {noformat}
> databricks/koalas/typedef.py:176: in infer_pd_series_spark_type
>     return from_arrow_type(pa.Array.from_pandas(s).type)
> pyarrow/array.pxi:593: in pyarrow.lib.Array.from_pandas
>     ???
> pyarrow/array.pxi:191: in pyarrow.lib.array
>     ???
> pyarrow/array.pxi:78: in pyarrow.lib._ndarray_to_array
>     ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >   ???
> E   pyarrow.lib.ArrowInvalid: Could not convert (2,[0,1],[0.1,1.1]) with type
> SparseVector: did not recognize Python value type when inferring an Arrow
> data type
> pyarrow/error.pxi:85: ArrowInvalid
> {noformat}
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)