[ https://issues.apache.org/jira/browse/ARROW-7986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049939#comment-17049939 ]
Nikolay Petrov edited comment on ARROW-7986 at 3/3/20 6:34 AM:
---------------------------------------------------------------

I'm trying to wrap my mind around it; I apologise if my description is not very clear. I'm new to _arrow_ (and to koalas, which relies on it).

_[koalas|https://github.com/databricks/koalas]_ implements the _pandas_ DataFrame API on top of Apache Spark, so a command like
{code:python}
pds = pd.Series(sparse_vector)
{code}
would eventually have an equivalent in _koalas_:
{code:python}
kss = ks.Series(sparse_vector)
{code}
In this case _koalas_ relies on _pyarrow_ to convert the *pyspark.ml.linalg*-specific *SparseVector*, but Arrow does not know what to do with it. The SparseVector comes from the "libsvm" format; [here|#linear-regression] is an example using that format.

Coming back to your question about what I would expect to happen: for the pd.Series defined above I get
{code:python}
>>> print(pds)
0    (0.1, 1.1)
dtype: object
>>> type(pds)
<class 'pandas.core.series.Series'>
{code}
Eventually, I'd expect a similar result for the ks.Series:
{code:python}
>>> print(kss)
0    (0.1, 1.1)
dtype: object
>>> type(kss)
<class 'databricks.koalas.series.Series'>
{code}
but that is instead where the error comes from.
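For what it's worth, one possible workaround until Arrow understands SparseVector is to densify the vector before building the Series. This is only a sketch, assuming pyspark and pyarrow are installed and that losing the sparse layout is acceptable:
{code:python}
import pandas as pd
import pyarrow as pa
from pyspark.ml.linalg import SparseVector

sparse_values = {0: 0.1, 1: 1.1}
sparse_vector = SparseVector(len(sparse_values), sparse_values)

# SparseVector.toArray() returns a dense numpy.ndarray of doubles, which
# pyarrow can infer; the sparse layout is lost in the conversion.
pds = pd.Series(sparse_vector.toArray())
arr = pa.array(pds)   # succeeds, with an inferred Arrow type
print(arr.type)       # double
{code}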
> [Python] pa.Array.from_pandas cannot convert pandas.Series containing pyspark.ml.linalg.SparseVector
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7986
>                 URL: https://issues.apache.org/jira/browse/ARROW-7986
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C
>    Affects Versions: 0.14.1, 0.16.0
>         Environment: macOS 10.15.3;
> setup following the contribution guidelines for koalas:
> https://koalas.readthedocs.io/en/latest/development/contributing.html
>            Reporter: Nikolay Petrov
>            Priority: Major
>
> The code
> {code:python}
> import pandas as pd
> from pyspark.ml.linalg import SparseVector
> import pyarrow as pa
>
> sparse_values = {0: 0.1, 1: 1.1}
> sparse_vector = SparseVector(len(sparse_values), sparse_values)
> pds = pd.Series(sparse_vector)
> pa.array(pds)
> {code}
> results in:
> {noformat}
> pyarrow/array.pxi:191: in pyarrow.lib.array
>     ???
> pyarrow/array.pxi:78: in pyarrow.lib._ndarray_to_array
>     ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>     ???
> E   pyarrow.lib.ArrowInvalid: Could not convert (2,[0,1],[0.1,1.1]) with type SparseVector: did not recognize Python value type when inferring an Arrow data type
> pyarrow/error.pxi:85: ArrowInvalid
> {noformat}
>
> My initial intention was to test whether databricks.koalas' functionality is implemented, which led me to an error coming from pyarrow:
> {code:python}
> import pandas as pd
> import databricks.koalas as ks
> from pyspark.ml.linalg import SparseVector
>
> sparse_values = {0: 0.1, 1: 1.1}
> sparse_vector = SparseVector(len(sparse_values), sparse_values)
> pds = pd.Series(sparse_vector)
> kss = ks.Series(sparse_vector)
> {code}
> While pd.Series on the SparseVector works fine, the last line errors as:
> {noformat}
> databricks/koalas/typedef.py:176: in infer_pd_series_spark_type
>     return from_arrow_type(pa.Array.from_pandas(s).type)
> pyarrow/array.pxi:593: in pyarrow.lib.Array.from_pandas
>     ???
> pyarrow/array.pxi:191: in pyarrow.lib.array
>     ???
> pyarrow/array.pxi:78: in pyarrow.lib._ndarray_to_array
>     ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>     ???
> E   pyarrow.lib.ArrowInvalid: Could not convert (2,[0,1],[0.1,1.1]) with type SparseVector: did not recognize Python value type when inferring an Arrow data type
> pyarrow/error.pxi:85: ArrowInvalid
> {noformat}
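As a related illustration (not something pyarrow or koalas does automatically): if each row of the Series holds one SparseVector, the vectors can be mapped to plain Python lists and handed to pa.Array.from_pandas with an explicit list<double> type, which sidesteps the failing type inference. A minimal sketch, assuming pyspark and pyarrow are installed:
{code:python}
import pandas as pd
import pyarrow as pa
from pyspark.ml.linalg import SparseVector

sv = SparseVector(2, {0: 0.1, 1: 1.1})
s = pd.Series([sv])   # one SparseVector per row

# Convert each vector to a plain list so no per-value inference is needed,
# and pass the target Arrow type explicitly.
arr = pa.Array.from_pandas(s.map(lambda v: v.toArray().tolist()),
                           type=pa.list_(pa.float64()))
print(arr.type)   # list<item: double>
{code}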