[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler updated SPARK-22324:
---------------------------------
    Description: 
Arrow version 0.8.0 is slated for release in early November, but I'd like to 
start the discussion now to help get all the in-flight work synced up.

Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test 
environments will need to be upgraded as well, which will take a fair amount 
of work and planning.

One topic I'd like to discuss is whether pyarrow should be an installation 
requirement for pyspark, i.e. when a user pip installs pyspark, pyarrow would 
be installed as well.  If not, what is the minimum version that needs to be 
supported?  We currently have 0.4.1 installed on Jenkins.
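
For reference, if pyarrow ends up optional rather than required, here is a 
minimal sketch of what that could look like in pyspark's setup.py, using a 
setuptools extra (the extra name and version pin below are placeholders, not a 
decided policy):

{code:python}
# Hypothetical setup.py excerpt: pyarrow as an opt-in extra rather than
# a hard install requirement.
from setuptools import setup

setup(
    name='pyspark',
    version='2.3.0.dev0',         # placeholder
    packages=['pyspark'],
    install_requires=['py4j'],    # required either way
    extras_require={
        # pulled in only by: pip install pyspark[arrow]
        'arrow': ['pyarrow>=0.8.0'],
    },
)
{code}

With an extra, a plain pip install pyspark stays lightweight and users opt in 
with pip install pyspark[arrow].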

There are a number of improvements and cleanups in the current code that can 
happen depending on what we decide (I'll link them all here later, but off the 
top of my head):

* Decimal bug fix and improved support
* Improved internal casting between pyarrow and pandas (lets us clean up some 
workarounds); this will also verify data bounds when the user specifies a type 
and the data overflows, as shown in the sketch after this list.  See 
https://github.com/apache/spark/pull/19459#discussion_r146421804
* Better type checking when converting Spark types to Arrow
* Timestamp conversion to microseconds (Spark's internal format)
* Full support for using a validity mask with 'object' types (also sketched 
after this list) 
https://github.com/apache/spark/pull/18664#discussion_r146567335
* Make VectorSchemaRoot.close() safe to call more than once, to simplify the 
listener in ArrowConverters: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L90
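
To make the casting, timestamp, and validity-mask items above concrete, here 
is a minimal pyarrow sketch.  It is written against pyarrow's documented 
Array.from_pandas(obj, mask=None, type=None); the exact exception type and the 
cast behavior on the 0.8.0-era API are assumptions on my part.  The point is 
that an explicit target type turns overflow into an error instead of silent 
truncation:

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

s = pd.Series([1, 2, 300])

# 300 fits in int16, so this succeeds.
ok = pa.Array.from_pandas(s, type=pa.int16())

# 300 does not fit in int8; pyarrow raises instead of wrapping around.
try:
    pa.Array.from_pandas(s, type=pa.int8())
except pa.lib.ArrowInvalid as e:  # exact exception type is an assumption
    print('overflow caught:', e)

# Timestamps: Spark stores microseconds internally, so pandas
# datetime64[ns] values would be cast down to timestamp('us').
ts = pd.Series(pd.to_datetime(['2017-10-20 12:00:00']))
print(pa.Array.from_pandas(ts, type=pa.timestamp('us')).type)  # timestamp[us]

# Validity mask with 'object' dtype: True marks a slot as null.
objs = pd.Series(['a', 'b', 'c'], dtype=object)
masked = pa.Array.from_pandas(objs, mask=np.array([False, True, False]))
print(masked.null_count)  # 1
{code}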


> Upgrade Arrow to version 0.8.0
> ------------------------------
>
>                 Key: SPARK-22324
>                 URL: https://issues.apache.org/jira/browse/SPARK-22324
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler


