[jira] [Commented] (SPARK-22324) Upgrade Arrow to version 0.8.0

2017-12-26 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304220#comment-16304220 ]

Apache Spark commented on SPARK-22324:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20089

> Upgrade Arrow to version 0.8.0
> ------------------------------
>
>                 Key: SPARK-22324
>                 URL: https://issues.apache.org/jira/browse/SPARK-22324
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>             Fix For: 2.3.0
>
>
> Arrow version 0.8.0 is slated for release in early November, but I'd like to
> start discussing now so we can sync up all the work that's being done.
> Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test
> envs will need to be upgraded as well, which will take a fair amount of work
> and planning.
> One topic I'd like to discuss is whether pyarrow should be an installation
> requirement for pyspark, i.e. when a user pip installs pyspark, it would also
> install pyarrow (see the packaging sketch after this description). If not,
> then is there a minimum version that needs to be supported? We currently
> have 0.4.1 installed on Jenkins.
> There are a number of improvements and cleanups in the current code that can 
> happen depending on what we decide (I'll link them all here later, but off 
> the top of my head):
> * Decimal bug fix and improved support
> * Improved internal casting between pyarrow and pandas (can clean up some
> workarounds); this will also verify data bounds when the user specifies a
> type and the data overflows (see the casting sketch below and
> https://github.com/apache/spark/pull/19459#discussion_r146421804)
> * Better type checking when converting Spark types to Arrow
> * Timestamp conversion to microseconds, Spark's internal format (see the
> timestamp sketch below)
> * Full support for using a validity mask with 'object' types (see the mask
> sketch below and
> https://github.com/apache/spark/pull/18664#discussion_r146567335)
> * Allow VectorSchemaRoot.close() to be called more than once, to simplify the
> listener in
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L90
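
Packaging sketch: if pyarrow ends up optional rather than a hard requirement,
one conventional way to express that is a setuptools extra. This is only a
sketch; the 'sql' extra name and the version pins are illustrative
assumptions, not anything decided on this issue.

    # Hypothetical excerpt from pyspark's setup.py; the extra name and
    # version pins are illustrative, not from this issue.
    from setuptools import setup

    setup(
        name='pyspark',
        version='2.3.0.dev0',
        packages=['pyspark'],
        install_requires=['py4j'],  # core dependencies stay lean
        extras_require={
            # pyarrow (and pandas) are pulled in only on request
            'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0'],
        },
    )

With an extra like that, a plain 'pip install pyspark' stays lightweight,
while 'pip install pyspark[sql]' also installs pyarrow.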
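
Casting sketch: a minimal illustration of the bounds check from the pyarrow
side, assuming a pyarrow version where Array.from_pandas accepts a target
type and casts safely by default (the series values and the int8 target are
made up; exact behavior on 0.8.0 may differ).

    import pandas as pd
    import pyarrow as pa

    series = pd.Series([1, 2, 1000])  # 1000 does not fit in int8

    try:
        # With an explicit type, a safe conversion refuses to overflow
        # silently and raises instead.
        pa.Array.from_pandas(series, type=pa.int8())
    except pa.ArrowInvalid as err:
        print("overflow detected:", err)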
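
Timestamp sketch: Spark stores timestamps internally at microsecond
resolution, while pandas produces nanoseconds, so the Arrow array needs a
cast. The data below is made up and already at microsecond precision, so the
default safe cast succeeds; truncating finer values would need safe=False.

    import pandas as pd
    import pyarrow as pa

    ts = pd.Series(pd.to_datetime(['2017-11-01 12:00:00.123456']))
    arr = pa.Array.from_pandas(ts)         # arrives as timestamp[ns]
    micros = arr.cast(pa.timestamp('us'))  # Spark-internal resolution
    print(micros.type)                     # timestamp[us]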
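
Mask sketch: for 'object' dtype columns, nullness can be handed to pyarrow
explicitly as a boolean mask rather than inferred from the values, assuming
from_pandas's mask= argument where True marks a null slot. The data is
illustrative.

    import numpy as np
    import pandas as pd
    import pyarrow as pa

    values = pd.Series(['a', None, 'c'], dtype=object)
    mask = np.array([False, True, False])  # True == null slot

    arr = pa.Array.from_pandas(values, mask=mask, type=pa.string())
    print(arr.null_count)  # 1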





[jira] [Commented] (SPARK-22324) Upgrade Arrow to version 0.8.0

2017-12-04 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277755#comment-16277755 ]

Apache Spark commented on SPARK-22324:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/19884







[jira] [Commented] (SPARK-22324) Upgrade Arrow to version 0.8.0

2017-11-15 Thread Bryan Cutler (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16254054#comment-16254054 ]

Bryan Cutler commented on SPARK-22324:
--

I started working on this to test out the latest changes in Arrow Java; I will
submit a WIP PR soon.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org