[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16254054#comment-16254054 ]
Bryan Cutler commented on SPARK-22324: -------------------------------------- I started working on this to test out latest changes in Arrow Java, will submit a WIP PR soon > Upgrade Arrow to version 0.8.0 > ------------------------------ > > Key: SPARK-22324 > URL: https://issues.apache.org/jira/browse/SPARK-22324 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL > Affects Versions: 2.3.0 > Reporter: Bryan Cutler > > Arrow version 0.8.0 is slated for release in early November, but I'd like to > start discussing to help get all the work that's being done synced up. > Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test > envs will need to be upgraded as well that will take a fair amount of work > and planning. > One topic I'd like to discuss is if pyarrow should be an installation > requirement for pyspark, i.e. when a user pip installs pyspark, it will also > install pyarrow. If not, then is there a minimum version that needs to be > supported? We currently have 0.4.1 installed on Jenkins. > There are a number of improvements and cleanups in the current code that can > happen depending on what we decide (I'll link them all here later, but off > the top of my head): > * Decimal bug fix and improved support > * Improved internal casting between pyarrow and pandas (can clean up some > workarounds), this will also verify data bounds if the user specifies a type > and data overflows. see > https://github.com/apache/spark/pull/19459#discussion_r146421804 > * Better type checking when converting Spark types to Arrow > * Timestamp conversion to microseconds (for Spark internal format) > * Full support for using validity mask with 'object' types > https://github.com/apache/spark/pull/18664#discussion_r146567335 > * VectorSchemaRoot can call close more than once to simplify listener > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L90 -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org