[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...

BryanCutler Wed, 26 Apr 2017 19:18:31 -0700

Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/15821
  
    Updated to work with the latest Arrow to prepare for the 0.3 release (tests 
should fail because that artifact is not yet available).  Also improved 
consistency of ArrowConverters and did some cleanup.  From @rxin's comments:
    
    > Move ArrowConverters.scala somewhere else that's not top level, e.g. 
execution.arrow
    
    It is now in the o.a.s.sql.execution.arrow package
    
    > Update this to arrow 0.3
    
    Ready to do this, should just need to update the pom again
    
    >Use SQLConf rather than a parameter for toPandas.
    
    I removed this flag and used the conf "spark.sql.execution.arrow.enable" 
which defaults to "false"
    
    >Handle failure gracefully if arrow is not installed (or somehow package it 
with Spark?)
    
    It would be difficult to package with Spark, I think, because pyarrow also 
depends on the native Arrow cpp library.  I changed it to fail gracefully if 
pyarrow is not available.  The error message is:
    ```
    ImportError: No module named pyarrow
    note: pyarrow must be installed and available on calling Python process if 
using spark.sql.execution.arrow.enable=true
    ```
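
    For illustration only, a minimal sketch of this kind of graceful failure. 
It is not the code in this PR: the helper name `require_module` and its 
`importer` parameter are hypothetical, and only the conf name and the wording 
of the note come from the message above.

```python
import importlib

# Conf name as quoted in this comment.
ARROW_CONF = "spark.sql.execution.arrow.enable"


def require_module(name="pyarrow", importer=importlib.import_module):
    """Import `name`, or raise ImportError with a note naming the conf.

    `require_module` and `importer` are hypothetical, for illustration only.
    """
    try:
        return importer(name)
    except ImportError as e:
        # Keep the original import error text and append the hint.
        raise ImportError(
            "%s\nnote: %s must be installed and available on calling Python "
            "process if using %s=true" % (e, name, ARROW_CONF)
        )
```

    A caller would run `require_module()` only when the conf is set to 
"true", so users who leave it at the default never need pyarrow.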
    
    >How are the memory managed? Who allocates the memory for the arrow 
records, and who's responsible for releasing them?
    
    The Java side of Arrow requires using a BufferAllocator class that manages 
the allocated memory.  An instance of this must be used each time an 
ArrowRecordBatch is created, and then the batch and allocator must be 
released/closed after they have been processed.  This is all handled in the 
`ArrowConverters` functions.  On the Python side, buffers are allocated from the 
Arrow cpp library and cleaned up when the objects' reference counts drop to 
zero.  The end user does not have to worry about managing any memory.
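
    The allocate, process, then release/close lifecycle described above can be 
sketched with a toy model.  This is not the Arrow API: `ToyAllocator`, 
`ToyBatch` and `convert` are hypothetical names, and the byte accounting only 
assumes that an allocator tracks outstanding memory and complains on close if 
something was never released.

```python
class ToyAllocator(object):
    """Hypothetical stand-in for an allocator that tracks outstanding bytes."""

    def __init__(self):
        self.allocated = 0

    def buffer(self, size):
        # Account for the memory handed out to a new batch.
        self.allocated += size
        return ToyBatch(self, size)

    def close(self):
        # Closing with memory still outstanding signals a leak.
        if self.allocated != 0:
            raise RuntimeError("leaked %d bytes" % self.allocated)


class ToyBatch(object):
    """Hypothetical stand-in for a record batch backed by allocator memory."""

    def __init__(self, allocator, size):
        self.allocator = allocator
        self.size = size
        self.closed = False

    def close(self):
        # Return the memory to the allocator exactly once.
        if not self.closed:
            self.allocator.allocated -= self.size
            self.closed = True


def convert(payload_sizes):
    """Create one batch per size, process it, and release everything."""
    allocator = ToyAllocator()
    processed = []
    try:
        for size in payload_sizes:
            batch = allocator.buffer(size)
            try:
                processed.append(batch.size)  # stand-in for real work
            finally:
                batch.close()  # release the batch even if processing fails
    finally:
        allocator.close()  # raises if any batch was left open
    return processed
```

    Keeping both `close()` calls inside the conversion function is what lets 
the caller ignore memory management entirely.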



