[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15596756#comment-15596756 ]

Frederick Reiss commented on SPARK-13534:
-----------------------------------------

We ([~bryanc], [~holdenk], [~yinxusen], and myself) are looking into this.
Here's a rough outline of the current planned approach:
- Add a dependency on Arrow 0.1's Java and Scala APIs to Spark.
- Add a new developer API method to Dataset, {{collectAsArrow()}}, that returns 
an array of byte arrays, where each byte array contains a block of records in 
Arrow format. The conversion to Arrow will be a streamlined version of the 
Parquet conversion in {{ParquetWriteSupport}} (minus all the callbacks and 
levels of indirection). Conversion of complex types (Struct, Array, Map) to 
Arrow will not be supported in this version.
- Modify PySpark's {{DataFrame.toPandas}} method to use the following logic (see the Python sketch after the pseudocode):
{noformat}
if (the schema of the DataFrame does not contain complex types)
    Call collectAsArrow() on the underlying Scala Dataset.
    Pull the resulting buffers of Arrow data over to the Python process.
    Use Arrow's Python APIs to convert the buffers into a single pandas DataFrame.
else
    Use the existing code as a slow-path conversion.
{noformat}
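
For illustration, here is a minimal Python sketch of that fast path. It assumes each byte array returned by {{collectAsArrow()}} is a self-contained Arrow IPC stream, and it uses pyarrow calls and a helper name ({{arrow_blocks_to_pandas}}) that are illustrative rather than the final implementation:

{noformat}
import pyarrow as pa

def arrow_blocks_to_pandas(arrow_blocks):
    """Combine serialized Arrow record batches into one pandas DataFrame."""
    batches = []
    for block in arrow_blocks:
        # Wrap the raw bytes and read the record batches they contain
        reader = pa.ipc.open_stream(pa.py_buffer(block))
        batches.extend(reader)  # the stream reader yields RecordBatch objects
    table = pa.Table.from_batches(batches)
    # to_pandas() fills nulls with None/NaN and copies as little as possible
    return table.to_pandas()
{noformat}

The PySpark side of {{toPandas}} would then only need to pull the serialized blocks over from the JVM and hand them to a helper like this whenever the schema qualifies for the fast path.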

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-13534
>                 URL: https://issues.apache.org/jira/browse/SPARK-13534
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Wes McKinney
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with small overhead compared with memcpy / 
> system memory bandwidth -- Arrow's null bitmasks must be examined, with the 
> corresponding null entries replaced by pandas's sentinel values (None or NaN 
> as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.
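
For reference, the existing slow path described above amounts to collecting every Row into the driver as Python objects and handing the list to pandas, roughly as follows (a sketch, not the exact PySpark source):

{noformat}
import pandas as pd

def slow_to_pandas(df):
    # collect() serializes every Row through the Python pickle path and
    # materializes the full result as a list of tuple-like objects in driver memory
    rows = df.collect()
    return pd.DataFrame.from_records(rows, columns=df.columns)
{noformat}

Every value passes through the pickle serializer and a Python object before it reaches a columnar representation, which is where the CPU and memory overhead comes from.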


