[jira] [Updated] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

Wes McKinney (JIRA) Sat, 27 Feb 2016 18:09:27 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wes McKinney updated SPARK-13534:
---------------------------------
    Description: 
The current code path for accessing Spark DataFrame data in Python using 
PySpark passes through an inefficient serialization-deserialiation process that 
I've examined at a high level here: 
https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] objects 
are being deserialized in pure Python as a list of tuples, which are then 
converted to pandas.DataFrame using its {{from_records}} alternate constructor. 
This also uses a large amount of memory.

For flat (no nested types) schemas, the Apache Arrow memory layout 
(https://github.com/apache/arrow/tree/master/format) can be deserialized to 
{{pandas.DataFrame}} objects with comparatively small overhead compared with 
memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
replacing the corresponding null values with pandas's sentinel values (None or 
NaN as appropriate).

I will be contributing patches to Arrow in the coming weeks for converting 
between Arrow and pandas in the general case, so if Spark can send Arrow memory 
to PySpark, we will hopefully be able to increase the Python data access 
throughput by an order of magnitude or more. I propose to add an new serializer 
for Spark DataFrame and a new method that can be invoked from PySpark to 
request a Arrow memory-layout byte stream, prefixed by a data header indicating 
array buffer offsets and sizes.

  was:
The current code path for accessing Spark DataFrame data in Python using 
PySpark passes through an inefficient serialization-deserialiation process that 
I've examined at a high level here: 
https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] objects 
are being deserialized in pure Python as a list of tuples, which are then 
converted to pandas.DataFrame using its {{from_records}} alternate constructor. 
This also uses a large amount of memory.

For flat (no nested types) schemas, the Apache Arrow memory layout 
(https://github.com/apache/arrow/tree/master/format) can be deserialized to 
{{pandas.DataFrame}} objects with comparatively small overhead compared with 
memcpy / system memory bandwidth -- Arrow's bitmasks must be examined the the 
null values in the data have to be replace with pandas's sentinel values (None 
or NaN as appropriate).

I will be contributing patches to Arrow in the coming weeks for converting 
between Arrow and pandas in the general case, so if Spark can send Arrow memory 
to PySpark, we will hopefully be able to increase the Python data access 
throughput by an order of magnitude or more. I propose to add an new serializer 
for Spark DataFrame and an method that can be invoked from PySpark to request a 
Arrow memory-layout byte stream, prefixed by a data header indicating array 
buffer offsets and sizes.


> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-13534
>                 URL: https://issues.apache.org/jira/browse/SPARK-13534
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Wes McKinney
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialiation process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add an new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request a Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

Reply via email to