[jira] [Commented] (SPARK-34463) toPandas failed with error: buffer source array is read-only

David Li (Jira) Thu, 18 Feb 2021 05:36:04 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-34463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286478#comment-17286478
 ]


David Li commented on SPARK-34463:
----------------------------------

I'll take a look.

The reason for each of the three options is as follows:
 * use_threads=False - convert the columns one at a time to minimize memory usage (the overhead is one column's worth of memory at any time, instead of N columns' worth, where N is the parallelism level)
 * split_blocks=True - create a separate Pandas block for each column. This is an internal implementation detail, but it makes it more likely that PyArrow finds zero-copy opportunities, which further reduces memory usage and conversion overhead
 * self_destruct=True - release the PyArrow array after each column is converted to save memory (technically this only decrements a refcount, so if anything else still references the array, no memory is freed)

> toPandas failed with error: buffer source array is read-only
> ------------------------------------------------------------
>
>                 Key: SPARK-34463
>                 URL: https://issues.apache.org/jira/browse/SPARK-34463
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.2
>            Reporter: Weichen Xu
>            Priority: Major
>
> Environment:
> apache/spark master 
> pandas version > 1.0.5
> Reproduce code:
> {code}
> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
> spark.conf.set('spark.sql.execution.arrow.pyspark.selfDestruct.enabled', True)
> spark.createDataFrame(sc.parallelize([(i,) for i in range(13)], 1), 'id long').selectExpr('IF(id % 3==0, id+1, NULL) AS f1', '(id+1) % 2 AS label').toPandas()['label'].value_counts()
> {code}
> Get error like:
> {quote}Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/base.py", line 1033, in value_counts
>     dropna=dropna,
>   File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/algorithms.py", line 820, in value_counts
>     keys, counts = value_counts_arraylike(values, dropna)
>   File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/algorithms.py", line 865, in value_counts_arraylike
>     keys, counts = f(values, dropna)
>   File "pandas/_libs/hashtable_func_helper.pxi", line 1098, in pandas._libs.hashtable.value_count_int64
>   File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
>   File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
> ValueError: buffer source array is read-only
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
