[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486581#comment-16486581
 ] 

Joel Croteau commented on SPARK-24358:
--------------------------------------

No, I mean the bytes type in Python 3. This code:
{code:java}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=b'Test string')]


def init_session():
    builder = SparkSession.builder.appName("Test bytes serialization")
    return builder.getOrCreate()


def main():
    spark = init_session()
    frame = spark.createDataFrame(TEST_DATA)
    frame.printSchema()
    print(frame.collect())


__name__ == '__main__' and main()
{code}
 Fails under Python 3 with this output:
{noformat}
Traceback (most recent call last):
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 1068, in _infer_type
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 1094, in _infer_schema
TypeError: Can not infer schema for type: <class 'bytes'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 18, in <module>
    __name__ == '__main__' and main()
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 13, in main
    frame = spark.createDataFrame(TEST_DATA)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", line 689, in createDataFrame
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", line 410, in _createFromLocal
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", line 342, in _inferSchemaFromList
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", line 342, in <genexpr>
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 1096, in _infer_schema
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 1096, in <listcomp>
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 1070, in _infer_type
TypeError: not supported type: <class 'bytes'>
{noformat}
but if I change the data type to bytearray:
{code}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=bytearray(b'Test string'))]


def init_session():
    builder = SparkSession.builder.appName("Use bytearray instead")
    return builder.getOrCreate()


def main():
    spark = init_session()
    frame = spark.createDataFrame(TEST_DATA)
    frame.printSchema()
    print(frame.collect())


__name__ == '__main__' and main()

{code}
it runs fine:
{noformat}
root
 |-- data: binary (nullable = true)

[Row(data=bytearray(b'Test string'))]
{noformat}
bytes in Python 3 is just an immutable version of bytearray, so createDataFrame should infer its type as binary and serialize it the same way it does bytearray.
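In the meantime, a minimal workaround sketch (the helper name {{coerce_bytes}} is my own, not part of any Spark API) is to wrap bytes values in bytearray before building Rows, since inference already handles bytearray:

```python
def coerce_bytes(d):
    """Hypothetical helper: wrap any bytes values in bytearray so Spark 2.3's
    schema inference can map them to BinaryType (which it already does for
    bytearray). Non-bytes values pass through unchanged."""
    return {k: bytearray(v) if isinstance(v, bytes) else v for k, v in d.items()}


# Example: b'Test string' becomes bytearray(b'Test string'); 1 is untouched.
row_dict = coerce_bytes({'data': b'Test string', 'n': 1})
```

The resulting dict can then be expanded into a Row, e.g. {{Row(**coerce_bytes(...))}}, and passed to createDataFrame as in the working example above.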

> createDataFrame in Python should be able to infer bytes type as Binary type
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-24358
>                 URL: https://issues.apache.org/jira/browse/SPARK-24358
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Joel Croteau
>            Priority: Minor
>
> createDataFrame can infer Python's bytearray type as a Binary. Since bytes is 
> just the immutable, hashable version of this same structure, it makes sense 
> for the same thing to apply there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
