[ https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486581#comment-16486581 ]
Joel Croteau commented on SPARK-24358:
--------------------------------------

No, I mean the bytes type in Python 3. This code:
{code:java}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=b'Test string')]

def init_session():
    builder = SparkSession.builder.appName("Test bytes serialization")
    return builder.getOrCreate()

def main():
    spark = init_session()
    frame = spark.createDataFrame(TEST_DATA)
    frame.printSchema()
    print(frame.collect())

__name__ == '__main__' and main()
{code}
fails under Python 3 with this output:
{noformat}
Traceback (most recent call last):
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 1068, in _infer_type
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 1094, in _infer_schema
TypeError: Can not infer schema for type: <class 'bytes'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 18, in <module>
    __name__ == '__main__' and main()
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 13, in main
    frame = spark.createDataFrame(TEST_DATA)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", line 689, in createDataFrame
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", line 410, in _createFromLocal
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", line 342, in _inferSchemaFromList
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", line 342, in <genexpr>
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 1096, in _infer_schema
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 1096, in <listcomp>
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", line 1070, in _infer_type
TypeError: not supported
type: <class 'bytes'>
{noformat}
but if I change the data type to bytearray:
{code}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=bytearray(b'Test string'))]

def init_session():
    builder = SparkSession.builder.appName("Use bytearray instead")
    return builder.getOrCreate()

def main():
    spark = init_session()
    frame = spark.createDataFrame(TEST_DATA)
    frame.printSchema()
    print(frame.collect())

__name__ == '__main__' and main()
{code}
it runs fine:
{noformat}
root
 |-- data: binary (nullable = true)

[Row(data=bytearray(b'Test string'))]
{noformat}
bytes in Python 3 is just an immutable version of bytearray, so Spark should infer it as binary and serialize it the same way it does bytearray.

> createDataFrame in Python should be able to infer bytes type as Binary type
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-24358
>                 URL: https://issues.apache.org/jira/browse/SPARK-24358
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Joel Croteau
>            Priority: Minor
>
> createDataFrame can infer Python's bytearray type as a Binary. Since bytes is
> just the immutable, hashable version of this same structure, it makes sense
> for the same thing to apply there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
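Until schema inference accepts bytes directly, the comparison above suggests a workaround: coerce any bytes values to bytearray before calling createDataFrame. A minimal sketch of that idea — the helper names `coerce_bytes` and `coerce_row` are hypothetical, not Spark API:

```python
# Hypothetical helpers (not part of Spark): coerce plain bytes values to
# bytearray so that Spark 2.3's schema inference maps them to binary.

def coerce_bytes(value):
    # In Python 3, bytearray is not a subclass of bytes, so bytearray
    # values (already accepted by Spark) pass through unchanged.
    return bytearray(value) if isinstance(value, bytes) else value

def coerce_row(fields):
    """Return a copy of a dict of row data with bytes coerced to bytearray."""
    return {name: coerce_bytes(value) for name, value in fields.items()}

# Usage sketch, assuming an active SparkSession `spark`:
#   rows = [Row(**coerce_row({'data': b'Test string'}))]
#   spark.createDataFrame(rows).printSchema()
```

This only papers over the inference gap for top-level fields; a proper fix in `_infer_type` would handle bytes wherever bytearray is handled today.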