[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498414#comment-14498414 ]
Davies Liu commented on SPARK-6857: ----------------------------------- [~josephkb] Because the serializer do not support numpy types, we need to convert these numpy types into Python types, then I would suggest that let users to do it by themselves. Does it work for you? > Python SQL schema inference should support numpy types > ------------------------------------------------------ > > Key: SPARK-6857 > URL: https://issues.apache.org/jira/browse/SPARK-6857 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark, SQL > Affects Versions: 1.3.0 > Reporter: Joseph K. Bradley > > If you try to use SQL's schema inference to create a DataFrame out of a list > or RDD of numpy types (such as numpy.float64), SQL will not recognize the > numpy types. It would be handy if it did. > E.g.: > {code} > import numpy > from collections import namedtuple > from pyspark.sql import SQLContext > MyType = namedtuple('MyType', 'x') > myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) > sqlContext = SQLContext(sc) > data = sqlContext.createDataFrame(myValues) > {code} > The above code fails with: > {code} > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 331, in > createDataFrame > return self.inferSchema(data, samplingRatio) > File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 205, in > inferSchema > schema = self._inferSchema(rdd, samplingRatio) > File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 160, in > _inferSchema > schema = _infer_schema(first) > File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 660, in > _infer_schema > fields = [StructField(k, _infer_type(v), True) for k, v in items] > File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 637, in > _infer_type > raise ValueError("not supported type: %s" % type(obj)) > ValueError: not supported type: <type 'numpy.int64'> > {code} > But if we cast to int (not numpy types) first, it's OK: > {code} > myNativeValues = map(lambda x: MyType(int(x.x)), myValues) > data = sqlContext.createDataFrame(myNativeValues) # OK > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org