[ https://issues.apache.org/jira/browse/SPARK-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabe Mulley updated SPARK-5138:
-------------------------------
    Description: 
When attempting to infer the schema of an RDD that contains namedtuples, PySpark fails to identify the records as namedtuples and raises an error.

Example:
{noformat}
from pyspark import SparkContext
from pyspark.sql import SQLContext
from collections import namedtuple
import os

sc = SparkContext()
rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))

TextLine = namedtuple('TextLine', 'line length')
tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
tuple_rdd.take(5)  # This works

sqlc = SQLContext(sc)

# The following line raises an error
schema_rdd = sqlc.inferSchema(tuple_rdd)
{noformat}

The error raised is:
{noformat}
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
    process()
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
    yield next(iterator)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
    raise ValueError("unexpected tuple: %s" % obj)
TypeError: not all arguments converted during string formatting
{noformat}

  was:
When attempting to infer the schema of an RDD that contains namedtuples, PySpark fails to identify the records as namedtuples and raises an error.
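As a side note on the traceback: it ends in a TypeError rather than the ValueError that sql.py tries to raise, because Python's %-formatting treats a tuple on the right-hand side as an argument list, so formatting the record itself blows up before the ValueError is constructed. A minimal sketch of the masking effect, using plain Python (no Spark) and a made-up two-field tuple standing in for the record:

```python
# 'obj' stands in for the namedtuple record that sql.py was trying to report.
# A namedtuple is a tuple subclass, so %-formatting unpacks it as arguments.
obj = ("some text line", 14)

try:
    raise ValueError("unexpected tuple: %s" % obj)  # same pattern as sql.py line 816
except TypeError as e:
    # "%s" consumes only the first element; the leftover element triggers a
    # TypeError, so the intended ValueError is never raised.
    print(e)  # not all arguments converted during string formatting

# Wrapping the tuple in a 1-tuple formats it as intended:
print("unexpected tuple: %s" % (obj,))
```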
Example:
{code:python}
from pyspark import SparkContext
from pyspark.sql import SQLContext
from collections import namedtuple
import os

sc = SparkContext()
rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))

TextLine = namedtuple('TextLine', 'line length')
tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
tuple_rdd.take(5)  # This works

sqlc = SQLContext(sc)

# The following line raises an error
schema_rdd = sqlc.inferSchema(tuple_rdd)
{code}

The error raised is:
{noformat}
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
    process()
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
    yield next(iterator)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
    raise ValueError("unexpected tuple: %s" % obj)
TypeError: not all arguments converted during string formatting
{noformat}


> pyspark unable to infer schema of namedtuple
> --------------------------------------------
>
>                 Key: SPARK-5138
>                 URL: https://issues.apache.org/jira/browse/SPARK-5138
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.2.0
>            Reporter: Gabe Mulley
>            Priority: Trivial
>
> When attempting to infer the schema of an RDD that contains namedtuples,
> PySpark fails to identify the records as namedtuples and raises an error.
> Example:
> {noformat}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from collections import namedtuple
> import os
>
> sc = SparkContext()
> rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))
>
> TextLine = namedtuple('TextLine', 'line length')
> tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
> tuple_rdd.take(5)  # This works
>
> sqlc = SQLContext(sc)
>
> # The following line raises an error
> schema_rdd = sqlc.inferSchema(tuple_rdd)
> {noformat}
> The error raised is:
> {noformat}
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
>     process()
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
>     vs = list(itertools.islice(iterator, batch))
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
>     yield next(iterator)
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
>     raise ValueError("unexpected tuple: %s" % obj)
> TypeError: not all arguments converted during string formatting
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org