[ https://issues.apache.org/jira/browse/SPARK-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabe Mulley updated SPARK-5138:
-------------------------------
    Description: 
When attempting to infer the schema of an RDD that contains namedtuples,
pyspark fails to identify the records as namedtuples and raises an error.

Example:
{noformat}
from pyspark import SparkContext
from pyspark.sql import SQLContext
from collections import namedtuple
import os

sc = SparkContext()
rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))
TextLine = namedtuple('TextLine', 'line length')
tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
tuple_rdd.take(5)  # This works

sqlc = SQLContext(sc)

# The following line raises an error
schema_rdd = sqlc.inferSchema(tuple_rdd)
{noformat}

The error raised is:
{noformat}
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
    process()
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
    yield next(iterator)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
    raise ValueError("unexpected tuple: %s" % obj)
TypeError: not all arguments converted during string formatting
{noformat}
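Note that the final exception is a TypeError from string formatting, not the intended ValueError: when the right-hand side of {{%}} is itself a tuple (as a namedtuple is), Python unpacks it as multiple format arguments. The following standalone sketch (not Spark code; the names are illustrative) reproduces how the error message gets masked and how wrapping the value in a 1-tuple avoids it:

{noformat}
from collections import namedtuple

TextLine = namedtuple('TextLine', 'line length')
obj = TextLine(line='hello', length=5)

try:
    # A namedtuple is a tuple, so it is unpacked as two format arguments
    # against a single %s placeholder.
    "unexpected tuple: %s" % obj
except TypeError as e:
    masked = str(e)  # "not all arguments converted during string formatting"

# Wrapping the value in a 1-tuple formats it as a single value instead:
ok = "unexpected tuple: %s" % (obj,)
{noformat}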



  was:
When attempting to infer the schema of an RDD that contains namedtuples,
pyspark fails to identify the records as namedtuples and raises an error.

Example:
{code:python}
from pyspark import SparkContext
from pyspark.sql import SQLContext
from collections import namedtuple
import os

sc = SparkContext()
rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))
TextLine = namedtuple('TextLine', 'line length')
tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
tuple_rdd.take(5)  # This works

sqlc = SQLContext(sc)

# The following line raises an error
schema_rdd = sqlc.inferSchema(tuple_rdd)
{code}

The error raised is:
{noformat}
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
    process()
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
    yield next(iterator)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
    raise ValueError("unexpected tuple: %s" % obj)
TypeError: not all arguments converted during string formatting
{noformat}




> pyspark unable to infer schema of namedtuple
> --------------------------------------------
>
>                 Key: SPARK-5138
>                 URL: https://issues.apache.org/jira/browse/SPARK-5138
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.2.0
>            Reporter: Gabe Mulley
>            Priority: Trivial
>
> When attempting to infer the schema of an RDD that contains namedtuples,
> pyspark fails to identify the records as namedtuples and raises an error.
> Example:
> {noformat}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from collections import namedtuple
> import os
> sc = SparkContext()
> rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))
> TextLine = namedtuple('TextLine', 'line length')
> tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
> tuple_rdd.take(5)  # This works
> sqlc = SQLContext(sc)
> # The following line raises an error
> schema_rdd = sqlc.inferSchema(tuple_rdd)
> {noformat}
> The error raised is:
> {noformat}
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
>     process()
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
>     vs = list(itertools.islice(iterator, batch))
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
>     yield next(iterator)
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
>     raise ValueError("unexpected tuple: %s" % obj)
> TypeError: not all arguments converted during string formatting
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
