[ https://issues.apache.org/jira/browse/SPARK-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326610#comment-14326610 ]

Michael Armbrust commented on SPARK-5896:
-----------------------------------------

Why not auto-assign column names by index, just like in Scala or in SQL? It's
weird for Python to be the only language that doesn't support this. The names can
be "c0" as in SQL if you are objecting to the underscore.

> toDF in python doesn't work with Strings
> ----------------------------------------
>
>                 Key: SPARK-5896
>                 URL: https://issues.apache.org/jira/browse/SPARK-5896
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Michael Armbrust
>            Assignee: Davies Liu
>            Priority: Critical
>
> {code}
> rdd = sc.parallelize(range(10)).map(lambda x: (str(x), x))
> kvdf = rdd.toDF()
> {code}
> {code}
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-18-327cb4e9a02e> in <module>()
>       1 rdd = sc.parallelize(range(10)).map(lambda x: (str(x), x))
> ----> 2 kvdf = rdd.toDF()
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in toDF(self, 
> schema, sampleRatio)
>      53         [Row(name=u'Alice', age=1)]
>      54         """
> ---> 55         return sqlCtx.createDataFrame(self, schema, sampleRatio)
>      56 
>      57     RDD.toDF = toDF
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in 
> createDataFrame(self, data, schema, samplingRatio)
>     395 
>     396         if schema is None:
> --> 397             return self.inferSchema(data, samplingRatio)
>     398 
>     399         if isinstance(schema, (list, tuple)):
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in 
> inferSchema(self, rdd, samplingRatio)
>     228             raise TypeError("Cannot apply schema to DataFrame")
>     229 
> --> 230         schema = self._inferSchema(rdd, samplingRatio)
>     231         converter = _create_converter(schema)
>     232         rdd = rdd.map(converter)
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in 
> _inferSchema(self, rdd, samplingRatio)
>     158 
>     159         if samplingRatio is None:
> --> 160             schema = _infer_schema(first)
>     161             if _has_nulltype(schema):
>     162                 for row in rdd.take(100)[1:]:
> /home/ubuntu/databricks/spark/python/pyspark/sql/types.pyc in 
> _infer_schema(row)
>     646             items = row
>     647         else:
> --> 648             raise ValueError("Can't infer schema from tuple")
>     649 
>     650     elif hasattr(row, "__dict__"):  # object
> ValueError: Can't infer schema from tuple
> {code}
> Nearly the same code works if you give names (and it works without names in 
> Scala, which calls the columns _1, _2, ...).
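The auto-naming behavior proposed in the comment above (Scala-style `_1`, `_2`, ... for bare tuples) can be sketched in plain Python. This is a hypothetical helper for illustration, not the actual PySpark fix; `infer_field_names` is an invented name, and the real logic would live in `pyspark.sql.types._infer_schema`:

```python
# Sketch of positional column naming during schema inference on bare tuples.
# Hypothetical helper -- illustrates the proposal, not the shipped PySpark code.

def infer_field_names(row):
    """Return column names for a row, falling back to Scala-style _1, _2, ..."""
    names = getattr(row, "_fields", None)  # namedtuples carry field names
    if names is not None:
        return list(names)
    if isinstance(row, (tuple, list)):
        # Bare tuples get positional names, matching Scala's _1, _2, ...
        return ["_%d" % (i + 1) for i in range(len(row))]
    raise ValueError("Can't infer schema from %r" % (row,))

print(infer_field_names(("0", 0)))  # a bare tuple gets ['_1', '_2']
```

With a fallback like this, the `("0", 0)` tuples produced by the reproduction above would no longer raise `ValueError`, and `toDF()` without arguments would behave the same across languages.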



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
