[
https://issues.apache.org/jira/browse/SPARK-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326610#comment-14326610
]
Michael Armbrust commented on SPARK-5896:
-----------------------------------------
Why not auto-assign column names by index, just like in Scala or in SQL? It's
weird for Python to be the only language that doesn't support this. The names
can be "c0", "c1", ... as in SQL if you are objecting to the underscore.
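A minimal sketch of what index-based auto-naming could look like, covering both conventions mentioned above (the helper name and signature are illustrative, not Spark's actual implementation):

```python
def auto_column_names(n, prefix="_", start=1):
    """Generate positional column names for an n-field tuple row.

    Defaults produce Scala-style names (_1, _2, ...); pass prefix="c",
    start=0 for SQL-style names (c0, c1, ...).
    """
    return [prefix + str(i) for i in range(start, start + n)]

# Scala-style, as toDF() does on the Scala side:
print(auto_column_names(2))                     # ['_1', '_2']
# SQL-style alternative suggested in the comment:
print(auto_column_names(2, prefix="c", start=0))  # ['c0', 'c1']
```

With a helper like this, schema inference could fall back to generated names when a plain tuple carries no field names, instead of raising.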
> toDF in Python doesn't work with strings
> ----------------------------------------
>
> Key: SPARK-5896
> URL: https://issues.apache.org/jira/browse/SPARK-5896
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Michael Armbrust
> Assignee: Davies Liu
> Priority: Critical
>
> {code}
> rdd = sc.parallelize(range(10)).map(lambda x: (str(x), x))
> kvdf = rdd.toDF()
> {code}
> {code}
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
> <ipython-input-18-327cb4e9a02e> in <module>()
> 1 rdd = sc.parallelize(range(10)).map(lambda x: (str(x), x))
> ----> 2 kvdf = rdd.toDF()
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in toDF(self,
> schema, sampleRatio)
> 53 [Row(name=u'Alice', age=1)]
> 54 """
> ---> 55 return sqlCtx.createDataFrame(self, schema, sampleRatio)
> 56
> 57 RDD.toDF = toDF
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in
> createDataFrame(self, data, schema, samplingRatio)
> 395
> 396 if schema is None:
> --> 397 return self.inferSchema(data, samplingRatio)
> 398
> 399 if isinstance(schema, (list, tuple)):
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in
> inferSchema(self, rdd, samplingRatio)
> 228 raise TypeError("Cannot apply schema to DataFrame")
> 229
> --> 230 schema = self._inferSchema(rdd, samplingRatio)
> 231 converter = _create_converter(schema)
> 232 rdd = rdd.map(converter)
> /home/ubuntu/databricks/spark/python/pyspark/sql/context.pyc in
> _inferSchema(self, rdd, samplingRatio)
> 158
> 159 if samplingRatio is None:
> --> 160 schema = _infer_schema(first)
> 161 if _has_nulltype(schema):
> 162 for row in rdd.take(100)[1:]:
> /home/ubuntu/databricks/spark/python/pyspark/sql/types.pyc in
> _infer_schema(row)
> 646 items = row
> 647 else:
> --> 648 raise ValueError("Can't infer schema from tuple")
> 649
> 650 elif hasattr(row, "__dict__"): # object
> ValueError: Can't infer schema from tuple
> {code}
> Nearly the same code works if you provide column names (and in Scala this
> works without names, calling the columns _1, _2, ...).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]