[jira] [Commented] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2015-10-05, Jason C Lee (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-10847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943670#comment-14943670 ]

Jason C Lee commented on SPARK-10847:
-------------------------------------

You're welcome!

> Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-10847
>                 URL: https://issues.apache.org/jira/browse/SPARK-10847
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.5.0
>         Environment: Windows 7
>                      java version "1.8.0_60" (64bit)
>                      Python 3.4.x
>                      Standalone cluster mode (not local[n]; a full local cluster)
>            Reporter: Shea Parkes
>            Priority: Minor
>
> If the optional metadata passed to `pyspark.sql.types.StructField` includes a
> pythonic `None`, `pyspark.sql.SQLContext.createDataFrame` will fail with a
> very cryptic/unhelpful error.
> Here is a minimal reproducible example:
> {code:none}
> # Assumes sc exists
> from pyspark.sql import SQLContext
> import pyspark.sql.types as types
> sqlContext = SQLContext(sc)
> literal_metadata = types.StructType([
>     types.StructField(
>         'name',
>         types.StringType(),
>         nullable=True,
>         metadata={'comment': 'From accounting system.'}
>     ),
>     types.StructField(
>         'age',
>         types.IntegerType(),
>         nullable=True,
>         metadata={'comment': None}
>     ),
> ])
> literal_rdd = sc.parallelize([
>     ['Bob', 34],
>     ['Dan', 42],
> ])
> print(literal_rdd.take(2))
> failed_dataframe = sqlContext.createDataFrame(
>     literal_rdd,
>     literal_metadata,
> )
> {code}
> This produces the following ~stacktrace:
> {noformat}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "", line 28, in 
>   File "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\context.py", line 408, in createDataFrame
>     jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
>   File "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
>   File "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
>     return f(*a, **kw)
>   File "S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o757.applySchemaToPythonRDD.
> : java.lang.RuntimeException: Do not support type class scala.Tuple2.
>   at org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:160)
>   at org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:127)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.types.Metadata$.fromJObject(Metadata.scala:127)
>   at org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:173)
>   at org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
>   at org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:148)
>   at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:96)
>   at org.apache.spark.sql.SQLContext.parseDataType(SQLContext.scala:961)
>   at org.apache.spark.sql.SQLContext.applySchemaToPythonRDD(SQLContext.scala:970)
>   at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Unknown Source)
> {noformat}
> I believe the most important line of the traceback is this one:
> {noformat}
> py4j.protocol.Py4JJavaError: An error occurred while calling o757.applySchemaToPythonRDD.
> : java.lang.RuntimeException: Do not support type class scala.Tuple2.
> {noformat}

[jira] [Commented] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2015-10-02, Shea Parkes (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-10847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941952#comment-14941952 ]

Shea Parkes commented on SPARK-10847:
-------------------------------------

I appreciate your assistance!  I think your proposal is an improvement, but I
think it would be better if the failure were triggered upon creation of the
StructType object - that's where the error actually occurs.

The distance between the definition of the metadata and the import was much
larger in my project; I think your new error message would still have me
looking for NULL values in my data (instead of my metadata).  That's likely
partly my unfamiliarity with Scala, but I chased as far down the pyspark code
as I could and still didn't figure it out without trial and error.

I realize this might mean traversing an arbitrary dictionary in the StructType
initialization looking for disallowed types, which might be unacceptable.  It
would still be much more in line with the "Crash Early, Crash Often" philosophy
if it were possible to bomb at the creation of the metadata; a rough sketch of
what I mean follows.
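
To make the idea concrete, here is the kind of eager check I have in mind
(plain Python; the helper name is hypothetical, and the set of allowed types
is my guess at what the Scala side accepts):
{code:none}
def _validate_metadata_value(value, path='metadata'):
    """Reject metadata values the JVM side cannot parse, at schema-build time."""
    if isinstance(value, dict):
        for key, sub_value in value.items():
            _validate_metadata_value(sub_value, '%s[%r]' % (path, key))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            _validate_metadata_value(item, '%s[%d]' % (path, index))
    elif not isinstance(value, (str, int, float, bool)):
        raise TypeError(
            'Unsupported metadata value %r of type %s at %s; expected '
            'str, int, float, bool, dict, or list'
            % (value, type(value).__name__, path))

# If StructField.__init__ called this, the bad field would fail fast:
# _validate_metadata_value({'comment': None})
# TypeError: Unsupported metadata value None of type NoneType at metadata['comment']; ...
{code}
The dictionary walk is linear in the size of the metadata, which is tiny
compared to the data itself, so I doubt the traversal cost is a real objection.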

Thanks again for the assistance!

[jira] [Commented] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2015-10-02, Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-10847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941929#comment-14941929 ]

Apache Spark commented on SPARK-10847:
---------------------------------------

User 'jasoncl' has created a pull request for this issue:
https://github.com/apache/spark/pull/8969

[jira] [Commented] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2015-10-02, Shea Parkes (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-10847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941953#comment-14941953 ]

Shea Parkes commented on SPARK-10847:
-------------------------------------

My apologies; I just read your patch and see that you made it work even with
pythonic `None` values.  You rule, sir; thanks a bunch.

[jira] [Commented] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2015-09-28, Jason C Lee (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-10847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933707#comment-14933707 ]

Jason C Lee commented on SPARK-10847:
-------------------------------------

I would like to work on this.

[jira] [Commented] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2015-09-28, Shea Parkes (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-10847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933924#comment-14933924 ]

Shea Parkes commented on SPARK-10847:
-------------------------------------

This issue pushed me to learn enough Scala to find out what a scala.Tuple2
was, only to discover that the exception still wasn't helpful even then.

I'm not planning on doing any further work on this, so to the extent you were
waiting to avoid duplicating my effort, feel free to go ahead and knock it
out.  I'm not entirely familiar with the contribution guidelines, but I'm
sure you can work them out.

In case it wasn't clear above, the line that triggers the error is:
{code:none}
metadata={'comment': None}
{code}
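
Until a fix lands, the workaround I have settled on is simply dropping
`None`-valued keys before building the schema; they carry no information
anyway.  A minimal sketch, assuming you control the metadata dicts:
{code:none}
raw_metadata = {'comment': None, 'source': 'accounting system'}

# Drop keys whose value is None so the schema JSON sent to the JVM
# contains no nulls.
safe_metadata = {k: v for k, v in raw_metadata.items() if v is not None}
# safe_metadata == {'source': 'accounting system'}
{code}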

Thanks for the interest!

[jira] [Commented] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2015-09-28, Jason C Lee (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-10847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934044#comment-14934044 ]

Jason C Lee commented on SPARK-10847:
-------------------------------------

Instead of
{noformat}
Py4JJavaError: An error occurred while calling o757.applySchemaToPythonRDD.
: java.lang.RuntimeException: Do not support type class scala.Tuple2.
{noformat}
would it be more helpful if the error message were this:
{noformat}
Py4JJavaError: An error occurred while calling o76.applySchemaToPythonRDD.
: java.lang.RuntimeException: Do not support type class java.lang.String : class org.json4s.JsonAST$JNull$.
{noformat}
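
For context on where the scala.Tuple2 comes from: as far as I can tell, the
schema travels to the JVM as JSON (the `schema.json()` call in the traceback),
json4s hands Metadata.fromJObject each JSON object field as a (String, JValue)
pair, and an unhandled JNull value makes the whole pair fall through to the
catch-all that prints the pair's class.  You can watch the null survive into
the JSON from pure Python, no cluster needed (the printed layout below is
illustrative and may differ slightly by version):
{code:none}
import pyspark.sql.types as types

field = types.StructField('age', types.IntegerType(), True,
                          metadata={'comment': None})

# The Python-side None becomes a JSON null in the serialized schema.
print(field.json())
# Roughly: {"metadata":{"comment":null},"name":"age","nullable":true,"type":"integer"}
{code}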
