[jira] [Created] (SPARK-8450) PySpark write.parquet raises Unsupported datatype DecimalType()

2015-06-18 Thread Peter Hoffmann (JIRA)
Peter Hoffmann created SPARK-8450:
-

 Summary: PySpark write.parquet raises Unsupported datatype DecimalType()
 Key: SPARK-8450
 URL: https://issues.apache.org/jira/browse/SPARK-8450
 Project: Spark
  Issue Type: Bug
 Environment: Spark 1.4.0 on Debian
Reporter: Peter Hoffmann


I'm getting an exception when I try to save a DataFrame as a parquet file.

Minimal Example:

{code}
from decimal import Decimal
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)
schema = StructType([
    StructField('id', LongType()),
    StructField('value', DecimalType())])
rdd = sc.parallelize([[1, Decimal("0.5")], [2, Decimal("2.9")]])
df = sqlContext.createDataFrame(rdd, schema)
df.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')
{code}
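
Declaring the column with an explicit precision and scale should avoid the bare DecimalType()'s unlimited precision, which the 1.4 Parquet converter appears unable to map; a minimal sketch of that workaround (an assumption, not verified on this cluster, reusing the same placeholder path):

{code}
from decimal import Decimal
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, LongType, DecimalType

sqlContext = SQLContext(sc)
# DecimalType(16, 2) pins precision and scale instead of the bare
# DecimalType(), whose unlimited precision triggers the error below.
schema = StructType([
    StructField('id', LongType()),
    StructField('value', DecimalType(16, 2))])
rdd = sc.parallelize([[1, Decimal("0.50")], [2, Decimal("2.90")]])
df = sqlContext.createDataFrame(rdd, schema)
df.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')
{code}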

Stack Trace

{code}
---
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 sr.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')

/home/spark/spark-1.4.0-bin-hadoop2.6/python/pyspark/sql/readwriter.pyc in parquet(self, path, mode)
367 :param mode: one of `append`, `overwrite`, `error`, `ignore` (default: error)
368 """
--> 369 return self._jwrite.mode(mode).parquet(path)
370 
371 @since(1.4)

/home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:

/home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling o361.parquet.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:138)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:114)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:939)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:939)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:332)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:281)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 158 in stage 35.0 failed 4 times, most recent failure: Lost task 158.3 in stage 35.0 (TID 2736, 10.2.160.14): java.lang.RuntimeException: Unsupported datatype DecimalType()
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:374)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply
{code}

[jira] [Updated] (SPARK-8450) PySpark write.parquet raises Unsupported datatype DecimalType()

2015-06-18 Thread Peter Hoffmann (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Hoffmann updated SPARK-8450:
--
Description: 
I'm getting an exception when I try to save a DataFrame with a DecimalType as a parquet file.

Minimal Example:

{code}
from decimal import Decimal
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)
schema = StructType([
    StructField('id', LongType()),
    StructField('value', DecimalType())])
rdd = sc.parallelize([[1, Decimal("0.5")], [2, Decimal("2.9")]])
df = sqlContext.createDataFrame(rdd, schema)
df.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')
{code}

Stack Trace

{code}
---
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 sr.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')

/home/spark/spark-1.4.0-bin-hadoop2.6/python/pyspark/sql/readwriter.pyc in parquet(self, path, mode)
367 :param mode: one of `append`, `overwrite`, `error`, `ignore` (default: error)
368 """
--> 369 return self._jwrite.mode(mode).parquet(path)
370 
371 @since(1.4)

/home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:

/home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling o361.parquet.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:138)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:114)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
    at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:939)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:939)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:332)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:281)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 158 in stage 35.0 failed 4 times, most recent failure: Lost task 158.3 in stage 35.0 (TID 2736, 10.2.160.14): java.lang.RuntimeException: Unsupported datatype DecimalType()
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:374)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$fromDataType$2.apply(ParquetTypes.scala:318)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes
{code}

[jira] [Commented] (SPARK-8450) PySpark write.parquet raises Unsupported datatype DecimalType()

2015-07-12 Thread Peter Hoffmann (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623999#comment-14623999 ]

Peter Hoffmann commented on SPARK-8450:
---

I have tried it with today's spark-1.5.0-SNAPSHOT-bin-hadoop2.6 nightly build from http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/ and was able to save DecimalType(16,2) as parquet from Python.

Thanks for the quick fix!
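
A quick way to double-check the fix is to read the file back and inspect the schema; a minimal sketch, assuming the same sqlContext, a DecimalType(16,2) column, and the placeholder path from the report:

{code}
# Read the parquet file back and confirm the decimal column round-trips;
# the path and the decimal(16,2) schema are assumptions from the report above.
df2 = sqlContext.read.parquet("hdfs://srv:9000/user/ph/decimal.parquet")
df2.printSchema()                      # expect: value: decimal(16,2)
print(df2.orderBy('id').collect())
{code}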



> PySpark write.parquet raises Unsupported datatype DecimalType()
> ---
>
> Key: SPARK-8450
> URL: https://issues.apache.org/jira/browse/SPARK-8450
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
> Environment: Spark 1.4.0 on Debian
>Reporter: Peter Hoffmann
>Assignee: Davies Liu
> Fix For: 1.5.0
>
>
> I'm getting an exception when I try to save a DataFrame with a DecimalType as a parquet file.
> Minimal Example:
> {code}
> from decimal import Decimal
> from pyspark.sql import SQLContext
> from pyspark.sql.types import *
> sqlContext = SQLContext(sc)
> schema = StructType([
>     StructField('id', LongType()),
>     StructField('value', DecimalType())])
> rdd = sc.parallelize([[1, Decimal("0.5")], [2, Decimal("2.9")]])
> df = sqlContext.createDataFrame(rdd, schema)
> df.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')
> {code}
> Stack Trace
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sr.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite')
> /home/spark/spark-1.4.0-bin-hadoop2.6/python/pyspark/sql/readwriter.pyc in parquet(self, path, mode)
> 367 :param mode: one of `append`, `overwrite`, `error`, `ignore` (default: error)
> 368 """
> --> 369 return self._jwrite.mode(mode).parquet(path)
> 370 
> 371 @since(1.4)
> /home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling o361.parquet.
> : org.apache.spark.SparkException: Job aborted.
>     at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:138)
>     at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:114)
>     at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>     at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>     at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
>     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
>     at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:939)
>     at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:939)
>     at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:332)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
>     at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:281)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>     at py4j.Gateway.invoke(Gateway.java:259)
>     at py4j.commands.AbstractCommand.invokeMet