Hi, I have a spark cluster setup and I am trying to write the data to s3 but in parquet format. Here is what I am doing
df = sqlContext.load('test', 'com.databricks.spark.avro') df.saveAsParquetFile("s3n://test") But I get some nasty error: Py4JJavaError: An error occurred while calling o29.saveAsParquetFile. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:166) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:139) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:950) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:950) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:336) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135) at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:1508) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:744) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 12, srv-110-29.720.rdio): org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org $apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:191) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.VerifyError: Bad type on operand stack Exception Details: Location: org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.initialize(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)V @38: invokespecial Reason: Type 'org/jets3t/service/security/AWSCredentials' (current frame, stack[3]) is not assignable to 'org/jets3t/service/security/ProviderCredentials' Current Frame: bci: @38 flags: { } locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', 'java/net/URI', 'org/apache/hadoop/conf/Configuration', 'org/apache/hadoop/fs/s3/S3Credentials', 'org/jets3t/service/security/AWSCredentials' } stack: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', uninitialized 32, uninitialized 32, 'org/jets3t/service/security/AWSCredentials' } Bytecode: 0000000: bb00 3159 b700 324e 2d2b 2cb6 0034 bb00 0000010: 3659 2db6 003a 2db6 003d b700 403a 042a 0000020: bb00 4259 1904 b700 45b5 0047 a700 0b3a 0000030: 042a 1904 b700 4f2a 2c12 5103 b600 55b5 0000040: 0057 2a2c 1259 1400 5ab6 005f 1400 1eb8 0000050: 0065 b500 672a 2c12 6914 001e b600 5f14 0000060: 001e b800 65b5 006b 2a2c 126d b600 71b5 0000070: 0073 2abb 0075 592b b600 78b7 007b b500 0000080: 7db1 Exception Handler Table: bci [14, 44] => handler: 47 Stackmap Table: full_frame(@47,{Object[#2],Object[#73],Object[#75],Object[#49]},{Object[#47]}) same_frame(@55) And in s3, I see something like test$folder? I am not sure, how to fix this? Any ideas? Thanks