Thanks for the DirectOutputCommitter example. However, I found that it only works for saveAsHadoopFile. What about saveAsParquetFile? It looks like Spark SQL uses ParquetOutputCommitter, which is a subclass of FileOutputCommitter.
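To make the question concrete, what I have in mind is something like the class below. This is only a sketch I put together (untested, and the class name is my own); even if it compiles, I'm not sure Spark SQL would pick it up, since with the new Hadoop APIs the committer comes from the OutputFormat rather than from a job configuration setting.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import parquet.hadoop.ParquetOutputCommitter;

// Sketch of a "direct" Parquet committer: tasks write straight to the final
// output path and the commit/abort steps become no-ops, so nothing has to be
// renamed on S3. Untested; the ParquetOutputCommitter package name may differ
// across parquet-mr versions.
public class DirectParquetOutputCommitter extends ParquetOutputCommitter {

  private final Path outputPath;

  public DirectParquetOutputCommitter(Path outputPath, TaskAttemptContext context)
      throws IOException {
    super(outputPath, context);
    this.outputPath = outputPath;
  }

  // Make tasks write to the final location instead of a _temporary directory.
  @Override
  public Path getWorkPath() {
    return outputPath;
  }

  @Override public void setupJob(JobContext context) {}
  @Override public void setupTask(TaskAttemptContext context) {}
  @Override public boolean needsTaskCommit(TaskAttemptContext context) { return false; }
  @Override public void commitTask(TaskAttemptContext context) {}
  @Override public void abortTask(TaskAttemptContext context) {}

  // Skip the rename-based commit (and the _metadata summary write).
  @Override public void commitJob(JobContext context) {}
}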
On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor <thomas.dem...@amplidata.com> wrote:

> FYI. We're currently addressing this at the Hadoop level in
> https://issues.apache.org/jira/browse/HADOOP-9565
>
> Thomas Demoor
>
> On Mon, Feb 23, 2015 at 10:16 PM, Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:
>
>> Just to close the loop in case anyone runs into the same problem I had.
>>
>> By setting --hadoop-major-version=2 when using the ec2 scripts, everything
>> worked fine.
>>
>> Darin.
>>
>> ----- Original Message -----
>> From: Darin McBeath <ddmcbe...@yahoo.com.INVALID>
>> To: Mingyu Kim <m...@palantir.com>; Aaron Davidson <ilike...@gmail.com>
>> Cc: "u...@spark.apache.org" <u...@spark.apache.org>
>> Sent: Monday, February 23, 2015 3:16 PM
>> Subject: Re: Which OutputCommitter to use for S3?
>>
>> Thanks. I think my problem might actually be the other way around.
>>
>> I'm compiling with Hadoop 2, but when I start up Spark using the ec2
>> scripts I don't specify --hadoop-major-version, and the default is 1. I'm
>> guessing that if I set it to 2, things might work correctly. I'll try it
>> and post a response.
>>
>> ----- Original Message -----
>> From: Mingyu Kim <m...@palantir.com>
>> To: Darin McBeath <ddmcbe...@yahoo.com>; Aaron Davidson <ilike...@gmail.com>
>> Cc: "u...@spark.apache.org" <u...@spark.apache.org>
>> Sent: Monday, February 23, 2015 3:06 PM
>> Subject: Re: Which OutputCommitter to use for S3?
>>
>> Cool, we will start from there. Thanks, Aaron and Josh!
>>
>> Darin, it's likely because the DirectOutputCommitter is compiled with
>> Hadoop 1 classes and you're running it with Hadoop 2.
>> org.apache.hadoop.mapred.JobContext used to be a class in Hadoop 1, and it
>> became an interface in Hadoop 2.
>>
>> Mingyu
>>
>> On 2/23/15, 11:52 AM, "Darin McBeath" <ddmcbe...@yahoo.com.INVALID> wrote:
>>
>> >Aaron. Thanks for the class. Since I'm currently writing Java-based
>> >Spark applications, I tried converting your class to Java (it seemed
>> >pretty straightforward).
>> >
>> >I set up the use of the class as follows:
>> >
>> >SparkConf conf = new SparkConf()
>> >    .set("spark.hadoop.mapred.output.committer.class",
>> >         "com.elsevier.common.DirectOutputCommitter");
>> >
>> >I then try to save a file to S3 (which I believe should use the old
>> >Hadoop APIs):
>> >
>> >JavaPairRDD<Text, Text> newBaselineRDDWritable =
>> >    reducedhsfPairRDD.mapToPair(new ConvertToWritableTypes());
>> >newBaselineRDDWritable.saveAsHadoopFile(baselineOutputBucketFile,
>> >    Text.class, Text.class, SequenceFileOutputFormat.class,
>> >    org.apache.hadoop.io.compress.GzipCodec.class);
>> >
>> >But I get the following error message:
>> >
>> >Exception in thread "main" java.lang.IncompatibleClassChangeError:
>> >Found class org.apache.hadoop.mapred.JobContext, but interface was expected
>> >  at com.elsevier.common.DirectOutputCommitter.commitJob(DirectOutputCommitter.java:68)
>> >  at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:127)
>> >  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1075)
>> >  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
>> >  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:902)
>> >  at org.apache.spark.api.java.JavaPairRDD.saveAsHadoopFile(JavaPairRDD.scala:771)
>> >  at com.elsevier.spark.SparkSyncDedup.main(SparkSyncDedup.java:156)
>> >
>> >In my class, JobContext refers to the interface
>> >org.apache.hadoop.mapred.JobContext.
>> >
>> >Is there something obvious that I might be doing wrong (or messed up in
>> >the translation from Scala to Java), or something I should look into? I'm
>> >using Spark 1.2 with Hadoop 2.4.
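>> >
>> >For reference, the shape of my Java conversion is roughly the following
>> >(retyped here as a sketch, not the exact class):
>> >
>> >import org.apache.hadoop.mapred.JobContext;
>> >import org.apache.hadoop.mapred.OutputCommitter;
>> >import org.apache.hadoop.mapred.TaskAttemptContext;
>> >
>> >// Every step is a no-op: tasks write directly to the final location,
>> >// so there is nothing to move at commit time.
>> >public class DirectOutputCommitter extends OutputCommitter {
>> >  @Override public void setupJob(JobContext jobContext) {}
>> >  @Override public void setupTask(TaskAttemptContext taskContext) {}
>> >  @Override public boolean needsTaskCommit(TaskAttemptContext taskContext) { return false; }
>> >  @Override public void commitTask(TaskAttemptContext taskContext) {}
>> >  @Override public void abortTask(TaskAttemptContext taskContext) {}
>> >  @Override public void commitJob(JobContext jobContext) {}
>> >}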
>> >
>> >Thanks.
>> >
>> >Darin.
>> >
>> >________________________________
>> >
>> >From: Aaron Davidson <ilike...@gmail.com>
>> >To: Andrew Ash <and...@andrewash.com>
>> >Cc: Josh Rosen <rosenvi...@gmail.com>; Mingyu Kim <m...@palantir.com>;
>> >"u...@spark.apache.org" <u...@spark.apache.org>; Aaron Davidson <aa...@databricks.com>
>> >Sent: Saturday, February 21, 2015 7:01 PM
>> >Subject: Re: Which OutputCommitter to use for S3?
>> >
>> >Here is the class:
>> >https://gist.github.com/aarondav/c513916e72101bbe14ec
>> >
>> >You can use it by setting "mapred.output.committer.class" in the Hadoop
>> >configuration (or "spark.hadoop.mapred.output.committer.class" in the
>> >Spark configuration). Note that this only works for the old Hadoop APIs;
>> >I believe the new Hadoop APIs strongly tie the committer to the output
>> >format (so FileOutputFormat always uses FileOutputCommitter), which makes
>> >this fix more difficult to apply.
>> >
>> >On Sat, Feb 21, 2015 at 12:12 PM, Andrew Ash <and...@andrewash.com> wrote:
>> >
>> >Josh, is that class something you guys would consider open sourcing, or
>> >would you rather the community step up and create an OutputCommitter
>> >implementation optimized for S3?
>> >>
>> >>On Fri, Feb 20, 2015 at 4:02 PM, Josh Rosen <rosenvi...@gmail.com> wrote:
>> >>
>> >>We (Databricks) use our own DirectOutputCommitter implementation, which
>> >>is a couple dozen lines of Scala code. The class is almost entirely a
>> >>no-op, except that we took some care to properly handle the _SUCCESS file.
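>> >>
>> >>Concretely, the only non-trivial method is commitJob, which just drops
>> >>the marker file into the output directory. A sketch of that shape (not
>> >>our exact code):
>> >>
>> >>// Assumes org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path,
>> >>// org.apache.hadoop.mapred.FileOutputFormat, and java.io.IOException
>> >>// are imported; everything else in the class stays a no-op.
>> >>@Override
>> >>public void commitJob(JobContext context) throws IOException {
>> >>  Path outputPath = FileOutputFormat.getOutputPath(context.getJobConf());
>> >>  if (outputPath != null) {
>> >>    FileSystem fs = outputPath.getFileSystem(context.getJobConf());
>> >>    fs.create(new Path(outputPath, "_SUCCESS")).close();  // zero-byte marker
>> >>  }
>> >>}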
>> >>>
>> >>>On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim <m...@palantir.com> wrote:
>> >>>
>> >>>I didn't get any response. It'd be really appreciated if anyone using a
>> >>>special OutputCommitter for S3 could comment on this!
>> >>>>
>> >>>>Thanks,
>> >>>>Mingyu
>> >>>>
>> >>>>From: Mingyu Kim <m...@palantir.com>
>> >>>>Date: Monday, February 16, 2015 at 1:15 AM
>> >>>>To: "u...@spark.apache.org" <u...@spark.apache.org>
>> >>>>Subject: Which OutputCommitter to use for S3?
>> >>>>
>> >>>>Hi all,
>> >>>>
>> >>>>The default OutputCommitter used by RDDs, which is FileOutputCommitter,
>> >>>>seems to require moving files at the commit step. That is not a
>> >>>>constant-time operation in S3 (a rename there is a copy followed by a
>> >>>>delete, so it scales with the amount of data), as discussed in
>> >>>>http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C543E33FA.2000802@entropy.be%3E .
>> >>>>People seem to develop their own NullOutputCommitter implementation or
>> >>>>use DirectFileOutputCommitter (as mentioned in SPARK-3595), but I
>> >>>>wanted to check whether there is a de facto standard, publicly
>> >>>>available OutputCommitter to use for S3 in conjunction with Spark.
>> >>>>
>> >>>>Thanks,
>> >>>>Mingyu

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org