ES formats are pretty easy to use.

Reading:

    val conf = new Configuration()
    conf.set("es.resource", "index/type")
    conf.set("es.query", "?q=*")

    val rdd = sc.newAPIHadoopRDD(
      conf,
      classOf[EsInputFormat[NullWritable, LinkedMapWritable]],
      classOf[NullWritable],
      classOf[LinkedMapWritable]
    )
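To illustrate that conversion step, here is a self-contained sketch of the casting pattern. The case classes TextW, DoubleW, and MapW are stand-ins of my own invention for the Hadoop writables (Text, DoubleWritable, LinkedMapWritable), so the snippet runs without Hadoop on the classpath; with the real classes the match arms look the same, but you call the writable's accessor (.toString, .get()) instead of destructuring:

```scala
// Stand-ins for the Hadoop writable wrappers, so this sketch is
// self-contained; they play the role of Text / DoubleWritable /
// LinkedMapWritable respectively.
sealed trait Writable
case class TextW(s: String) extends Writable
case class DoubleW(d: Double) extends Writable
case class MapW(entries: Map[TextW, Writable]) extends Writable

// Each field comes back typed only as a Writable; which concrete
// subclass you get depends on the indexed (mapped) type, so you
// pattern-match (effectively cast) per field.
def toScala(w: Writable): Any = w match {
  case TextW(s)   => s                                           // Text -> String
  case DoubleW(d) => d                                           // DoubleWritable -> Double
  case MapW(m)    => m.map { case (k, v) => (k.s, toScala(v)) }  // nested map, recurse
}

// One "document" as it might come back from EsInputFormat.
val doc = MapW(Map(TextW("price") -> DoubleW(9.99), TextW("name") -> TextW("widget")))
val converted = toScala(doc)
println(converted)
```

With the real writables you would do the same match over the value side of each LinkedMapWritable entry in rdd.map.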
The only gotcha is that this loads a bunch of writables that you have to convert yourself, with casts, based on the indexed types.

Writing is similar:

    val conf = new Configuration()
    conf.set("es.resource", "index/type")

    val mapWritableRDD: RDD[(NullWritable, MapWritable)] = rdd.map { ...
      val mw = new MapWritable()
      mw.put(new Text("key"), new DoubleWritable(1.0))
      (NullWritable.get(), mw)
    }

    mapWritableRDD.saveAsNewAPIHadoopFile(
      "/some/random/directory",
      classOf[NullWritable],
      classOf[MapWritable],
      classOf[EsOutputFormat],
      conf
    )

The gotcha here is that you need to provide a path to saveAsNewAPIHadoopFile even though it is not used.

On Wed, Apr 16, 2014 at 7:15 PM, Kostiantyn Kudriavtsev <kudryavtsev.konstan...@gmail.com> wrote:

> I'd prefer to find a good example of using saveAsNewAPIHadoopFile with
> different OutputFormat implementations (not only ORC, but EsOutputFormat,
> etc.). Any common example?
>
> On Apr 16, 2014, at 4:51 PM, Brock Bose <brock.b...@gmail.com> wrote:
>
> Howdy all,
>     I recently saw that the OrcInputFormat/OutputFormat's have been
> exposed to be usable outside of Hive
> (https://issues.apache.org/jira/browse/HIVE-5728). Does anyone know how
> one could use this with saveAsNewAPIHadoopFile to write records in ORC
> format?
>     In particular, I would like to use a Spark Streaming process to read
> Avro records off of Kafka and then write them directly to HDFS in ORC
> format, where they could be used with Shark.
>
> Thanks,
> Brock