ES formats are pretty easy to use.

Reading:

    val conf = new Configuration()
    conf.set("es.resource", "index/type")
    conf.set("es.query", "?q=*")

    val rdd = sc.newAPIHadoopRDD(
      conf,
      classOf[EsInputFormat[NullWritable, LinkedMapWritable]],
      classOf[NullWritable],
      classOf[LinkedMapWritable]
    )
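To illustrate that conversion step, here is a self-contained sketch of the casting pattern. The case classes TextW, DoubleW, and MapW are stand-ins of my own invention for the Hadoop writables (Text, DoubleWritable, LinkedMapWritable), so the snippet runs without Hadoop on the classpath; with the real classes the match arms look the same, but you call the writable's accessor (.toString, .get()) instead of destructuring:

```scala
// Stand-ins for the Hadoop writable wrappers, so this sketch is
// self-contained; they play the role of Text / DoubleWritable /
// LinkedMapWritable respectively.
sealed trait Writable
case class TextW(s: String) extends Writable
case class DoubleW(d: Double) extends Writable
case class MapW(entries: Map[TextW, Writable]) extends Writable

// Each field comes back typed only as a Writable; which concrete
// subclass you get depends on the indexed (mapped) type, so you
// pattern-match (effectively cast) per field.
def toScala(w: Writable): Any = w match {
  case TextW(s)   => s                                           // Text -> String
  case DoubleW(d) => d                                           // DoubleWritable -> Double
  case MapW(m)    => m.map { case (k, v) => (k.s, toScala(v)) }  // nested map, recurse
}

// One "document" as it might come back from EsInputFormat.
val doc = MapW(Map(TextW("price") -> DoubleW(9.99), TextW("name") -> TextW("widget")))
val converted = toScala(doc)
println(converted)
```

With the real writables you would do the same match over the value side of each LinkedMapWritable entry in rdd.map.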
The only gotcha is that this loads a bunch of writables that you have to convert yourself, with casts, based on the indexed types.

Writing is similar:

    val conf = new Configuration()
    conf.set("es.resource", "index/type")

    val mapWritableRDD: RDD[(NullWritable, MapWritable)] = rdd.map { ...
      val mw = new MapWritable()
      mw.put(new Text("key"), new DoubleWritable(1.0))
      (NullWritable.get(), mw)
    }

    mapWritableRDD.saveAsNewAPIHadoopFile(
      "/some/random/directory",
      classOf[NullWritable],
      classOf[MapWritable],
      classOf[EsOutputFormat],
      conf
    )

The gotcha here is that you need to provide a path to saveAsNewAPIHadoopFile even though it is not used.

On Wed, Apr 16, 2014 at 7:15 PM, Kostiantyn Kudriavtsev <kudryavtsev.konstan...@gmail.com> wrote:

> I'd prefer to find a good example of using saveAsNewAPIHadoopFile with
> different OutputFormat implementations (not only ORC, but EsOutputFormat,
> etc.). Any common example?
>
> On Apr 16, 2014, at 4:51 PM, Brock Bose <brock.b...@gmail.com> wrote:
>
> Howdy all,
>     I recently saw that the OrcInputFormat/OutputFormat's have been
> exposed to be usable outside of Hive
> (https://issues.apache.org/jira/browse/HIVE-5728). Does anyone know how
> one could use this with saveAsNewAPIHadoopFile to write records in ORC
> format?
>     In particular, I would like to use a Spark Streaming process to read
> Avro records off of Kafka and then write them directly to HDFS in ORC
> format, where they could be used with Shark.
>
> Thanks,
> Brock