spark with AccumuloRowInputFormat?

Marc Reichman Mon, 04 May 2015 08:48:50 -0700

Has anyone done any testing with Spark and AccumuloRowInputFormat? I have
no problem doing this for AccumuloInputFormat:


JavaPairRDD<Key, Value> pairRDD =
    sparkContext.newAPIHadoopRDD(job.getConfiguration(),
        AccumuloInputFormat.class,
        Key.class, Value.class);

But I run into a snag trying to do a similar thing:

JavaPairRDD<Text, PeekingIterator<Map.Entry<Key, Value>>> pairRDD =
    sparkContext.newAPIHadoopRDD(job.getConfiguration(),
        AccumuloRowInputFormat.class,
        Text.class, PeekingIterator.class);

The compilation error is (big, sorry):

Error:(141, 97) java: method newAPIHadoopRDD in class org.apache.spark.api.java.JavaSparkContext cannot be applied to given types;
  required: org.apache.hadoop.conf.Configuration,java.lang.Class<F>,java.lang.Class<K>,java.lang.Class<V>
  found: org.apache.hadoop.conf.Configuration,java.lang.Class<org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat>,java.lang.Class<org.apache.hadoop.io.Text>,java.lang.Class<org.apache.accumulo.core.util.PeekingIterator>
  reason: inferred type does not conform to declared bound(s)
    inferred: org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat
    bound(s): org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.Text,org.apache.accumulo.core.util.PeekingIterator>

I've tried a few things. The signature of the method is:

public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
    JavaPairRDD<K, V> newAPIHadoopRDD(Configuration conf, Class<F> fClass,
                                      Class<K> kClass, Class<V> vClass)

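My guess is that it's having trouble because the format extends
InputFormatBase with a nested generic value type
(PeekingIterator<Map.Entry<Key, Value>>), while PeekingIterator.class only
carries the raw PeekingIterator type, as the bound in the error shows.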
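
To illustrate the bound failure without Spark or Accumulo on the classpath,
here is a minimal self-contained sketch. The classes InputFormat,
PeekingIterator and RowInputFormat and the methods describe and
rowValueClass are hypothetical stand-ins that only mirror the shape of the
real signatures, and String stands in for Text, Key and Value. It also shows
one possible workaround, an unchecked double cast of the class literal,
which is untested against the real newAPIHadoopRDD.

```java
import java.util.Map;

final class GenericBoundSketch {
    // Hypothetical stand-in for org.apache.hadoop.mapreduce.InputFormat<K, V>.
    static abstract class InputFormat<K, V> {}

    // Hypothetical stand-in for Accumulo's PeekingIterator<E>.
    static class PeekingIterator<E> {}

    // Hypothetical stand-in for AccumuloRowInputFormat: the value type is a
    // parameterized PeekingIterator, not the raw one.
    static class RowInputFormat
            extends InputFormat<String, PeekingIterator<Map.Entry<String, String>>> {}

    // Same generic shape as newAPIHadoopRDD, minus the Configuration argument.
    static <K, V, F extends InputFormat<K, V>> String describe(
            Class<F> fClass, Class<K> kClass, Class<V> vClass) {
        return fClass.getSimpleName() + " -> ("
                + kClass.getSimpleName() + ", " + vClass.getSimpleName() + ")";
    }

    // A class literal is always Class<raw type>, so the parameterized form
    // has to be obtained through an unchecked cast via Class<?>.
    @SuppressWarnings("unchecked")
    static Class<PeekingIterator<Map.Entry<String, String>>> rowValueClass() {
        return (Class<PeekingIterator<Map.Entry<String, String>>>)
                (Class<?>) PeekingIterator.class;
    }

    public static void main(String[] args) {
        // Does not compile: V is inferred as raw PeekingIterator, and
        // RowInputFormat is not an InputFormat<String, PeekingIterator>.
        // describe(RowInputFormat.class, String.class, PeekingIterator.class);

        // Compiles: V now matches the value type RowInputFormat declares.
        System.out.println(
                describe(RowInputFormat.class, String.class, rowValueClass()));
    }
}
```

The cast is safe only in the sense that generics are erased at runtime; the
compiler can no longer check that the value type really matches.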

This may be an issue to chase on the Spark side rather than the Accumulo
side, unless something can be tweaked in Accumulo or I could somehow wrap
the InputFormat with my own.

Accumulo 1.6.1, Spark 1.3.1, JDK 7u71.

Failing that, can anyone think of a good way to use AccumuloInputFormat to
get what I would get from the Row version in a performant way? It doesn't
necessarily have to be an iterator approach, but I need all the values for
a row together with its key in one consuming function. I'm looking into
ways to do it with Spark functions, but I'm trying to avoid any major
performance hits.
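
One direction for the row grouping described above, sketched in plain Java
so it runs without Spark or Accumulo: walk the sorted entries once and start
a new group whenever the row changes. The class RowGroupingSketch and the
method groupConsecutive are hypothetical names, and String stands in for the
row Text and the Value. It assumes entries arrive sorted by row and that no
row straddles two partitions; under those assumptions the same loop could
plausibly serve as the body of a mapPartitions function, avoiding a shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class RowGroupingSketch {
    // Groups consecutive (row, value) entries that share the same row.
    // Assumes the input is already sorted by row, as a scan would return it.
    static List<Map.Entry<String, List<String>>> groupConsecutive(
            Iterator<Map.Entry<String, String>> sorted) {
        List<Map.Entry<String, List<String>>> rows =
                new ArrayList<Map.Entry<String, List<String>>>();
        String currentRow = null;
        List<String> currentValues = null;
        while (sorted.hasNext()) {
            Map.Entry<String, String> e = sorted.next();
            // Start a new group on the first entry or when the row changes.
            if (currentValues == null || !e.getKey().equals(currentRow)) {
                currentRow = e.getKey();
                currentValues = new ArrayList<String>();
                rows.add(new SimpleEntry<String, List<String>>(
                        currentRow, currentValues));
            }
            currentValues.add(e.getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> entries =
                new ArrayList<Map.Entry<String, String>>();
        entries.add(new SimpleEntry<String, String>("r1", "a"));
        entries.add(new SimpleEntry<String, String>("r1", "b"));
        entries.add(new SimpleEntry<String, String>("r2", "c"));
        System.out.println(groupConsecutive(entries.iterator()));
    }
}
```

This version materializes each row's values in a list, so it trades the
lazy iteration of the Row format for simplicity; very wide rows would need
a streaming variant.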

Thanks,

Marc

p.s. The summit was absolutely great, thank you all for having it!
