Has anyone done any testing with Spark and AccumuloRowInputFormat? I have no problem doing this for AccumuloInputFormat:

    JavaPairRDD<Key, Value> pairRDD =
        sparkContext.newAPIHadoopRDD(job.getConfiguration(),
            AccumuloInputFormat.class, Key.class, Value.class);

But I run into a snag trying to do a similar thing:

    JavaPairRDD<Text, PeekingIterator<Map.Entry<Key, Value>>> pairRDD =
        sparkContext.newAPIHadoopRDD(job.getConfiguration(),
            AccumuloRowInputFormat.class, Text.class, PeekingIterator.class);

The compilation error is (big, sorry):

    Error:(141, 97) java: method newAPIHadoopRDD in class org.apache.spark.api.java.JavaSparkContext cannot be applied to given types;
      required: org.apache.hadoop.conf.Configuration,java.lang.Class<F>,java.lang.Class<K>,java.lang.Class<V>
      found: org.apache.hadoop.conf.Configuration,java.lang.Class<org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat>,java.lang.Class<org.apache.hadoop.io.Text>,java.lang.Class<org.apache.accumulo.core.util.PeekingIterator>
      reason: inferred type does not conform to declared bound(s)
        inferred: org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat
        bound(s): org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.Text,org.apache.accumulo.core.util.PeekingIterator>

I've tried a few things. The signature of the function is:

    public <K, V, F extends org.apache.hadoop.mapreduce.InputFormat<K, V>>
        JavaPairRDD<K, V> newAPIHadoopRDD(Configuration conf, Class<F> fClass,
            Class<K> kClass, Class<V> vClass)

I guess it's having trouble with the format extending InputFormatBase with its own additional generic parameters (the Map.Entry inside PeekingIterator). This may be an issue to chase with Spark vs Accumulo, unless something can be tweaked on the Accumulo side or I could wrap the InputFormat with my own somehow.

Accumulo 1.6.1, Spark 1.3.1, JDK 7u71.

Stopping short of this, can anyone think of a good way to use AccumuloInputFormat to get what I'm getting from the Row version in a performant way? It doesn't necessarily have to be an iterator approach, but I'd need all my values with the key in one consuming function.
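
One thing that might be worth trying before wrapping the InputFormat: as far as I can tell, the bound error comes from passing the raw PeekingIterator.class, which makes V the raw type, while the format is declared against PeekingIterator<Map.Entry<Key, Value>>. A double cast of the class token through Class<?> usually gets past this kind of bound check. The sketch below only illustrates the generics trick. InputFormat, RowInputFormat, describe and rowValueClass are hypothetical stand-ins, not the real Hadoop, Accumulo, or Spark types, and none of this has been tried against Spark 1.3.1.

```java
import java.util.Iterator;
import java.util.Map;

public class ClassTokenCast {
    // Hypothetical stand-in for org.apache.hadoop.mapreduce.InputFormat<K, V>.
    static abstract class InputFormat<K, V> {}

    // Hypothetical stand-in for a row format whose value type is itself generic.
    static class RowInputFormat
            extends InputFormat<String, Iterator<Map.Entry<String, String>>> {}

    // Same generic shape as the newAPIHadoopRDD signature quoted above.
    static <K, V, F extends InputFormat<K, V>> String describe(
            Class<F> fClass, Class<K> kClass, Class<V> vClass) {
        return fClass.getSimpleName() + "<" + kClass.getSimpleName()
                + ", " + vClass.getSimpleName() + ">";
    }

    // Unchecked double cast: the runtime object is still Iterator.class,
    // only the compile-time type parameter changes.
    @SuppressWarnings("unchecked")
    static Class<Iterator<Map.Entry<String, String>>> rowValueClass() {
        return (Class<Iterator<Map.Entry<String, String>>>) (Class<?>) Iterator.class;
    }

    public static void main(String[] args) {
        // Passing the raw Iterator.class here is rejected by the compiler,
        // because V would be the raw Iterator and the bound would not match.
        System.out.println(
                describe(RowInputFormat.class, String.class, rowValueClass()));
    }
}
```

With the real types, the analogous last argument would presumably be (Class<PeekingIterator<Map.Entry<Key, Value>>>) (Class<?>) PeekingIterator.class. Whether the iterator values then behave well inside Spark is a separate question.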
I'm looking into ways to do it in Spark functions but trying to avoid any major performance hits.

Thanks,
Marc

p.s. The summit was absolutely great, thank you all for having it!
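
For the "all values with the key in one consuming function" part, one shuffle-free idea is to group consecutive entries that share a row while streaming through each partition. The sketch below shows only that grouping step in plain Java. It uses String pairs as hypothetical stand-ins for Accumulo Key/Value entries, and it assumes entries arrive sorted by row and that no row is split across partitions. Both assumptions would need checking for a real table. RowGrouper, groupConsecutive and demo are illustrative names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

public class RowGrouper {
    // Lazily turns a row-sorted stream of (row, value) pairs into
    // (row, all values of that row), holding one row in memory at a time.
    static Iterator<Map.Entry<String, List<String>>> groupConsecutive(
            final Iterator<Map.Entry<String, String>> sorted) {
        return new Iterator<Map.Entry<String, List<String>>>() {
            // First entry of the next row, or null when the input is exhausted.
            private Map.Entry<String, String> pending = advance();

            private Map.Entry<String, String> advance() {
                return sorted.hasNext() ? sorted.next() : null;
            }

            @Override
            public boolean hasNext() {
                return pending != null;
            }

            @Override
            public Map.Entry<String, List<String>> next() {
                if (pending == null) {
                    throw new NoSuchElementException();
                }
                String row = pending.getKey();
                List<String> values = new ArrayList<String>();
                // Consume entries until the row id changes.
                while (pending != null && pending.getKey().equals(row)) {
                    values.add(pending.getValue());
                    pending = advance();
                }
                return new SimpleEntry<String, List<String>>(row, values);
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    // Small demonstration on hand-written data.
    static String demo() {
        List<Map.Entry<String, String>> input = Arrays.<Map.Entry<String, String>>asList(
                new SimpleEntry<String, String>("r1", "a"),
                new SimpleEntry<String, String>("r1", "b"),
                new SimpleEntry<String, String>("r2", "c"));
        List<Map.Entry<String, List<String>>> out =
                new ArrayList<Map.Entry<String, List<String>>>();
        Iterator<Map.Entry<String, List<String>>> it = groupConsecutive(input.iterator());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In Spark, something like this could presumably be applied per partition (for example via mapPartitions) over the RDD that AccumuloInputFormat already produces, which would avoid a groupByKey shuffle if the assumptions above hold.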