Good question.  As a very general note, one can leverage Hadoop InputFormats to create Spark RDDs.

As a rather non-trivial example, you could check out GeoMesa's implementation of mapping Accumulo entries to geospatial data types (links 1 and 2 below).

The basic strategy is to make a Hadoop Configuration object describing what to scan in Accumulo and then call SparkContext.newAPIHadoopRDD to get an RDD back.
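
Roughly, something like the untested sketch below (e.g. pasted into spark-shell with the Accumulo client jars on the classpath). It uses the Accumulo 1.x (1.6+) mapreduce API; the instance, zookeeper, table, and credential strings are placeholders, and Accumulo 2.x replaced these static setters with a builder-style AccumuloInputFormat in the accumulo-hadoop-mapreduce module:

    import org.apache.accumulo.core.client.ClientConfiguration
    import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.data.{Key, Range, Value}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("accumulo-rdd"))

    // Build a Hadoop configuration describing the scan (the static setters live
    // on AccumuloInputFormat's parent classes).
    val job = Job.getInstance()
    AbstractInputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"))
    AbstractInputFormat.setZooKeeperInstance(job,
      ClientConfiguration.loadDefault().withInstance("myInstance").withZkHosts("zoo1:2181"))
    InputFormatBase.setInputTableName(job, "myTable")
    InputFormatBase.setRanges(job, java.util.Collections.singleton(new Range())) // whole table

    // Hand the configuration to Spark; each Accumulo tablet becomes an input split.
    val rdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[AccumuloInputFormat],
      classOf[Key],
      classOf[Value])

    // Key/Value are Hadoop Writables, so map to plain types before shuffling or collecting.
    val entries = rdd.map { case (k, v) => (k.getRow.toString, v.toString) }
    println(entries.count())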

If you want a DataFrame/Dataset, you'll need to implement the Spark DataSource API on top of that RDD.
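
For reference, the older (pre-DataSourceV2) hooks look roughly like the skeleton below; the class names, package, and schema are hypothetical, and GeoMesa's real relation is considerably more involved:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Looked up via spark.read.format("com.example.AccumuloRelationProvider").
    class AccumuloRelationProvider extends RelationProvider {
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation =
        new AccumuloRelation(sqlContext, parameters("table"))
    }

    class AccumuloRelation(val sqlContext: SQLContext, table: String)
        extends BaseRelation with TableScan {

      // A toy schema; a real implementation would derive one from the data.
      override def schema: StructType =
        StructType(Seq(StructField("row", StringType), StructField("value", StringType)))

      override def buildScan(): RDD[Row] = {
        // Placeholder: build the Accumulo-backed RDD (as in the newAPIHadoopRDD
        // sketch above) and convert each Key/Value entry into a Row.
        sqlContext.sparkContext.emptyRDD[Row]
      }
    }

With that in place, spark.read.format("com.example.AccumuloRelationProvider").option("table", "myTable").load() hands you back a DataFrame whose scans go through buildScan.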

Hope that helps!

Cheers,

Jim

1. Current implementation; decently refactored. https://github.com/locationtech/geomesa/blob/main/geomesa-accumulo/geomesa-accumulo-spark/src/main/scala/org/locationtech/geomesa/spark/accumulo/AccumuloSpatialRDDProvider.scala#L52-L82

2. Older implementation; less refactoring, may be more clear. https://github.com/locationtech/geomesa/blob/geomesa_2.11-1.3.0/geomesa-accumulo/geomesa-accumulo-spark/src/main/scala/org/locationtech/geomesa/spark/accumulo/AccumuloSpatialRDDProvider.scala#L51-L100

p.s. Alternatively, if you just want to get a little data out of Accumulo, you could query for it on the master (i.e. the Spark driver) and fan the data out across the cluster, roughly as in the sketch below.  *shrugs*
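
Something like this (again untested, 1.x Connector API, placeholder connection details; sc.parallelize is the "fan out" step):

    import scala.collection.JavaConverters._
    import org.apache.accumulo.core.client.ZooKeeperInstance
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.security.Authorizations

    // Scan a small result set on the driver...
    val connector = new ZooKeeperInstance("myInstance", "zoo1:2181")
      .getConnector("user", new PasswordToken("secret"))
    val scanner = connector.createScanner("myTable", Authorizations.EMPTY)
    val rows = scanner.asScala
      .map(e => (e.getKey.getRow.toString, e.getValue.toString))
      .toVector
    scanner.close()

    // ...then fan it out across the executors as an ordinary RDD.
    val rdd = sc.parallelize(rows)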

On 8/3/20 4:46 PM, Bulldog20630405 wrote:

we would like to read rfiles directly, outside an active accumulo instance, using spark.  is there an example of how to do this?

note: i know there is a utility to print rfiles and i could start there and build my own, but i was hoping to leverage something already there.

