Good question.  As a very general note, one can leverage Hadoop InputFormats to create Spark RDDs.

As a rather non-trivial example, you could check out GeoMesa's implementation of mapping Accumulo entries to geospatial data types (links 1 and 2 below).

The basic strategy is to make a Hadoop Configuration object describing what to scan in Accumulo and then call SparkContext.newAPIHadoopRDD to get an RDD back.
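
Roughly, something like the untested sketch below (e.g. pasted into spark-shell with the Accumulo client jars on the classpath). It uses the Accumulo 1.x (1.6+) mapreduce API; the instance, zookeeper, table, and credential strings are placeholders, and Accumulo 2.x replaced these static setters with a builder-style AccumuloInputFormat in the accumulo-hadoop-mapreduce module:

    import org.apache.accumulo.core.client.ClientConfiguration
    import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.data.{Key, Range, Value}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("accumulo-rdd"))

    // Build a Hadoop configuration describing the scan (the static setters live
    // on AccumuloInputFormat's parent classes).
    val job = Job.getInstance()
    AbstractInputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"))
    AbstractInputFormat.setZooKeeperInstance(job,
      ClientConfiguration.loadDefault().withInstance("myInstance").withZkHosts("zoo1:2181"))
    InputFormatBase.setInputTableName(job, "myTable")
    InputFormatBase.setRanges(job, java.util.Collections.singleton(new Range())) // whole table

    // Hand the configuration to Spark; each Accumulo tablet becomes an input split.
    val rdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[AccumuloInputFormat],
      classOf[Key],
      classOf[Value])

    // Key/Value are Hadoop Writables, so map to plain types before shuffling or collecting.
    val entries = rdd.map { case (k, v) => (k.getRow.toString, v.toString) }
    println(entries.count())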

If you want a DataFrame/Dataset, you'll need to implement the Spark DataSource API on top of that RDD.
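
For reference, the older (pre-DataSourceV2) hooks look roughly like the skeleton below; the class names, package, and schema are hypothetical, and GeoMesa's real relation is considerably more involved:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Looked up via spark.read.format("com.example.AccumuloRelationProvider").
    class AccumuloRelationProvider extends RelationProvider {
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation =
        new AccumuloRelation(sqlContext, parameters("table"))
    }

    class AccumuloRelation(val sqlContext: SQLContext, table: String)
        extends BaseRelation with TableScan {

      // A toy schema; a real implementation would derive one from the data.
      override def schema: StructType =
        StructType(Seq(StructField("row", StringType), StructField("value", StringType)))

      override def buildScan(): RDD[Row] = {
        // Placeholder: build the Accumulo-backed RDD (as in the newAPIHadoopRDD
        // sketch above) and convert each Key/Value entry into a Row.
        sqlContext.sparkContext.emptyRDD[Row]
      }
    }

With that in place, spark.read.format("com.example.AccumuloRelationProvider").option("table", "myTable").load() hands you back a DataFrame whose scans go through buildScan.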

Hope that helps!

Cheers,

Jim

1. Current implementation; decently refactored. https://github.com/locationtech/geomesa/blob/main/geomesa-accumulo/geomesa-accumulo-spark/src/main/scala/org/locationtech/geomesa/spark/accumulo/AccumuloSpatialRDDProvider.scala#L52-L82

2. Older implementation; less refactoring, may be more clear. https://github.com/locationtech/geomesa/blob/geomesa_2.11-1.3.0/geomesa-accumulo/geomesa-accumulo-spark/src/main/scala/org/locationtech/geomesa/spark/accumulo/AccumuloSpatialRDDProvider.scala#L51-L100

p.s. Alternatively, if you just want to get a little data out of Accumulo, you could query for it on the master (i.e. the Spark driver) and fan the data out across the cluster, roughly as in the sketch below.  *shrugs*
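
Something like this (again untested, 1.x Connector API, placeholder connection details; sc.parallelize is the "fan out" step):

    import scala.collection.JavaConverters._
    import org.apache.accumulo.core.client.ZooKeeperInstance
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.security.Authorizations

    // Scan a small result set on the driver...
    val connector = new ZooKeeperInstance("myInstance", "zoo1:2181")
      .getConnector("user", new PasswordToken("secret"))
    val scanner = connector.createScanner("myTable", Authorizations.EMPTY)
    val rows = scanner.asScala
      .map(e => (e.getKey.getRow.toString, e.getValue.toString))
      .toVector
    scanner.close()

    // ...then fan it out across the executors as an ordinary RDD.
    val rdd = sc.parallelize(rows)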

On 8/3/20 4:46 PM, Bulldog20630405 wrote:

we would like to read rfiles directly, outside an active accumulo instance, using spark.  is there an example of how to do this?

note: i know there is a utility to print rfiles and i could start there and build my own, but i was hoping to leverage something already there.

