Distributed R-Trees are not very common. Most "big data" spatial solutions
collapse multi-dimensional data into a distributed one-dimensional index
using a space-filling curve. Many implementations exist outside of Spark,
e.g., for HBase or Accumulo. It's simple enough to write a map function that
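The space-filling-curve idea above can be sketched with a Z-order (Morton) key: interleave the bits of the coordinates so that points close together in 2-D tend to land close together in the 1-D key space. This is a minimal standalone sketch, not code from any of the implementations mentioned:

```java
// Interleave the bits of two 32-bit coordinates into one 64-bit Z-order
// (Morton) key. Nearby points in 2-D usually map to nearby keys, so a
// 1-D range scan over the index stays reasonably local.
public class ZOrder {
    // Spread the low 32 bits of v so that bit i moves to bit 2*i.
    static long spread(long v) {
        v &= 0xFFFFFFFFL;
        v = (v | (v << 16)) & 0x0000FFFF0000FFFFL;
        v = (v | (v << 8))  & 0x00FF00FF00FF00FFL;
        v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0FL;
        v = (v | (v << 2))  & 0x3333333333333333L;
        v = (v | (v << 1))  & 0x5555555555555555L;
        return v;
    }

    // y's bits occupy the odd positions, x's bits the even positions.
    public static long mortonKey(int x, int y) {
        return (spread(y) << 1) | spread(x);
    }

    public static void main(String[] args) {
        // x=3 (011), y=5 (101) interleave to 100111 binary = 39
        System.out.println(mortonKey(3, 5));
    }
}
```

The resulting key can then be used as (a prefix of) the row key in HBase or Accumulo, which is what turns a multi-dimensional query into a small set of 1-D range scans.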
Hi, David,
This is the code that I use to create a JavaPairRDD from an Accumulo table:
JavaSparkContext sc = new JavaSparkContext(conf);
Job job = Job.getInstance(conf, "TestSparkJob");
job.setInputFormatClass(AccumuloInputFormat.class);
AccumuloInputFormat.setZooKeeperInstance(job,
Hi, Tao,
When I used newAPIHadoopRDD (Accumulo, not HBase) I found that I had to
specify executor-memory and num-executors explicitly on the command line or
else I didn't get any parallelism across the cluster.
I used --executor-memory 3G --num-executors 24 but obviously other
parameters will be
I use newAPIHadoopRDD with AccumuloInputFormat. It produces a PairRDD using
Accumulo's Key and Value classes, both of which extend Writable. Works like
a charm. I use the same InputFormat for all my MR jobs.
-Russ
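The call Russ describes can be sketched roughly as below. This is a hedged sketch assuming an Accumulo 1.x-era API and a Hadoop `Configuration` (`conf`) already populated with the Accumulo connection settings, as in the snippet earlier in the thread:

```java
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Build a PairRDD over an Accumulo table. Key and Value are Writables,
// so the same InputFormat used for MapReduce jobs works unchanged here.
static JavaPairRDD<Key, Value> accumuloRdd(JavaSparkContext sc, Configuration conf) {
    return sc.newAPIHadoopRDD(
            conf,                       // carries the Accumulo connection/table settings
            AccumuloInputFormat.class,  // same InputFormat as the MR jobs
            Key.class,
            Value.class);
}
```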
On Wed, Sep 24, 2014 at 9:33 AM, Steve Lewis lordjoe2...@gmail.com wrote:
I
No, they do not implement Serializable. There are a couple of places where
I've had to do a Text-String conversion but generally it hasn't been a
problem.
-Russ
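The Text-to-String conversion Russ mentions typically happens right before an operation that ships data between JVMs, since Hadoop's Writables are not Serializable. A minimal sketch of one common way to do it (the row/value extraction shown here is an illustration, not the poster's exact code):

```java
import java.nio.charset.StandardCharsets;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Key/Value are Writable but not Serializable, so convert to String before
// any shuffle or collect (reduceByKey, groupByKey, collect, ...).
static JavaPairRDD<String, String> toStrings(JavaPairRDD<Key, Value> accumuloRdd) {
    return accumuloRdd.mapToPair(entry -> new Tuple2<>(
            entry._1.getRow().toString(),                         // Key row Text -> String
            new String(entry._2.get(), StandardCharsets.UTF_8))); // Value bytes -> String
}
```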
On Wed, Sep 24, 2014 at 10:27 AM, Steve Lewis lordjoe2...@gmail.com wrote:
Do your custom Writable classes implement Serializable - I
It's very straightforward to set up a Hadoop RDD to use
AccumuloInputFormat. Something like this will do the trick:
private JavaPairRDD<Key, Value> newAccumuloRDD(JavaSparkContext sc,
AgileConf agileConf, String appName, Authorizations auths)
throws IOException, AccumuloSecurityException {
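The body of that method is cut off in the thread. A hedged reconstruction follows; `AgileConf` is the poster's own wrapper class, so every accessor on it below is an assumption, and the configurator calls are from the Accumulo 1.x `AccumuloInputFormat` API:

```java
private JavaPairRDD<Key, Value> newAccumuloRDD(JavaSparkContext sc,
        AgileConf agileConf, String appName, Authorizations auths)
        throws IOException, AccumuloSecurityException {
    // Assumed accessor: AgileConf hands back a Hadoop Configuration.
    Job job = Job.getInstance(agileConf.getHadoopConfiguration(), appName);
    job.setInputFormatClass(AccumuloInputFormat.class);
    // Accumulo 1.x configurator calls; instance/zookeeper/user/table accessors
    // on AgileConf are assumed, not taken from the thread.
    AccumuloInputFormat.setZooKeeperInstance(job,
            new ClientConfiguration()
                    .withInstance(agileConf.getInstanceName())
                    .withZkHosts(agileConf.getZookeepers()));
    AccumuloInputFormat.setConnectorInfo(job, agileConf.getUser(),
            new PasswordToken(agileConf.getPassword()));
    AccumuloInputFormat.setInputTableName(job, agileConf.getTableName());
    AccumuloInputFormat.setScanAuthorizations(job, auths);
    return sc.newAPIHadoopRDD(job.getConfiguration(),
            AccumuloInputFormat.class, Key.class, Value.class);
}
```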
down to 30s from 18 minutes and I'm seeing much better
utilization of my Accumulo tablet servers.
-Russ
On Tue, Sep 9, 2014 at 5:13 PM, Russ Weeks rwe...@newbrightidea.com wrote:
Hi,
I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat.
Not sure if I should be asking on the Spark list or the Accumulo list, but
I'll try here. The problem is that the workload to process SQL queries
doesn't seem to be distributed across my cluster very well.
My Spark