Re: Indexing Support

2015-10-18 Thread Russ Weeks
Distributed R-trees are not very common. Most "big data" spatial solutions
collapse multi-dimensional data into a distributed one-dimensional index
using a space-filling curve. Many implementations exist outside of Spark,
e.g. for HBase or Accumulo. It's simple enough to write a map function that
takes a longitude/latitude pair and converts it to a position on a Z curve,
so you can work with a PairRDD keyed by curve position. It's more
complicated to convert a geospatial query expressed as a bounding box into
a set of disjoint curve intervals, but there are good examples out there.
The excellent Accumulo Recipes project has an implementation of such an
algorithm; it would be pretty easy to port it to work with a PairRDD as
described above.
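
A minimal sketch of such a map function (all names are hypothetical and not
from the original message): it interleaves the bits of the scaled longitude
and latitude into a Z (Morton) curve position and keys the RDD by that
position, so nearby points tend to get nearby keys.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class ZCurveSketch {
    // Interleave 31 bits of the scaled longitude (x) and latitude (y)
    // into a single 62-bit Morton code.
    static long zValue(double lon, double lat) {
        long x = (long) (((lon + 180.0) / 360.0) * Integer.MAX_VALUE);
        long y = (long) (((lat + 90.0) / 180.0) * Integer.MAX_VALUE);
        long z = 0L;
        for (int i = 0; i < 31; i++) {
            z |= ((x >> i) & 1L) << (2 * i);      // even bits from longitude
            z |= ((y >> i) & 1L) << (2 * i + 1);  // odd bits from latitude
        }
        return z;
    }

    // points is an RDD of (lon, lat) pairs; the result is keyed by Z-curve position.
    static JavaPairRDD<Long, Tuple2<Double, Double>> keyByZ(
            JavaRDD<Tuple2<Double, Double>> points) {
        return points.mapToPair(p ->
            new Tuple2<Long, Tuple2<Double, Double>>(zValue(p._1(), p._2()), p));
    }
}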

On Sun, Oct 18, 2015 at 3:26 PM Jerry Lam  wrote:

> I'm interested in it, but I doubt there will be R-tree indexing support in
> the near future, as Spark is not a database. You might have better luck
> looking at databases with spatial indexing support out of the box.
>
> Cheers
>
> Sent from my iPad
>
> On 2015-10-18, at 17:16, Mustafa Elbehery 
> wrote:
>
> Hi All,
>
> I am trying to use Spark to process *spatial data*. I am looking for
> R-Tree indexing support in the best case, but I would be fine with any other
> indexing capability as well, just to improve performance.
>
> Has anyone had the same issue before, and is there any information regarding
> index support in future releases?
>
> Regards.
>
> --
> Mustafa Elbehery
> EIT ICT Labs Master School 
> +49(0)15750363097
> skype: mustafaelbehery87
>
>


Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Russ Weeks
Hi, David,

This is the code that I use to create a JavaPairRDD from an Accumulo table:

JavaSparkContext sc = new JavaSparkContext(conf);
Job hadoopJob = Job.getInstance(conf, "TestSparkJob");
hadoopJob.setInputFormatClass(AccumuloInputFormat.class);
AccumuloInputFormat.setZooKeeperInstance(hadoopJob,
    conf.get(ZOOKEEPER_INSTANCE_NAME),
    conf.get(ZOOKEEPER_HOSTS));
AccumuloInputFormat.setConnectorInfo(hadoopJob,
    conf.get(ACCUMULO_AGILE_USERNAME),
    new PasswordToken(conf.get(ACCUMULO_AGILE_PASSWORD)));
AccumuloInputFormat.setInputTableName(hadoopJob, conf.get(ACCUMULO_TABLE_NAME));
AccumuloInputFormat.setScanAuthorizations(hadoopJob, auths);
JavaPairRDD<Key, Value> values = sc.newAPIHadoopRDD(hadoopJob.getConfiguration(),
    AccumuloInputFormat.class, Key.class, Value.class);

Key.class and Value.class are from org.apache.accumulo.core.data. I use a
WholeRowIterator so that the Value is actually an encoded representation of
an entire logical row; it's a useful convenience if you can be sure that
your rows always fit in memory.
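
A minimal sketch of that setup (assuming the hadoopJob from the snippet above;
the priority value 25 is arbitrary): the WholeRowIterator is added to the scan
configuration, and each encoded Value can later be decoded back into the cells
of its logical row.

import java.io.IOException;
import java.util.SortedMap;

import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.WholeRowIterator;
import org.apache.hadoop.mapreduce.Job;

public class WholeRowSketch {
    // Attach the WholeRowIterator to the input format so each Value that
    // Spark sees is an encoded logical row rather than a single cell.
    static void enableWholeRows(Job hadoopJob) {
        AccumuloInputFormat.addIterator(hadoopJob,
            new IteratorSetting(25, WholeRowIterator.class));
    }

    // Decode one encoded row back into its individual cells.
    static SortedMap<Key, Value> decodeRow(Key key, Value value) throws IOException {
        return WholeRowIterator.decodeRow(key, value);
    }
}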

I haven't tested it since Spark 1.0.1 but I doubt anything important has
changed.

Regards,
-Russ


On Thu, Mar 26, 2015 at 11:41 AM, David Holiday dav...@annaisystems.com
wrote:

  *Progress!*

 I was able to figure out why the 'input INFO not set' error was occurring.
 The eagle-eyed among you will no doubt see that the following code is
 missing a closing ')':

 AbstractInputFormat.setConnectorInfo(jobConf, root, new 
 PasswordToken(password)

 As I'm doing this in spark-notebook, I'd been clicking the execute button
 and moving on because I wasn't seeing an error. What I forgot was that the
 notebook will do what spark-shell does when you leave off a closing ')' --
 *it will wait forever for you to add it*. So the error was the result of
 the 'setConnectorInfo' method never getting executed.

 Unfortunately, I'm still unable to shove the Accumulo table data into an
 RDD that's usable to me. When I execute

 rddX.count

 I get back

 res15: Long = 10000

 which is the correct response - there are 10,000 rows of data in the table
 I pointed to. However, when I try to grab the first element of data thusly:

 rddX.first

 I get the following error:

 org.apache.spark.SparkException: Job aborted due to stage failure: Task
 0.0 in stage 0.0 (TID 0) had a not serializable result:
 org.apache.accumulo.core.data.Key

 any thoughts on where to go from here?
  DAVID HOLIDAY
  Software Engineer
  760 607 3300 | Office
  312 758 8385 | Mobile
  dav...@annaisystems.com broo...@annaisystems.com



 www.AnnaiSystems.com

  On Mar 26, 2015, at 8:35 AM, David Holiday dav...@annaisystems.com
 wrote:

  hi Nick

  Unfortunately the Accumulo docs are woefully inadequate, and in some
 places flat wrong. I'm not sure if this is a case where the docs are 'flat
 wrong', or if there's some wrinkle with spark-notebook in the mix that's
 messing everything up. I've been working with some people on Stack Overflow
 on this same issue (including one of the people from the spark-notebook
 team):


 http://stackoverflow.com/questions/29244530/how-do-i-create-a-spark-rdd-from-accumulo-1-6-in-spark-notebook?noredirect=1#comment46755938_29244530

  If you click the link you can see the entire thread of code, responses
 from the notebook, etc. I'm going to try invoking the same techniques both
 from within a stand-alone Scala program and from the shell itself to see if
 I can get some traction. I'll report back when I have more data.

  cheers (and thx!)



 DAVID HOLIDAY
  Software Engineer
  760 607 3300 | Office
  312 758 8385 | Mobile
  dav...@annaisystems.com broo...@annaisystems.com


 www.AnnaiSystems.com

  On Mar 25, 2015, at 11:43 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:

  From a quick look at this link -
 http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it
 seems you need to call some static methods on AccumuloInputFormat in order
 to set the auth, table, and range settings. Try setting these config
 options first and then call newAPIHadoopRDD?
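
 Concretely, a sketch of those configuration calls (the instance, table,
 credential, and range values here are hypothetical; the range is optional
 and just restricts the scan):

 import java.util.Collections;

 import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
 import org.apache.accumulo.core.client.security.tokens.PasswordToken;
 import org.apache.accumulo.core.data.Range;
 import org.apache.accumulo.core.security.Authorizations;
 import org.apache.hadoop.mapreduce.Job;

 public class AccumuloInputConfigSketch {
     static void configure(Job job) throws Exception {
         AccumuloInputFormat.setZooKeeperInstance(job, "myInstance", "zk1:2181,zk2:2181");
         AccumuloInputFormat.setConnectorInfo(job, "scanUser", new PasswordToken("secret"));
         AccumuloInputFormat.setInputTableName(job, "my_table");
         AccumuloInputFormat.setScanAuthorizations(job, new Authorizations("public"));
         // Optional: restrict the scan to one or more row ranges.
         AccumuloInputFormat.setRanges(job,
             Collections.singleton(new Range("row_000", "row_999")));
         // After this, pass job.getConfiguration() to sc.newAPIHadoopRDD(...).
     }
 }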

 On Thu, Mar 26, 2015 at 2:34 AM, David Holiday dav...@annaisystems.com
 wrote:

 Hi Irfan,

  Thanks for getting back to me - I'll try the Accumulo list to be sure.
 What is the normal use case for Spark, though? I'm surprised that hooking it
 into something as common and popular as Accumulo isn't more of an everyday
 task.

 DAVID HOLIDAY
  Software Engineer
  760 607 3300 | Office
  312 758 8385 | Mobile
  dav...@annaisystems.com broo...@annaisystems.com


 www.AnnaiSystems.com

On Mar 25, 2015, at 5:27 PM, Irfan Ahmad ir...@cloudphysics.com
 wrote:

Hmmm this seems very accumulo-specific, doesn't it? Not sure how
 to help with that.


  *Irfan Ahmad*
 CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com/
 Best of VMworld Finalist
  Best Cloud Management Award
  NetworkWorld 10 Startups to Watch
 EMA Most Notable Vendor


Re: Reading from HBase is too slow

2014-09-29 Thread Russ Weeks
Hi, Tao,
When I used newAPIHadoopRDD (Accumulo not HBase) I found that I had to
specify executor-memory and num-executors explicitly on the command line or
else I didn't get any parallelism across the cluster.

I used  --executor-memory 3G --num-executors 24 but obviously other
parameters will be better for your cluster.

-Russ

On Mon, Sep 29, 2014 at 7:43 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 can you look at your HBase UI to check whether your job is just reading
 from a single region server?

 Best,

 --
 Nan Zhu

 On Monday, September 29, 2014 at 10:21 PM, Tao Xiao wrote:

 I submitted a job in yarn-client mode, which simply reads from an HBase
 table containing tens of millions of records and then does a *count* action.
 The job runs for a much longer time than I expected, so I wonder whether it
 was because the data to read was too much. Actually, there are 20 nodes in
 my Hadoop cluster, so the HBase table seems not so big (tens of millions of
 records).

 I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).

 BTW, when the job was running, I can see logs on the console, and
 specifically I'd like to know what the following log means:

 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as
 TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as
 13454 bytes in 0 ms
 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426
 ms on b04.jsepc.com (progress: 18/86)
 14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)


 Thanks





Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Russ Weeks
I use newAPIHadoopRDD with AccumuloInputFormat. It produces a PairRDD using
Accumulo's Key and Value classes, both of which extend Writable. Works like
a charm. I use the same InputFormat for all my MR jobs.

-Russ

On Wed, Sep 24, 2014 at 9:33 AM, Steve Lewis lordjoe2...@gmail.com wrote:

 I tried newAPIHadoopFile and it works, except that my original InputFormat
 extends InputFormat<Text,Text> and has a RecordReader<Text,Text>.
 This throws a not-serializable exception on Text - changing the type to
 InputFormat<StringBuffer, StringBuffer> works with minor code changes.
 I do not, however, believe that Hadoop can use an InputFormat with types
 not derived from Writable.
 What were you using, and was it able to work with Hadoop?

 On Tue, Sep 23, 2014 at 5:52 PM, Liquan Pei liquan...@gmail.com wrote:

 Hi Steve,


 Did you try the newAPIHadoopFile? That worked for us.

 Thanks,
 Liquan

 On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis lordjoe2...@gmail.com
 wrote:

 Well, I had one and tried that - my message tells what I found:
 1) Spark only accepts org.apache.hadoop.mapred.InputFormat<K,V>,
 not org.apache.hadoop.mapreduce.InputFormat<K,V>
 2) Hadoop expects K and V to be Writables - I always use Text - Text is
 not Serializable and will not work with Spark - StringBuffer will work with
 Spark but not (as far as I know) with Hadoop
 - Telling me what the documentation SAYS is all well and good, but I just
 tried it and want to hear from people with real working examples

 On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei liquan...@gmail.com wrote:

 Hi Steve,

 Here is my understanding: as long as you implement InputFormat, you
 should be able to use the hadoopFile API in SparkContext to create an RDD.
 Suppose you have a customized InputFormat which we call
 CustomizedInputFormat<K, V>, where K is the key type and V is the value
 type. You can create an RDD with CustomizedInputFormat in the following
 way:

 Let sc denote the SparkContext variable and path denote the path to a
 file readable by CustomizedInputFormat; we use

 val rdd: RDD[(K, V)] = sc.hadoopFile(path,
   classOf[CustomizedInputFormat], classOf[K], classOf[V])

 to create an RDD of (K,V) with CustomizedInputFormat.
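
 A concrete instance of the same pattern (a sketch using the Java API and
 Hadoop's stock TextInputFormat rather than a customized one); note that
 hadoopFile works with the old-API (org.apache.hadoop.mapred) InputFormat:

 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.TextInputFormat;
 import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaSparkContext;

 public class HadoopFileSketch {
     // The key and value classes must match the InputFormat's type parameters.
     static JavaPairRDD<LongWritable, Text> load(JavaSparkContext sc, String path) {
         return sc.hadoopFile(path, TextInputFormat.class, LongWritable.class, Text.class);
     }
 }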

 Hope this helps,
 Liquan

 On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis lordjoe2...@gmail.com
 wrote:

  When I experimented with an InputFormat I had used in Hadoop for a long
 time, I found
 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated
 class, not org.apache.hadoop.mapreduce.lib.input.FileInputFormat)
 2) initialize needs to be called in the constructor
 3) the type parameters - mine was extends FileInputFormat<Text, Text> - must
 not be Hadoop Writables - those are not serializable - but extends
 FileInputFormat<StringBuffer, StringBuffer> does work - I don't think this
 is allowed in Hadoop

 Are these statements correct? If so, it seems like most Hadoop
 InputFormats - certainly the custom ones I create - require serious
 modifications to work. Does anyone have samples of using a Hadoop
 InputFormat?

 Since I am working with problems where a directory with multiple files
 is processed, and some files are many gigabytes in size with multi-line,
 complex records, an InputFormat is a requirement.




 --
 Liquan Pei
 Department of Physics
 University of Massachusetts Amherst




 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com




 --
 Liquan Pei
 Department of Physics
 University of Massachusetts Amherst




 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com




Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Russ Weeks
No, they do not implement Serializable. There are a couple of places where
I've had to do a Text-String conversion but generally it hasn't been a
problem.
-Russ
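
A minimal sketch of that kind of conversion (hypothetical names): map the
Writable types to plain Strings before anything has to be serialized, e.g.
before first() or collect().

import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class TextToStringSketch {
    // Convert a (Text, Text) pair RDD into a (String, String) pair RDD so
    // that results can be serialized back to the driver.
    static JavaPairRDD<String, String> toStrings(JavaPairRDD<Text, Text> pairs) {
        return pairs.mapToPair(kv ->
            new Tuple2<String, String>(kv._1().toString(), kv._2().toString()));
    }
}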

On Wed, Sep 24, 2014 at 10:27 AM, Steve Lewis lordjoe2...@gmail.com wrote:

 Do your custom Writable classes implement Serializable - I think that is
 the only real issue - my code uses vanilla Text




Re: Accumulo and Spark

2014-09-10 Thread Russ Weeks
It's very straightforward to set up a Hadoop RDD to use
AccumuloInputFormat. Something like this will do the trick:

private JavaPairRDD<Key, Value> newAccumuloRDD(JavaSparkContext sc,
AgileConf agileConf, String appName, Authorizations auths)
throws IOException, AccumuloSecurityException {
Job hadoopJob = Job.getInstance(agileConf, appName);
// configureAccumuloInput is exactly the same as for an MR job
// sets zookeeper instance, credentials, table name, auths etc.
configureAccumuloInput(hadoopJob, ACCUMULO_TABLE, auths);
return sc.newAPIHadoopRDD(hadoopJob.getConfiguration(),
AccumuloInputFormat.class, Key.class, Value.class);
}

There are tons of docs around how to operate on a JavaPairRDD. But you're
right, there's hardly anything at all regarding how to plug Accumulo into
Spark.

-Russ
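
For example (a sketch reusing the newAccumuloRDD method above), counting the
entries and pulling out the row IDs as plain Strings looks like:

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

public class AccumuloRddUsageSketch {
    // Count the (Key, Value) entries returned by newAccumuloRDD.
    static long countEntries(JavaPairRDD<Key, Value> rdd) {
        return rdd.count();
    }

    // Extract each row ID as a plain, serializable String.
    static JavaRDD<String> rowIds(JavaPairRDD<Key, Value> rdd) {
        return rdd.map(kv -> kv._1().getRow().toString());
    }
}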

On Wed, Sep 10, 2014 at 1:17 PM, Megavolt jbru...@42six.com wrote:

 I've been doing some Googling and haven't found much info on how to
 incorporate Spark and Accumulo.  Does anyone know of some examples of how
 to
 tie Spark to Accumulo (for both fetching data and dumping results)?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Accumulo-and-Spark-tp13923.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: Spark + AccumuloInputFormat

2014-09-10 Thread Russ Weeks
To answer my own question... I didn't realize that I was responsible for
telling Spark how much parallelism I wanted for my job. I figured that
between Spark and YARN they'd figure it out for themselves.

Adding --executor-memory 3G --num-executors 24 to my spark-submit command
took the query time down from 18 minutes to 30s, and I'm seeing much better
utilization of my Accumulo tablet servers.

-Russ

On Tue, Sep 9, 2014 at 5:13 PM, Russ Weeks rwe...@newbrightidea.com wrote:

 Hi,

 I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat.
 Not sure if I should be asking on the Spark list or the Accumulo list, but
 I'll try here. The problem is that the workload to process SQL queries
 doesn't seem to be distributed across my cluster very well.

 My Spark SQL app is running in yarn-client mode. The query I'm running is
 'select count(*) from audit_log' (or a similarly simple query); my
 audit_log table has 14.3M rows (504M key-value pairs) spread fairly evenly
 across 8 tablet servers. Looking at the Accumulo monitor app, I only ever
 see a maximum of 2 tablet servers with active scans. Since the data is
 spread across all the tablet servers, I hoped to see 8!

 I realize there are a lot of moving parts here, but I'd appreciate any
 advice about where to start looking.

 Using Spark 1.0.1 with Accumulo 1.6.

 Thanks!
 -Russ



Spark + AccumuloInputFormat

2014-09-09 Thread Russ Weeks
Hi,

I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat.
Not sure if I should be asking on the Spark list or the Accumulo list, but
I'll try here. The problem is that the workload to process SQL queries
doesn't seem to be distributed across my cluster very well.

My Spark SQL app is running in yarn-client mode. The query I'm running is
'select count(*) from audit_log' (or a similarly simple query); my
audit_log table has 14.3M rows (504M key-value pairs) spread fairly evenly
across 8 tablet servers. Looking at the Accumulo monitor app, I only ever
see a maximum of 2 tablet servers with active scans. Since the data is
spread across all the tablet servers, I hoped to see 8!

I realize there are a lot of moving parts here, but I'd appreciate any
advice about where to start looking.

Using Spark 1.0.1 with Accumulo 1.6.

Thanks!
-Russ