Hi, David,

This is the code that I use to create a JavaPairRDD from an Accumulo table:

JavaSparkContext sc = new JavaSparkContext(conf);
Job hadoopJob = Job.getInstance(conf, "TestSparkJob");
hadoopJob.setInputFormatClass(AccumuloInputFormat.class);
AccumuloInputFormat.setZooKeeperInstance(hadoopJob,
    conf.get(ZOOKEEPER_INSTANCE_NAME),
    conf.get(ZOOKEEPER_HOSTS)
);
AccumuloInputFormat.setConnectorInfo(hadoopJob,
    conf.get(ACCUMULO_AGILE_USERNAME),
    new PasswordToken(conf.get(ACCUMULO_AGILE_PASSWORD))
);
AccumuloInputFormat.setInputTableName(hadoopJob, conf.get(ACCUMULO_TABLE_NAME));
AccumuloInputFormat.setScanAuthorizations(hadoopJob, auths);
JavaPairRDD<Key, Value> values =
    sc.newAPIHadoopRDD(hadoopJob.getConfiguration(), AccumuloInputFormat.class,
        Key.class, Value.class);

Key.class and Value.class are from org.apache.accumulo.core.data. I use a
WholeRowIterator so that the Value is actually an encoded representation of
an entire logical row; it's a useful convenience if you can be sure that
your rows always fit in memory.
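
If you go the WholeRowIterator route, the wiring looks roughly like this -- a
sketch in Scala (since that's what the notebook side of this thread uses),
untested, with hadoopJob and rddX standing in for the job and RDD handles
from the snippets in this thread:

import org.apache.accumulo.core.client.IteratorSetting
import org.apache.accumulo.core.client.mapreduce.InputFormatBase
import org.apache.accumulo.core.iterators.user.WholeRowIterator

// attach the iterator to the scan (from Scala the static setters are called
// on the class that declares them, InputFormatBase); the priority (25 here)
// just has to be unique among the iterators configured on the scan
InputFormatBase.addIterator(hadoopJob,
  new IteratorSetting(25, classOf[WholeRowIterator]))

// each Value is now one encoded logical row; decode it back into its cells
// (a SortedMap[Key, Value]) inside a transformation on the RDD
val rows = rddX.map { case (k, v) => WholeRowIterator.decodeRow(k, v) }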

I haven't tested it since Spark 1.0.1, but I doubt anything important has
changed.

Regards,
-Russ


On Thu, Mar 26, 2015 at 11:41 AM, David Holiday <dav...@annaisystems.com>
wrote:

> *progress!*
>
> i was able to figure out why the 'input INFO not set' error was occurring.
> the eagle-eyed among you will no doubt see the following code is missing a
> closing ')'
>
> AbstractInputFormat.setConnectorInfo(jobConf, "root", new 
> PasswordToken("password")
>
> as I'm doing this in spark-notebook, I'd been clicking the execute button
> and moving on because I wasn't seeing an error. what I forgot was that
> notebook is going to do what spark-shell will do when you leave off a
> closing ')' -- *it will wait forever for you to add it*. so the error was
> the result of the 'setConnectorInfo' method never getting executed.
>
> unfortunately, I'm still unable to shove the accumulo table data into an
> RDD that's useable to me. when I execute
>
> rddX.count
>
> I get back
>
> res15: Long = 10000
>
> which is the correct response - there are 10,000 rows of data in the table
> I pointed to. however, when I try to grab the first element of data thusly:
>
> rddX.first
>
> I get the following error:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0.0 in stage 0.0 (TID 0) had a not serializable result:
> org.apache.accumulo.core.data.Key
>
> any thoughts on where to go from here?
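>
> one workaround I may try is converting the pair into plain serializable
> types before pulling anything back to the driver -- an untested sketch
> (rddX is the RDD from above; Key and Value aren't java.io.Serializable,
> but Strings are):
>
> // the map runs on the executors, so only plain (String, String) pairs
> // ever have to be shipped back to the driver
> val firstRow = rddX
>   .map { case (k, v) => (k.getRow.toString, new String(v.get)) }
>   .first
>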
>  DAVID HOLIDAY
>  Software Engineer
>  760 607 3300 | Office
>  312 758 8385 | Mobile
>  dav...@annaisystems.com <broo...@annaisystems.com>
>
>
>
> www.AnnaiSystems.com
>
>  On Mar 26, 2015, at 8:35 AM, David Holiday <dav...@annaisystems.com>
> wrote:
>
>  hi Nick
>
>  Unfortunately the Accumulo docs are woefully inadequate, and in some
> places, flat wrong. I'm not sure if this is a case where the docs are 'flat
> wrong', or if there's some wrinkle with spark-notebook in the mix that's
> messing everything up. I've been working with some people on stack overflow
> on this same issue (including one of the people from the spark-notebook
> team):
>
>
> http://stackoverflow.com/questions/29244530/how-do-i-create-a-spark-rdd-from-accumulo-1-6-in-spark-notebook?noredirect=1#comment46755938_29244530
>
>  if you click the link you can see the entire thread of code, responses
> from notebook, etc. I'm going to try invoking the same techniques both from
> within a stand-alone scala program and from the shell itself to see if I
> can get some traction. I'll report back when I have more data.
>
>  cheers (and thx!)
>
>
>
> DAVID HOLIDAY
>  Software Engineer
>  760 607 3300 | Office
>  312 758 8385 | Mobile
>  dav...@annaisystems.com <broo...@annaisystems.com>
>
>
> www.AnnaiSystems.com <http://www.annaisystems.com/>
>
>  On Mar 25, 2015, at 11:43 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>  From a quick look at this link -
> http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it
> seems you need to call some static methods on AccumuloInputFormat in order
> to set the auth, table, and range settings. Try setting these config
> options first and then call newAPIHadoopRDD?
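>
> Something along these lines, maybe -- a rough, untested sketch against the
> 1.6 mapreduce API, reusing the instance/table/auth values from your scanner
> snippet below:
>
> import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
> import org.apache.accumulo.core.client.security.tokens.PasswordToken
> import org.apache.accumulo.core.data.{Key, Value}
> import org.apache.accumulo.core.security.Authorizations
> import org.apache.hadoop.mapreduce.Job
>
> // configure the input format on a Hadoop Job, then hand its Configuration
> // to newAPIHadoopRDD; note the static setters are called on the classes
> // that declare them, which Scala is picky about
> val job = Job.getInstance(sparkContext.hadoopConfiguration)
> AbstractInputFormat.setZooKeeperInstance(job, "accumulo", "localhost:2181")
> AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
> AbstractInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))
> InputFormatBase.setInputTableName(job, "batchtest1")
>
> val rddX = sparkContext.newAPIHadoopRDD(job.getConfiguration,
>   classOf[AccumuloInputFormat], classOf[Key], classOf[Value])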
>
> On Thu, Mar 26, 2015 at 2:34 AM, David Holiday <dav...@annaisystems.com>
> wrote:
>
>> hi Irfan,
>>
>>  thanks for getting back to me - i'll try the accumulo list to be sure.
>> what is the normal use case for spark though? I'm surprised that hooking it
>> into something as common and popular as accumulo isn't more of an every-day
>> task.
>>
>> DAVID HOLIDAY
>>  Software Engineer
>>  760 607 3300 | Office
>>  312 758 8385 | Mobile
>>  dav...@annaisystems.com <broo...@annaisystems.com>
>>
>>
>> www.AnnaiSystems.com <http://www.annaisystems.com/>
>>
>>    On Mar 25, 2015, at 5:27 PM, Irfan Ahmad <ir...@cloudphysics.com>
>> wrote:
>>
>>    Hmmm.... this seems very accumulo-specific, doesn't it? Not sure how
>> to help with that.
>>
>>
>>  *Irfan Ahmad*
>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/>
>> Best of VMworld Finalist
>>  Best Cloud Management Award
>>  NetworkWorld 10 Startups to Watch
>> EMA Most Notable Vendor
>>
>>   On Tue, Mar 24, 2015 at 4:09 PM, David Holiday <dav...@annaisystems.com
>> > wrote:
>>
>>>  hi all,
>>>
>>>  got a vagrant image with spark notebook, spark, accumulo, and hadoop
>>> all running. from notebook I can manually create a scanner and pull test
>>> data from a table I created using one of the accumulo examples:
>>>
>>> val instanceNameS = "accumulo"
>>> val zooServersS = "localhost:2181"
>>> val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
>>> val connector: Connector =
>>>   instance.getConnector("root", new PasswordToken("password"))
>>> val auths = new Authorizations("exampleVis")
>>> val scanner = connector.createScanner("batchtest1", auths)
>>>
>>> scanner.setRange(new Range("row_0000000000", "row_0000000010"))
>>> for (entry: Entry[Key, Value] <- scanner) {
>>>   println(entry.getKey + " is " + entry.getValue)
>>> }
>>>
>>> will give the first ten rows of table data. when I try to create the RDD
>>> thusly:
>>>
>>> val rdd2 =
>>>   sparkContext.newAPIHadoopRDD (
>>>     new Configuration(),
>>>     classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
>>>     classOf[org.apache.accumulo.core.data.Key],
>>>     classOf[org.apache.accumulo.core.data.Value]
>>>   )
>>>
>>> I get an RDD returned to me that I can't do much with due to the
>>> following error:
>>>
>>> java.io.IOException: Input info has not been set.
>>>   at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
>>>   at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
>>>   at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
>>>   at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
>>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>>>   at scala.Option.getOrElse(Option.scala:120)
>>>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
>>>   at org.apache.spark.rdd.RDD.count(RDD.scala:927)
>>>
>>> which totally makes sense in light of the fact that I haven't specified
>>> any parameters as to which table to connect with, what the auths are, etc.
>>>
>>> so my question is: what do I need to do from here to get those first ten
>>> rows of table data into my RDD?
>>>
>>>
>>>
>>>   DAVID HOLIDAY
>>>  Software Engineer
>>>  760 607 3300 | Office
>>>  312 758 8385 | Mobile
>>>  dav...@annaisystems.com <broo...@annaisystems.com>
>>>
>>>
>>> www.AnnaiSystems.com <http://www.annaisystems.com/>
>>>
>>>    On Mar 19, 2015, at 11:25 AM, David Holiday <dav...@annaisystems.com>
>>> wrote:
>>>
>>>  kk - I'll put something together and get back to you with more :-)
>>>
>>> DAVID HOLIDAY
>>>  Software Engineer
>>>  760 607 3300 | Office
>>>  312 758 8385 | Mobile
>>>  dav...@annaisystems.com <broo...@annaisystems.com>
>>>
>>>
>>> www.AnnaiSystems.com <http://www.annaisystems.com/>
>>>
>>>  On Mar 19, 2015, at 10:59 AM, Irfan Ahmad <ir...@cloudphysics.com>
>>> wrote:
>>>
>>>  Once you set up spark-notebook, it'll handle the submits for
>>> interactive work. Non-interactive work is not handled by it; for that,
>>> spark-kernel could be used.
>>>
>>>  Give it a shot ... it only takes 5 minutes to get it running in
>>> local-mode.
>>>
>>>
>>>  *Irfan Ahmad*
>>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/>
>>> Best of VMworld Finalist
>>>  Best Cloud Management Award
>>>  NetworkWorld 10 Startups to Watch
>>> EMA Most Notable Vendor
>>>
>>> On Thu, Mar 19, 2015 at 9:51 AM, David Holiday <dav...@annaisystems.com>
>>> wrote:
>>>
>>>> hi all - thx for the alacritous replies! so regarding how to get things
>>>> from notebook to spark and back, am I correct that spark-submit is the way
>>>> to go?
>>>>
>>>> DAVID HOLIDAY
>>>>  Software Engineer
>>>>  760 607 3300 | Office
>>>>  312 758 8385 | Mobile
>>>>  dav...@annaisystems.com <broo...@annaisystems.com>
>>>>
>>>>
>>>> www.AnnaiSystems.com <http://www.annaisystems.com/>
>>>>
>>>>  On Mar 19, 2015, at 1:14 AM, Paolo Platter <paolo.plat...@agilelab.it>
>>>> wrote:
>>>>
>>>>   Yes, I would suggest spark-notebook too.
>>>> It's very simple to set up and it's growing pretty fast.
>>>>
>>>> Paolo
>>>>
>>>> Sent from my Windows Phone
>>>>  ------------------------------
>>>> From: Irfan Ahmad <ir...@cloudphysics.com>
>>>> Sent: 19/03/2015 04:05
>>>> To: davidh <dav...@annaisystems.com>
>>>> Cc: user@spark.apache.org
>>>> Subject: Re: iPython Notebook + Spark + Accumulo -- best practice?
>>>>
>>>>  I forgot to mention that there are also Zeppelin and jove-notebook, but
>>>> I haven't got any experience with those yet.
>>>>
>>>>
>>>>  *Irfan Ahmad*
>>>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/>
>>>> Best of VMworld Finalist
>>>>  Best Cloud Management Award
>>>>  NetworkWorld 10 Startups to Watch
>>>> EMA Most Notable Vendor
>>>>
>>>> On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad <ir...@cloudphysics.com>
>>>> wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>>  W00t indeed and great questions. On the notebook front, there are
>>>>> two options depending on what you are looking for. You can either go with
>>>>> iPython 3 with Spark-kernel as a backend or you can use spark-notebook.
>>>>> Both have interesting tradeoffs.
>>>>>
>>>>>  If you are looking for a single notebook platform for your data
>>>>> scientists that has R and Python as well as a Spark shell, you'll likely
>>>>> want to go with iPython + Spark-kernel. Downsides with the spark-kernel
>>>>> project are that data visualization isn't quite there yet and that
>>>>> documentation, blog posts, etc. are still in their early days. Upside is
>>>>> that R and Python work beautifully and that the ipython committers are
>>>>> super-helpful.
>>>>>
>>>>>  If you are OK with a primarily spark/scala experience, then I
>>>>> suggest you go with spark-notebook. Upsides are that the project is a little
>>>>> further along, visualization support is better than spark-kernel (though
>>>>> not as good as iPython with Python) and the committer is awesome with 
>>>>> help.
>>>>> Downside is that you won't get R and Python.
>>>>>
>>>>>  FWIW: I'm using both at the moment!
>>>>>
>>>>>  Hope that helps.
>>>>>
>>>>>
>>>>>  *Irfan Ahmad*
>>>>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/>
>>>>> Best of VMworld Finalist
>>>>>  Best Cloud Management Award
>>>>>  NetworkWorld 10 Startups to Watch
>>>>> EMA Most Notable Vendor
>>>>>
>>>>> On Wed, Mar 18, 2015 at 5:45 PM, davidh <dav...@annaisystems.com>
>>>>> wrote:
>>>>>
>>>>>> hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and
>>>>>> scanning through this archive with only moderate success. in other words --
>>>>>> my way of saying sorry if this is answered somewhere obvious and I missed
>>>>>> it :-)
>>>>>>
>>>>>> i've been tasked with figuring out how to connect Notebook, Spark, and
>>>>>> Accumulo together. The end user will do her work via notebook. thus far,
>>>>>> I've successfully set up a Vagrant image containing Spark, Accumulo, and
>>>>>> Hadoop. I was able to use some of the Accumulo example code to create a
>>>>>> table populated with data, and to create a simple program in scala that,
>>>>>> when fired off to Spark via spark-submit, connects to accumulo and prints
>>>>>> the first ten rows of data in the table. so w00t on that - but now I'm
>>>>>> left with more questions:
>>>>>>
>>>>>> 1) I'm still stuck on what's considered 'best practice' in terms of hooking
>>>>>> all this together. Let's say Sally, a user, wants to do some analytic work
>>>>>> on her data. She pecks the appropriate commands into notebook and fires
>>>>>> them off. how does this get wired together on the back end? Do I, from
>>>>>> notebook, use spark-submit to send a job to spark and let spark worry
>>>>>> about hooking into accumulo or is it preferable to create some kind of
>>>>>> open stream between the two?
>>>>>>
>>>>>> 2) if I want to extend spark's api, do I need to first submit an endless
>>>>>> job via spark-submit that does something like what this gentleman describes
>>>>>> <http://blog.madhukaraphatak.com/extending-spark-api> ? is there an
>>>>>> alternative (other than refactoring spark's source) that doesn't involve
>>>>>> extending the api via a job submission?
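>>>>>>
>>>>>> (if I'm reading that post right, the pattern is roughly an implicit
>>>>>> wrapper that adds methods to RDD right from the shell/notebook, with no
>>>>>> changes to spark's source -- a hedged sketch with a made-up method name:)
>>>>>>
>>>>>> import org.apache.spark.rdd.RDD
>>>>>>
>>>>>> // enrich any RDD with an extra operation via an implicit wrapper;
>>>>>> // countDistinct is purely illustrative, not a real spark api
>>>>>> implicit class RichRDD[T](rdd: RDD[T]) {
>>>>>>   def countDistinct(): Long = rdd.distinct().count()
>>>>>> }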
>>>>>>
>>>>>> ultimately, what I'm looking for is help locating docs, blogs, etc. that
>>>>>> may shed some light on this.
>>>>>>
>>>>>> t/y in advance!
>>>>>>
>>>>>> d
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/iPython-Notebook-Spark-Accumulo-best-practice-tp22137.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com <http://nabble.com/>.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>
