hi Nick

Unfortunately the Accumulo docs are woefully inadequate, and in some places, 
flat wrong. I'm not sure if this is a case where the docs are 'flat wrong', or 
if there's some wrinkle with spark-notebook in the mix that's messing everything 
up. I've been working with some people on Stack Overflow on this same issue 
(including one of the people from the spark-notebook team):


if you click the link you can see the entire thread of code, responses from 
notebook, etc. I'm going to try invoking the same techniques both from within a 
stand-alone scala program and from the shell itself to see if I can get some 
traction. I'll report back when I have more data.

cheers (and thx!)

Software Engineer
760 607 3300 | Office
312 758 8385 | Mobile


On Mar 25, 2015, at 11:43 PM, Nick Pentreath 
<nick.pentre...@gmail.com> wrote:

From a quick look at this link - 
http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it seems 
you need to call some static methods on AccumuloInputFormat in order to set the 
auth, table, and range settings. Try setting these config options first and 
then call newAPIHadoopRDD?
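
A minimal sketch of that suggestion, assuming Accumulo 1.6 and Hadoop 2 on the classpath. The instance name, ZooKeeper host, table, auths, and row range are taken from the scanner snippet further down the thread; the password parameter and the object/method names are placeholders made up for illustration. One wrinkle worth flagging: Scala does not expose Java static methods through a subclass, so the setters are called on the classes that declare them (AbstractInputFormat and InputFormatBase) rather than on AccumuloInputFormat itself.

```scala
import java.util.Collections

import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Range, Value}
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: the object and method names are illustrative only.
object AccumuloRddSketch {

  def firstRows(sc: SparkContext, password: String): RDD[(Key, Value)] = {
    // The Job is only a carrier for the Hadoop Configuration that the
    // static setters write into; it is never submitted.
    val job = Job.getInstance()

    // Who connects, and to which Accumulo instance.
    AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken(password))
    AbstractInputFormat.setZooKeeperInstance(
      job,
      ClientConfiguration.loadDefault()
        .withInstance("accumulo")
        .withZkHosts("localhost:2181"))

    // What to read: table, scan authorizations, and row range.
    AbstractInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))
    InputFormatBase.setInputTableName(job, "batchtest1")
    InputFormatBase.setRanges(
      job,
      Collections.singletonList(new Range("row_0000000000", "row_0000000010")))

    // Hand the populated configuration to Spark instead of an empty one.
    sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[AccumuloInputFormat],
      classOf[Key],
      classOf[Value])
  }
}
```

From the notebook this would be used roughly as `AccumuloRddSketch.firstRows(sparkContext, "...").map { case (k, v) => k.toString + " is " + v.toString }.take(10)`. Converting to strings before `take` is deliberate, since Key and Value are Hadoop Writables rather than java.io.Serializable. Keeping the Job inside a method should also stop the REPL from trying to print it, which as far as I know throws for a Job that has not been submitted.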

On Thu, Mar 26, 2015 at 2:34 AM, David Holiday 
<dav...@annaisystems.com> wrote:
hi Irfan,

thanks for getting back to me - i'll try the accumulo list to be sure. what is 
the normal use case for spark though? I'm surprised that hooking it into 
something as common and popular as accumulo isn't more of an every-day task.



On Mar 25, 2015, at 5:27 PM, Irfan Ahmad 
<ir...@cloudphysics.com> wrote:

Hmmm.... this seems very accumulo-specific, doesn't it? Not sure how to help 
with that.

Irfan Ahmad
CTO | Co-Founder | CloudPhysics<http://www.cloudphysics.com/>
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Tue, Mar 24, 2015 at 4:09 PM, David Holiday 
<dav...@annaisystems.com> wrote:
hi all,

got a vagrant image with spark notebook, spark, accumulo, and hadoop all 
running. from notebook I can manually create a scanner and pull test data from 
a table I created using one of the accumulo examples:

val instanceNameS = "accumulo"
val zooServersS = "localhost:2181"
val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
val connector: Connector = instance.getConnector( "root", new 
val auths = new Authorizations("exampleVis")
val scanner = connector.createScanner("batchtest1", auths)

scanner.setRange(new Range("row_0000000000", "row_0000000010"))

for(entry: Entry[Key, Value] <- scanner) {
  println(entry.getKey + " is " + entry.getValue)
}

will give the first ten rows of table data. when I try to create the RDD thusly:

val rdd2 =
  sparkContext.newAPIHadoopRDD (
    new Configuration(),

I get an RDD returned to me that I can't do much with due to the following 
exception:

java.io.IOException: Input info has not been set.
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)

which totally makes sense in light of the fact that I haven't specified any 
parameters as to which table to connect with, what the auths are, etc.

so my question is: what do I need to do from here to get those first ten rows 
of table data into my RDD?



On Mar 19, 2015, at 11:25 AM, David Holiday 
<dav...@annaisystems.com> wrote:

kk - I'll put something together and get back to you with more :-)



On Mar 19, 2015, at 10:59 AM, Irfan Ahmad 
<ir...@cloudphysics.com> wrote:

Once you set up spark-notebook, it'll handle the submits for interactive work. 
Non-interactive is not handled by it. For that spark-kernel could be used.

Give it a shot ... it only takes 5 minutes to get it running in local-mode.


On Thu, Mar 19, 2015 at 9:51 AM, David Holiday 
<dav...@annaisystems.com> wrote:
hi all - thx for the alacritous replies! so regarding how to get things from 
notebook to spark and back, am I correct that spark-submit is the way to go?



On Mar 19, 2015, at 1:14 AM, Paolo Platter 
<paolo.plat...@agilelab.it> wrote:

Yes, I would suggest spark-notebook too.
It's very simple to set up and it's growing pretty fast.


Sent from my Windows Phone
From: Irfan Ahmad <ir...@cloudphysics.com>
Sent: 19/03/2015 04:05
To: davidh <dav...@annaisystems.com>
Cc: user@spark.apache.org
Subject: Re: iPython Notebook + Spark + Accumulo -- best practice?

I forgot to mention that there is also Zeppelin and jove-notebook but I haven't 
got any experience with those yet.


On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad 
<ir...@cloudphysics.com> wrote:
Hi David,

W00t indeed and great questions. On the notebook front, there are two options 
depending on what you are looking for. You can either go with iPython 3 with 
Spark-kernel as a backend or you can use spark-notebook. Both have interesting 

If you are looking for a single notebook platform for your data scientists 
that has R, Python as well as a Spark Shell, you'll likely want to go with 
iPython + Spark-kernel. Downsides with the spark-kernel project are that data 
visualization isn't quite there yet, early days for documentation and 
blogs/etc. Upside is that R and Python work beautifully and that the ipython 
committers are super-helpful.

If you are OK with a primarily spark/scala experience, then I suggest you go with 
spark-notebook. Upsides are that the project is a little further along, 
visualization support is better than spark-kernel (though not as good as 
iPython with Python) and the committer is awesome with help. Downside is that 
you won't get R and Python.

FWIW: I'm using both at the moment!

Hope that helps.


On Wed, Mar 18, 2015 at 5:45 PM, davidh 
<dav...@annaisystems.com> wrote:
hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and
scanning through this archive with only moderate success. in other words --
my way of saying sorry if this is answered somewhere obvious and I missed it.

i've been tasked with figuring out how to connect Notebook, Spark, and
Accumulo together. The end user will do her work via notebook. thus far,
I've successfully set up a Vagrant image containing Spark, Accumulo, and
Hadoop. I was able to use some of the Accumulo example code to create a
table populated with data, create a simple program in scala that, when fired
off to Spark via spark-submit, connects to accumulo and prints the first ten
rows of data in the table. so w00t on that - but now I'm left with more

1) I'm still stuck on what's considered 'best practice' in terms of hooking
all this together. Let's say Sally, a user, wants to do some analytic work
on her data. She pecks the appropriate commands into notebook and fires them
off. how does this get wired together on the back end? Do I, from notebook,
use spark-submit to send a job to spark and let spark worry about hooking
into accumulo or is it preferable to create some kind of open stream between
the two?

2) if I want to extend spark's api, do I need to first submit an endless job
via spark-submit that does something like what this gentleman describes
<http://blog.madhukaraphatak.com/extending-spark-api>  ? is there an
alternative (other than refactoring spark's source) that doesn't involve
extending the api via a job submission?

ultimately what I'm looking for is help locating docs, blogs, etc that may shed
some light on this.

t/y in advance!

