Hi David,
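Regarding the not-serializable result from your last message: org.apache.accumulo.core.data.Key doesn't implement java.io.Serializable, which is why rddX.count succeeds (count only ships a Long per partition back to the driver) while rddX.first fails (first ships the Key objects themselves). The usual workaround is to map to serializable types before any action that returns data. A minimal sketch, untested, using the rddX from your notebook session:

    // Key and Value aren't Serializable, but their string forms are,
    // so convert before any action that ships results to the driver.
    val firstRow = rddX
      .map { case (key, value) => (key.toString, value.toString) }
      .first

If you need more than display strings, pull out the specific pieces you care about (row ID, column family, etc.) as Strings or byte arrays instead.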
This is the code that I use to create a JavaPairRDD from an Accumulo table:

    JavaSparkContext sc = new JavaSparkContext(conf);
    Job hadoopJob = Job.getInstance(conf, "TestSparkJob");
    hadoopJob.setInputFormatClass(AccumuloInputFormat.class);
    AccumuloInputFormat.setZooKeeperInstance(hadoopJob,
        conf.get(ZOOKEEPER_INSTANCE_NAME),
        conf.get(ZOOKEEPER_HOSTS));
    AccumuloInputFormat.setConnectorInfo(hadoopJob,
        conf.get(ACCUMULO_AGILE_USERNAME),
        new PasswordToken(conf.get(ACCUMULO_AGILE_PASSWORD)));
    AccumuloInputFormat.setInputTableName(hadoopJob, conf.get(ACCUMULO_TABLE_NAME));
    AccumuloInputFormat.setScanAuthorizations(hadoopJob, auths);
    JavaPairRDD<Key, Value> values = sc.newAPIHadoopRDD(
        hadoopJob.getConfiguration(),
        AccumuloInputFormat.class,
        Key.class,
        Value.class);

Key.class and Value.class are from org.apache.accumulo.core.data. I use a WholeRowIterator so that the Value is actually an encoded representation of an entire logical row; it's a useful convenience if you can be sure that your rows always fit in memory.
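If you go the WholeRowIterator route, the wiring looks roughly like this -- a Scala sketch from memory (untested against 1.6), borrowing the hadoopJob name from my snippet above and your rddX from the notebook:

    import org.apache.accumulo.core.client.IteratorSetting
    import org.apache.accumulo.core.client.mapreduce.InputFormatBase
    import org.apache.accumulo.core.iterators.user.WholeRowIterator

    // During job setup, before calling newAPIHadoopRDD. addIterator is
    // declared on InputFormatBase; Scala, unlike Java, won't resolve
    // inherited statics through AccumuloInputFormat, so call it on the
    // declaring class. Priority 20 is arbitrary -- pick one that doesn't
    // collide with your other configured iterators.
    InputFormatBase.addIterator(hadoopJob,
      new IteratorSetting(20, classOf[WholeRowIterator]))

    // Afterwards, each (Key, Value) pair in the RDD is one whole logical
    // row; decodeRow unpacks the encoded Value into the row's cells.
    val rows = rddX.map { case (key, value) =>
      val cells = WholeRowIterator.decodeRow(key, value) // java.util.SortedMap[Key, Value]
      (key.getRow.toString, cells.size) // keep only serializable types
    }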
I haven't tested it since Spark 1.0.1, but I doubt anything important has changed.

Regards,
-Russ

On Thu, Mar 26, 2015 at 11:41 AM, David Holiday <dav...@annaisystems.com> wrote:

> *progress!*
>
> i was able to figure out why the 'input INFO not set' error was occurring.
> the eagle-eyed among you will no doubt see that the following code is
> missing a closing ')'
>
> AbstractInputFormat.setConnectorInfo(jobConf, "root", new
> PasswordToken("password")
>
> as I'm doing this in spark-notebook, I'd been clicking the execute button
> and moving on because I wasn't seeing an error. what I forgot was that
> notebook does what spark-shell does when you leave off a closing ')' --
> *it will wait forever for you to add it*. so the error was the result of
> the 'setConnectorInfo' method never getting executed.
>
> unfortunately, I'm still unable to shove the accumulo table data into an
> RDD that's usable to me. when I execute
>
> rddX.count
>
> I get back
>
> res15: Long = 10000
>
> which is the correct response -- there are 10,000 rows of data in the
> table I pointed to. however, when I try to grab the first element of data
> thusly:
>
> rddX.first
>
> I get the following error:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0.0 in stage 0.0 (TID 0) had a not serializable result:
> org.apache.accumulo.core.data.Key
>
> any thoughts on where to go from here?
>
> DAVID HOLIDAY
> Software Engineer
> 760 607 3300 | Office
> 312 758 8385 | Mobile
> dav...@annaisystems.com
> www.AnnaiSystems.com
>
> On Mar 26, 2015, at 8:35 AM, David Holiday <dav...@annaisystems.com>
> wrote:
>
> hi Nick
>
> unfortunately the Accumulo docs are woefully inadequate, and in some
> places, flat wrong. I'm not sure if this is a case where the docs are
> 'flat wrong', or if there's some wrinkle with spark-notebook in the mix
> that's messing everything up. I've been working with some people on Stack
> Overflow on this same issue (including one of the people from the
> spark-notebook team):
>
> http://stackoverflow.com/questions/29244530/how-do-i-create-a-spark-rdd-from-accumulo-1-6-in-spark-notebook?noredirect=1#comment46755938_29244530
>
> if you click the link you can see the entire thread of code, responses
> from notebook, etc. I'm going to try invoking the same techniques both
> from within a stand-alone scala program and from the shell itself to see
> if I can get some traction. I'll report back when I have more data.
>
> cheers (and thx!)
>
> DAVID HOLIDAY
> Software Engineer
> 760 607 3300 | Office
> 312 758 8385 | Mobile
> dav...@annaisystems.com
> www.AnnaiSystems.com
>
> On Mar 25, 2015, at 11:43 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
> From a quick look at this link -
> http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it
> seems you need to call some static methods on AccumuloInputFormat in order
> to set the auth, table, and range settings. Try setting these config
> options first and then call newAPIHadoopRDD?
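> Something like the following, maybe? A rough Scala sketch, untested -- I'm
> going off the 1.6 manual, and note that Scala won't resolve inherited Java
> statics through AccumuloInputFormat, so each setter is called on the class
> that declares it. The connection values are lifted from your scanner
> example below:
>
>     import org.apache.accumulo.core.client.ClientConfiguration
>     import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
>     import org.apache.accumulo.core.client.security.tokens.PasswordToken
>     import org.apache.accumulo.core.data.{Key, Range, Value}
>     import org.apache.accumulo.core.security.Authorizations
>     import org.apache.hadoop.mapreduce.Job
>
>     val job = Job.getInstance()
>
>     // connection + credentials
>     AbstractInputFormat.setZooKeeperInstance(job,
>       ClientConfiguration.loadDefault().withInstance("accumulo").withZkHosts("localhost:2181"))
>     AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
>     AbstractInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))
>
>     // table to read, plus (optionally) the rows your scanner pulled
>     InputFormatBase.setInputTableName(job, "batchtest1")
>     InputFormatBase.setRanges(job,
>       java.util.Collections.singleton(new Range("row_0000000000", "row_0000000010")))
>
>     // hand the populated configuration to Spark instead of an empty one
>     val rdd = sparkContext.newAPIHadoopRDD(
>       job.getConfiguration,
>       classOf[AccumuloInputFormat],
>       classOf[Key],
>       classOf[Value])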
> On Thu, Mar 26, 2015 at 2:34 AM, David Holiday <dav...@annaisystems.com>
> wrote:
>
>> hi Irfan,
>>
>> thanks for getting back to me - i'll try the accumulo list to be sure.
>> what is the normal use case for spark though? I'm surprised that hooking
>> it into something as common and popular as accumulo isn't more of an
>> every-day task.
>>
>> DAVID HOLIDAY
>> Software Engineer
>> 760 607 3300 | Office
>> 312 758 8385 | Mobile
>> dav...@annaisystems.com
>> www.AnnaiSystems.com
>>
>> On Mar 25, 2015, at 5:27 PM, Irfan Ahmad <ir...@cloudphysics.com> wrote:
>>
>> Hmmm.... this seems very accumulo-specific, doesn't it? Not sure how to
>> help with that.
>>
>> *Irfan Ahmad*
>> CTO | Co-Founder | *CloudPhysics*
>> Best of VMworld Finalist
>> Best Cloud Management Award
>> NetworkWorld 10 Startups to Watch
>> EMA Most Notable Vendor
>>
>> On Tue, Mar 24, 2015 at 4:09 PM, David Holiday <dav...@annaisystems.com>
>> wrote:
>>
>>> hi all,
>>>
>>> got a vagrant image with spark notebook, spark, accumulo, and hadoop all
>>> running. from notebook I can manually create a scanner and pull test
>>> data from a table I created using one of the accumulo examples:
>>>
>>> val instanceNameS = "accumulo"
>>> val zooServersS = "localhost:2181"
>>> val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
>>> val connector: Connector = instance.getConnector("root", new PasswordToken("password"))
>>> val auths = new Authorizations("exampleVis")
>>> val scanner = connector.createScanner("batchtest1", auths)
>>>
>>> scanner.setRange(new Range("row_0000000000", "row_0000000010"))
>>>
>>> for (entry: Entry[Key, Value] <- scanner) {
>>>   println(entry.getKey + " is " + entry.getValue)
>>> }
>>>
>>> will give the first ten rows of table data. when I try to create the RDD
>>> thusly:
>>>
>>> val rdd2 = sparkContext.newAPIHadoopRDD(
>>>   new Configuration(),
>>>   classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
>>>   classOf[org.apache.accumulo.core.data.Key],
>>>   classOf[org.apache.accumulo.core.data.Value]
>>> )
>>>
>>> I get an RDD returned to me that I can't do much with due to the
>>> following error:
>>>
>>> java.io.IOException: Input info has not been set.
>>>   at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
>>>   at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
>>>   at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
>>>   at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
>>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>>>   at scala.Option.getOrElse(Option.scala:120)
>>>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
>>>   at org.apache.spark.rdd.RDD.count(RDD.scala:927)
>>>
>>> which totally makes sense in light of the fact that I haven't specified
>>> any parameters as to which table to connect with, what the auths are,
>>> etc.
>>>
>>> so my question is: what do I need to do from here to get those first ten
>>> rows of table data into my RDD?
>>>
>>> DAVID HOLIDAY
>>> Software Engineer
>>> 760 607 3300 | Office
>>> 312 758 8385 | Mobile
>>> dav...@annaisystems.com
>>> www.AnnaiSystems.com
>>>
>>> On Mar 19, 2015, at 11:25 AM, David Holiday <dav...@annaisystems.com>
>>> wrote:
>>>
>>> kk - I'll put something together and get back to you with more :-)
>>>
>>> DAVID HOLIDAY
>>> Software Engineer
>>> 760 607 3300 | Office
>>> 312 758 8385 | Mobile
>>> dav...@annaisystems.com
>>> www.AnnaiSystems.com
>>>
>>> On Mar 19, 2015, at 10:59 AM, Irfan Ahmad <ir...@cloudphysics.com>
>>> wrote:
>>>
>>> Once you set up spark-notebook, it'll handle the submits for interactive
>>> work. Non-interactive work is not handled by it; for that, spark-kernel
>>> could be used.
>>>
>>> Give it a shot ... it only takes 5 minutes to get it running in
>>> local-mode.
>>>
>>> *Irfan Ahmad*
>>> CTO | Co-Founder | *CloudPhysics*
>>> Best of VMworld Finalist
>>> Best Cloud Management Award
>>> NetworkWorld 10 Startups to Watch
>>> EMA Most Notable Vendor
>>>
>>> On Thu, Mar 19, 2015 at 9:51 AM, David Holiday <dav...@annaisystems.com>
>>> wrote:
>>>
>>>> hi all - thx for the alacritous replies! so regarding how to get things
>>>> from notebook to spark and back, am I correct that spark-submit is the
>>>> way to go?
>>>>
>>>> DAVID HOLIDAY
>>>> Software Engineer
>>>> 760 607 3300 | Office
>>>> 312 758 8385 | Mobile
>>>> dav...@annaisystems.com
>>>> www.AnnaiSystems.com
>>>>
>>>> On Mar 19, 2015, at 1:14 AM, Paolo Platter <paolo.plat...@agilelab.it>
>>>> wrote:
>>>>
>>>> Yes, I would suggest spark-notebook too. It's very simple to set up and
>>>> it's growing pretty fast.
>>>>
>>>> Paolo
>>>>
>>>> Sent from my Windows Phone
>>>> ------------------------------
>>>> From: Irfan Ahmad <ir...@cloudphysics.com>
>>>> Sent: 19/03/2015 04:05
>>>> To: davidh <dav...@annaisystems.com>
>>>> Cc: user@spark.apache.org
>>>> Subject: Re: iPython Notebook + Spark + Accumulo -- best practice?
>>>>
>>>> I forgot to mention that there are also Zeppelin and jove-notebook, but
>>>> I haven't got any experience with those yet.
>>>> *Irfan Ahmad*
>>>> CTO | Co-Founder | *CloudPhysics*
>>>> Best of VMworld Finalist
>>>> Best Cloud Management Award
>>>> NetworkWorld 10 Startups to Watch
>>>> EMA Most Notable Vendor
>>>>
>>>> On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad <ir...@cloudphysics.com>
>>>> wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> W00t indeed, and great questions. On the notebook front, there are two
>>>>> options depending on what you are looking for. You can either go with
>>>>> iPython 3 with spark-kernel as a backend, or you can use
>>>>> spark-notebook. Both have interesting tradeoffs.
>>>>>
>>>>> If you are looking for a single notebook platform for your data
>>>>> scientists that has R and Python as well as a Spark shell, you'll
>>>>> likely want to go with iPython + spark-kernel. Downsides with the
>>>>> spark-kernel project are that data visualization isn't quite there yet
>>>>> and it's early days for documentation, blogs, etc. Upsides are that R
>>>>> and Python work beautifully and that the iPython committers are
>>>>> super-helpful.
>>>>>
>>>>> If you are OK with a primarily spark/scala experience, then I suggest
>>>>> you go with spark-notebook. Upsides are that the project is a little
>>>>> further along, visualization support is better than spark-kernel's
>>>>> (though not as good as iPython with Python), and the committer is
>>>>> awesome with help. Downside is that you won't get R and Python.
>>>>>
>>>>> FWIW: I'm using both at the moment!
>>>>>
>>>>> Hope that helps.
>>>>>
>>>>> *Irfan Ahmad*
>>>>> CTO | Co-Founder | *CloudPhysics*
>>>>> Best of VMworld Finalist
>>>>> Best Cloud Management Award
>>>>> NetworkWorld 10 Startups to Watch
>>>>> EMA Most Notable Vendor
>>>>>
>>>>> On Wed, Mar 18, 2015 at 5:45 PM, davidh <dav...@annaisystems.com>
>>>>> wrote:
>>>>>
>>>>>> hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and
>>>>>> scanning through this archive with only moderate success. in other
>>>>>> words -- my way of saying sorry if this is answered somewhere obvious
>>>>>> and I missed it :-)
>>>>>>
>>>>>> i've been tasked with figuring out how to connect Notebook, Spark,
>>>>>> and Accumulo together. the end user will do her work via notebook.
>>>>>> thus far, I've successfully set up a Vagrant image containing Spark,
>>>>>> Accumulo, and Hadoop. I was able to use some of the Accumulo example
>>>>>> code to create a table populated with data, and to create a simple
>>>>>> program in scala that, when fired off to Spark via spark-submit,
>>>>>> connects to accumulo and prints the first ten rows of data in the
>>>>>> table. so w00t on that - but now I'm left with more questions:
>>>>>>
>>>>>> 1) I'm still stuck on what's considered 'best practice' in terms of
>>>>>> hooking all this together. let's say Sally, a user, wants to do some
>>>>>> analytic work on her data. she pecks the appropriate commands into
>>>>>> notebook and fires them off. how does this get wired together on the
>>>>>> back end? do I, from notebook, use spark-submit to send a job to
>>>>>> spark and let spark worry about hooking into accumulo, or is it
>>>>>> preferable to create some kind of open stream between the two?
>>>>>> 2) if I want to extend spark's api, do I need to first submit an
>>>>>> endless job via spark-submit that does something like what this
>>>>>> gentleman describes
>>>>>> <http://blog.madhukaraphatak.com/extending-spark-api>? is there an
>>>>>> alternative (other than refactoring spark's source) that doesn't
>>>>>> involve extending the api via a job submission?
>>>>>>
>>>>>> ultimately, what I'm looking for is help locating docs, blogs, etc.
>>>>>> that may shed some light on this.
>>>>>>
>>>>>> t/y in advance!
>>>>>>
>>>>>> d