Hmmm.... this seems very accumulo-specific, doesn't it? Not sure how to help with that.
*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com>
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Tue, Mar 24, 2015 at 4:09 PM, David Holiday <dav...@annaisystems.com> wrote:

> hi all,
>
> got a vagrant image with spark notebook, spark, accumulo, and hadoop all
> running. from notebook I can manually create a scanner and pull test data
> from a table I created using one of the accumulo examples:
>
> val instanceNameS = "accumulo"
> val zooServersS = "localhost:2181"
> val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
> val connector: Connector = instance.getConnector("root", new PasswordToken("password"))
> val auths = new Authorizations("exampleVis")
> val scanner = connector.createScanner("batchtest1", auths)
>
> scanner.setRange(new Range("row_0000000000", "row_0000000010"))
>
> for (entry: Entry[Key, Value] <- scanner) {
>   println(entry.getKey + " is " + entry.getValue)
> }
>
> will give the first ten rows of table data. when I try to create the RDD
> thusly:
>
> val rdd2 = sparkContext.newAPIHadoopRDD(
>   new Configuration(),
>   classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
>   classOf[org.apache.accumulo.core.data.Key],
>   classOf[org.apache.accumulo.core.data.Value]
> )
>
> I get an RDD returned to me that I can't do much with due to the following
> error:
>
> java.io.IOException: Input info has not been set.
>   at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
>   at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
>   at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
>   at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:927)
>
> which totally makes sense, given that I haven't specified any parameters
> for which table to connect to, what the auths are, etc.
>
> so my question is: what do I need to do from here to get those first ten
> rows of table data into my RDD?
>
> DAVID HOLIDAY
> Software Engineer
> 760 607 3300 | Office
> 312 758 8385 | Mobile
> dav...@annaisystems.com
> www.AnnaiSystems.com
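"Input info has not been set" is AccumuloInputFormat complaining that the fresh Configuration handed to newAPIHadoopRDD never had the connector, instance, table, or scan authorizations set on it. A minimal sketch of the missing wiring, assuming the Accumulo 1.6 mapreduce API and reusing the instance name, table, and credentials from the scanner example above (verify the exact setter signatures against the javadoc for the version on the Vagrant image):

import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job

// The Job object serves only as a carrier for the input-format settings.
val job = Job.getInstance()
AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
AbstractInputFormat.setZooKeeperInstance(job,
  ClientConfiguration.loadDefault().withInstance("accumulo").withZkHosts("localhost:2181"))
AbstractInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))
InputFormatBase.setInputTableName(job, "batchtest1")

// Hand the populated configuration (not a fresh one) to Spark.
val rdd2 = sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[AccumuloInputFormat],
  classOf[Key],
  classOf[Value])

With that in place, rdd2.count() should get past the validateOptions check.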
> On Mar 19, 2015, at 11:25 AM, David Holiday <dav...@annaisystems.com> wrote:
>
> kk - I'll put something together and get back to you with more :-)
>
> On Mar 19, 2015, at 10:59 AM, Irfan Ahmad <ir...@cloudphysics.com> wrote:
>
> Once you set up spark-notebook, it'll handle the submits for interactive
> work. Non-interactive work isn't handled by it; for that, spark-kernel
> could be used.
>
> Give it a shot ... it only takes 5 minutes to get it running in local-mode.
>
> On Thu, Mar 19, 2015 at 9:51 AM, David Holiday <dav...@annaisystems.com> wrote:
>
>> hi all - thx for the alacritous replies! so regarding how to get things
>> from notebook to spark and back, am I correct that spark-submit is the
>> way to go?
>>
>> On Mar 19, 2015, at 1:14 AM, Paolo Platter <paolo.plat...@agilelab.it> wrote:
>>
>> Yes, I would suggest spark-notebook too.
>> It's very simple to set up and it's growing pretty fast.
>>
>> Paolo
>>
>> Sent from my Windows Phone
>> ------------------------------
>> From: Irfan Ahmad <ir...@cloudphysics.com>
>> Sent: 19/03/2015 04:05
>> To: davidh <dav...@annaisystems.com>
>> Cc: user@spark.apache.org
>> Subject: Re: iPython Notebook + Spark + Accumulo -- best practice?
>>
>> I forgot to mention that there are also Zeppelin and jove-notebook, but
>> I haven't got any experience with those yet.
>>
>> On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad <ir...@cloudphysics.com> wrote:
>>
>>> Hi David,
>>>
>>> W00t indeed, and great questions. On the notebook front, there are two
>>> options depending on what you are looking for. You can either go with
>>> iPython 3 with spark-kernel as a backend, or you can use spark-notebook.
>>> Both have interesting tradeoffs.
>>>
>>> If you're looking for a single notebook platform for your data
>>> scientists that has R and Python as well as a Spark shell, you'll likely
>>> want to go with iPython + spark-kernel. Downsides of the spark-kernel
>>> project are that data visualization isn't quite there yet and it's early
>>> days for documentation, blogs, etc. The upside is that R and Python work
>>> beautifully and the iPython committers are super-helpful.
>>>
>>> If you are OK with a primarily Spark/Scala experience, then I suggest
>>> you go with spark-notebook. Upsides are that the project is a little
>>> further along, visualization support is better than spark-kernel's
>>> (though not as good as iPython with Python), and the committer is
>>> awesome with help. The downside is that you won't get R and Python.
>>>
>>> FWIW: I'm using both at the moment!
>>>
>>> Hope that helps.
>>> On Wed, Mar 18, 2015 at 5:45 PM, davidh <dav...@annaisystems.com> wrote:
>>>
>>>> hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and
>>>> scanning through this archive with only moderate success. in other
>>>> words -- my way of saying sorry if this is answered somewhere obvious
>>>> and I missed it :-)
>>>>
>>>> i've been tasked with figuring out how to connect Notebook, Spark, and
>>>> Accumulo together. The end user will do her work via notebook. thus
>>>> far, I've successfully set up a Vagrant image containing Spark,
>>>> Accumulo, and Hadoop. I was able to use some of the Accumulo example
>>>> code to create a table populated with data, and to create a simple
>>>> program in scala that, when fired off to Spark via spark-submit,
>>>> connects to accumulo and prints the first ten rows of data in the
>>>> table. so w00t on that - but now I'm left with more questions:
>>>>
>>>> 1) I'm still stuck on what's considered 'best practice' in terms of
>>>> hooking all this together. Let's say Sally, a user, wants to do some
>>>> analytic work on her data. She pecks the appropriate commands into
>>>> notebook and fires them off. how does this get wired together on the
>>>> back end? Do I, from notebook, use spark-submit to send a job to spark
>>>> and let spark worry about hooking into accumulo, or is it preferable
>>>> to create some kind of open stream between the two?
>>>>
>>>> 2) if I want to extend spark's api, do I need to first submit an
>>>> endless job via spark-submit that does something like what this
>>>> gentleman describes <http://blog.madhukaraphatak.com/extending-spark-api>?
>>>> is there an alternative (other than refactoring spark's source) that
>>>> doesn't involve extending the api via a job submission? (see the
>>>> sketch after this message)
>>>>
>>>> ultimately, what I'm looking for is help locating docs, blogs, etc
>>>> that may shed some light on this.
>>>>
>>>> t/y in advance!
>>>>
>>>> d
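On question 2: the approach in the linked post is ordinary Scala implicit enrichment, so it needs neither an endless job nor changes to Spark's source; it compiles into your own jar (or can be pasted straight into a notebook cell). A minimal sketch, assuming Scala 2.10+, with a hypothetical dropEmpty operation invented purely for illustration:

import org.apache.spark.rdd.RDD

object RDDExtensions {
  // Importing RDDExtensions._ makes dropEmpty available on any
  // RDD[String] at compile time; no job submission is involved.
  implicit class RichStringRDD(val rdd: RDD[String]) extends AnyVal {
    def dropEmpty(): RDD[String] = rdd.filter(_.trim.nonEmpty)
  }
}

// Usage from a notebook cell or any driver program:
//   import RDDExtensions._
//   val nonBlank = sc.textFile("some.txt").dropEmpty().count()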