That's purely awesome! Don't hesitate to contribute your notebook back to the spark-notebook repo, even if it's rough; I'll help clean it up if needed.
The Vagrant image is also appealing :-) Congrats!

On Thu, Mar 26, 2015 at 22:22, David Holiday <dav...@annaisystems.com> wrote:

w00000000000000000t! That did it -- t/y so much!

I'm going to put together a pastebin or something that has all the code in one place, so if anyone else runs into this issue they'll have some working code to help them figure out what's going on.

On Mar 26, 2015, at 12:24 PM, Corey Nolet <cjno...@gmail.com> wrote:

Spark uses a SerializableWritable [1] to Java-serialize Writable objects. I've noticed (at least in Spark 1.2.1) that it breaks down with some objects when Kryo is used instead of regular Java serialization. Though it wraps the actual AccumuloInputFormat (another example of something you may want to do in the future), we have Accumulo working to load data from a table into Spark SQL [2]. The way Spark uses the InputFormat is very straightforward.

[1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala
[2] https://github.com/calrissian/accumulo-recipes/blob/master/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStoreCatalyst.scala#L76
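If you do want to stay on Kryo rather than map the Accumulo types away, a minimal sketch of what registering them might look like is below. Treat it as an experiment rather than a confirmed fix: whether Kryo round-trips Key and Value cleanly is exactly the caveat Corey raises above.

    import org.apache.spark.SparkConf

    // Sketch: switch the job to Kryo and register the Accumulo classes up front.
    // registerKryoClasses is available from Spark 1.2 onward.
    val conf = new SparkConf()
      .setAppName("accumulo-kryo-test")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.registerKryoClasses(Array(
      classOf[org.apache.accumulo.core.data.Key],
      classOf[org.apache.accumulo.core.data.Value]))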
On Thu, Mar 26, 2015 at 3:06 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

I'm guessing the Accumulo Key and Value classes are not serializable, so you would need to do something like

    val rdd = sc.newAPIHadoopRDD(...).map { case (key, value) =>
      (extractScalaType(key), extractScalaType(value)) }

where extractScalaType converts the Key or Value to a standard Scala type or case class or whatever -- basically it extracts the data from the Key or Value in a form usable in Scala.
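As a concrete illustration of Nick's idea, here is one way such an extraction could look, assuming you only need the row, column family, column qualifier, and value as plain Strings; the AccumuloRow case class and the helper signature are made up for the example.

    import org.apache.accumulo.core.data.{Key, Value}

    // Plain Strings are serializable, so the resulting RDD no longer carries
    // any Accumulo classes across the wire.
    case class AccumuloRow(row: String, cf: String, cq: String, value: String)

    def extractScalaType(key: Key, value: Value): AccumuloRow =
      AccumuloRow(
        key.getRow.toString,
        key.getColumnFamily.toString,
        key.getColumnQualifier.toString,
        new String(value.get, "UTF-8"))

    // usage: val rows = rdd.map { case (k, v) => extractScalaType(k, v) }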
On Thu, Mar 26, 2015 at 8:59 PM, Russ Weeks <rwe...@newbrightidea.com> wrote:

Hi, David,

This is the code that I use to create a JavaPairRDD from an Accumulo table:

    JavaSparkContext sc = new JavaSparkContext(conf);
    Job hadoopJob = Job.getInstance(conf, "TestSparkJob");
    hadoopJob.setInputFormatClass(AccumuloInputFormat.class);
    AccumuloInputFormat.setZooKeeperInstance(hadoopJob,
        conf.get(ZOOKEEPER_INSTANCE_NAME),
        conf.get(ZOOKEEPER_HOSTS));
    AccumuloInputFormat.setConnectorInfo(hadoopJob,
        conf.get(ACCUMULO_AGILE_USERNAME),
        new PasswordToken(conf.get(ACCUMULO_AGILE_PASSWORD)));
    AccumuloInputFormat.setInputTableName(hadoopJob, conf.get(ACCUMULO_TABLE_NAME));
    AccumuloInputFormat.setScanAuthorizations(hadoopJob, auths);
    JavaPairRDD<Key, Value> values = sc.newAPIHadoopRDD(
        hadoopJob.getConfiguration(), AccumuloInputFormat.class,
        Key.class, Value.class);

Key.class and Value.class are from org.apache.accumulo.core.data. I use a WholeRowIterator so that the Value is actually an encoded representation of an entire logical row; it's a useful convenience if you can be sure that your rows always fit in memory.

I haven't tested it since Spark 1.0.1, but I doubt anything important has changed.

Regards,
-Russ
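For anyone curious about the WholeRowIterator setup Russ mentions, a rough Scala sketch (untested, assuming Accumulo 1.6; the iterator priority and name below are arbitrary) would be to attach the iterator to the job before building the RDD and decode each packed Value afterwards:

    import org.apache.accumulo.core.client.IteratorSetting
    import org.apache.accumulo.core.client.mapreduce.InputFormatBase
    import org.apache.accumulo.core.iterators.user.WholeRowIterator

    // Attach the iterator to the scan configuration held by the Hadoop Job.
    InputFormatBase.addIterator(hadoopJob,
      new IteratorSetting(25, "wholeRows", classOf[WholeRowIterator]))

    // Later, inside a map over the RDD, unpack each encoded row back into its
    // individual cells (a java.util.SortedMap[Key, Value]):
    // val cells = WholeRowIterator.decodeRow(key, value)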
On Thu, Mar 26, 2015 at 11:41 AM, David Holiday <dav...@annaisystems.com> wrote:

*progress!*

I was able to figure out why the 'input INFO not set' error was occurring. The eagle-eyed among you will no doubt see that the following code is missing a closing ')':

    AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password")

As I'm doing this in spark-notebook, I'd been clicking the execute button and moving on because I wasn't seeing an error. What I forgot was that the notebook will do what spark-shell does when you leave off a closing ')': *it will wait forever for you to add it*. So the error was the result of the setConnectorInfo method never getting executed.

Unfortunately, I'm still unable to shove the Accumulo table data into an RDD that's usable to me. When I execute

    rddX.count

I get back

    res15: Long = 10000

which is the correct response -- there are 10,000 rows of data in the table I pointed to. However, when I try to grab the first element of data thusly:

    rddX.first

I get the following error:

    org.apache.spark.SparkException: Job aborted due to stage failure:
    Task 0.0 in stage 0.0 (TID 0) had a not serializable result:
    org.apache.accumulo.core.data.Key

Any thoughts on where to go from here?
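In the spirit of Nick's suggestion above, one quick way to sidestep that NotSerializableException is to convert each pair to plain Scala values before anything has to be shipped back to the driver. A small, hypothetical sketch:

    // Keys and Values stay on the executors; only serializable Strings travel
    // back to the driver, so first/take/collect no longer choke on
    // org.apache.accumulo.core.data.Key.
    val firstRow = rddX
      .map { case (key, value) => (key.getRow.toString, value.toString) }
      .first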
On Mar 26, 2015, at 8:35 AM, David Holiday <dav...@annaisystems.com> wrote:

hi Nick,

Unfortunately the Accumulo docs are woefully inadequate and, in some places, flat wrong. I'm not sure if this is a case where the docs are flat wrong, or if there's some wrinkle with spark-notebook in the mix that's messing everything up. I've been working with some people on Stack Overflow on this same issue (including one of the people from the spark-notebook team):

http://stackoverflow.com/questions/29244530/how-do-i-create-a-spark-rdd-from-accumulo-1-6-in-spark-notebook?noredirect=1#comment46755938_29244530

If you click the link you can see the entire thread of code, responses from the notebook, etc. I'm going to try invoking the same techniques both from within a stand-alone Scala program and from the shell itself to see if I can get some traction. I'll report back when I have more data.

cheers (and thx!)

On Mar 25, 2015, at 11:43 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

From a quick look at this link -- http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce -- it seems you need to call some static methods on AccumuloInputFormat in order to set the auth, table, and range settings. Try setting those config options first and then call newAPIHadoopRDD?
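Pulling Nick's pointer together with the connection details from the scanner example further down, a sketch of that configuration might look like the following. This is an unverified outline against the Accumulo 1.6 mapreduce API (the static setters live on AbstractInputFormat and InputFormatBase); the instance, ZooKeeper, table, and auth values are just the ones from the example below.

    import org.apache.accumulo.core.client.ClientConfiguration
    import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.security.Authorizations
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance()
    AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
    AbstractInputFormat.setZooKeeperInstance(job,
      new ClientConfiguration().withInstance("accumulo").withZkHosts("localhost:2181"))
    AbstractInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))
    InputFormatBase.setInputTableName(job, "batchtest1")

    val rddX = sparkContext.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[AccumuloInputFormat],
      classOf[org.apache.accumulo.core.data.Key],
      classOf[org.apache.accumulo.core.data.Value])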
On Thu, Mar 26, 2015 at 2:34 AM, David Holiday <dav...@annaisystems.com> wrote:

hi Irfan,

Thanks for getting back to me; I'll try the Accumulo list to be sure. What is the normal use case for Spark, though? I'm surprised that hooking it into something as common and popular as Accumulo isn't more of an everyday task.

On Mar 25, 2015, at 5:27 PM, Irfan Ahmad <ir...@cloudphysics.com> wrote:

Hmmm... this seems very Accumulo-specific, doesn't it? Not sure how to help with that.

On Tue, Mar 24, 2015 at 4:09 PM, David Holiday <dav...@annaisystems.com> wrote:

hi all,

I've got a Vagrant image with spark-notebook, Spark, Accumulo, and Hadoop all running. From the notebook I can manually create a scanner and pull test data from a table I created using one of the Accumulo examples:

    val instanceNameS = "accumulo"
    val zooServersS = "localhost:2181"
    val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
    val connector: Connector = instance.getConnector("root", new PasswordToken("password"))
    val auths = new Authorizations("exampleVis")
    val scanner = connector.createScanner("batchtest1", auths)

    scanner.setRange(new Range("row_0000000000", "row_0000000010"))
    for (entry: Entry[Key, Value] <- scanner) {
      println(entry.getKey + " is " + entry.getValue)
    }

This will print the first ten rows of table data. When I try to create the RDD thusly:

    val rdd2 = sparkContext.newAPIHadoopRDD(
      new Configuration(),
      classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
      classOf[org.apache.accumulo.core.data.Key],
      classOf[org.apache.accumulo.core.data.Value]
    )

I get an RDD returned to me that I can't do much with, due to the following error:

    java.io.IOException: Input info has not been set.
      at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
      at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
      at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
      at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
      at org.apache.spark.rdd.RDD.count(RDD.scala:927)

which totally makes sense, given that I haven't specified any parameters as to which table to connect to, what the auths are, etc.

So my question is: what do I need to do from here to get those first ten rows of table data into my RDD?

On Mar 19, 2015, at 11:25 AM, David Holiday <dav...@annaisystems.com> wrote:

kk -- I'll put something together and get back to you with more :-)

On Mar 19, 2015, at 10:59 AM, Irfan Ahmad <ir...@cloudphysics.com> wrote:

Once you set up spark-notebook, it'll handle the submits for interactive work. Non-interactive work isn't handled by it; for that, spark-kernel could be used.

Give it a shot... it only takes 5 minutes to get it running in local mode.

On Thu, Mar 19, 2015 at 9:51 AM, David Holiday <dav...@annaisystems.com> wrote:

hi all -- thanks for the alacritous replies! So regarding how to get things from the notebook to Spark and back, am I correct that spark-submit is the way to go?

On Mar 19, 2015, at 1:14 AM, Paolo Platter <paolo.plat...@agilelab.it> wrote:

Yes, I would suggest spark-notebook too. It's very simple to set up and it's growing pretty fast.

Paolo

Sent from my Windows Phone
From: Irfan Ahmad <ir...@cloudphysics.com>
Sent: 19/03/2015 04:05
To: davidh <dav...@annaisystems.com>
Cc: user@spark.apache.org
Subject: Re: iPython Notebook + Spark + Accumulo -- best practice?

I forgot to mention that there are also Zeppelin and jove-notebook, but I haven't got any experience with those yet.

On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad <ir...@cloudphysics.com> wrote:

Hi David,

W00t indeed, and great questions. On the notebook front, there are two options, depending on what you are looking for. You can either go with iPython 3 with spark-kernel as a backend, or you can use spark-notebook. Both have interesting tradeoffs.

If you're looking for a single notebook platform for your data scientists that has R and Python as well as a Spark shell, you'll likely want to go with iPython + spark-kernel. Downsides with the spark-kernel project are that data visualization isn't quite there yet, and it's early days for documentation, blogs, etc. The upside is that R and Python work beautifully and that the iPython committers are super-helpful.

If you are OK with a primarily Spark/Scala experience, then I suggest you go with spark-notebook. Upsides are that the project is a little further along, visualization support is better than spark-kernel's (though not as good as iPython with Python), and the committer is awesome with help. The downside is that you won't get R and Python.

FWIW: I'm using both at the moment!

Hope that helps.

On Wed, Mar 18, 2015 at 5:45 PM, davidh <dav...@annaisystems.com> wrote:

hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and scanning through this archive with only moderate success. In other words -- my way of saying sorry if this is answered somewhere obvious and I missed it :-)

I've been tasked with figuring out how to connect Notebook, Spark, and Accumulo together. The end user will do her work via the notebook. Thus far, I've successfully set up a Vagrant image containing Spark, Accumulo, and Hadoop. I was able to use some of the Accumulo example code to create a table populated with data, and to create a simple Scala program that, when fired off to Spark via spark-submit, connects to Accumulo and prints the first ten rows of data in the table. So w00t on that -- but now I'm left with more questions:

1) I'm still stuck on what's considered 'best practice' in terms of hooking all this together. Let's say Sally, a user, wants to do some analytic work on her data. She pecks the appropriate commands into the notebook and fires them off. How does this get wired together on the back end? Do I, from the notebook, use spark-submit to send a job to Spark and let Spark worry about hooking into Accumulo, or is it preferable to create some kind of open stream between the two?

2) If I want to extend Spark's API, do I need to first submit an endless job via spark-submit that does something like what this gentleman describes <http://blog.madhukaraphatak.com/extending-spark-api>? Is there an alternative (other than refactoring Spark's source) that doesn't involve extending the API via a job submission?

Ultimately, what I'm looking for is help locating docs, blogs, etc. that may shed some light on this.

t/y in advance!

d

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/iPython-Notebook-Spark-Accumulo-best-practice-tp22137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.