I'm guessing the Accumulo Key and Value classes are not serializable, so you would need to do something like
val rdd = sc.newAPIHadoopRDD(...).map { case (key, value) => (extractScalaType(key), extractScalaType(value)) }
where 'extractScalaType' converts the Key or Value to a standard Scala type or case class or whatever - basically extracts the data from the Key or Value in a form usable in Scala (rough sketch inline below). — Sent from Mailbox On Thu, Mar 26, 2015 at 8:59 PM, Russ Weeks <rwe...@newbrightidea.com> wrote: > Hi, David, > This is the code that I use to create a JavaPairRDD from an Accumulo table:
> JavaSparkContext sc = new JavaSparkContext(conf);
> Job hadoopJob = Job.getInstance(conf, "TestSparkJob");
> hadoopJob.setInputFormatClass(AccumuloInputFormat.class);
> AccumuloInputFormat.setZooKeeperInstance(hadoopJob,
>     conf.get(ZOOKEEPER_INSTANCE_NAME),
>     conf.get(ZOOKEEPER_HOSTS)
> );
> AccumuloInputFormat.setConnectorInfo(hadoopJob,
>     conf.get(ACCUMULO_AGILE_USERNAME),
>     new PasswordToken(conf.get(ACCUMULO_AGILE_PASSWORD))
> );
> AccumuloInputFormat.setInputTableName(hadoopJob, conf.get(ACCUMULO_TABLE_NAME));
> AccumuloInputFormat.setScanAuthorizations(hadoopJob, auths);
> JavaPairRDD<Key, Value> values =
>     sc.newAPIHadoopRDD(hadoopJob.getConfiguration(), AccumuloInputFormat.class,
>         Key.class, Value.class);
> Key.class and Value.class are from org.apache.accumulo.core.data. I use a > WholeRowIterator so that the Value is actually an encoded representation of > an entire logical row; it's a useful convenience if you can be sure that > your rows always fit in memory. > I haven't tested it since Spark 1.0.1 but I doubt anything important has > changed. > Regards, > -Russ > On Thu, Mar 26, 2015 at 11:41 AM, David Holiday <dav...@annaisystems.com> > wrote: >> *progress!* >> >> i was able to figure out why the 'input INFO not set' error was occurring. >> the eagle-eyed among you will no doubt see the following code is missing a >> closing ')'
>>
>> AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password")
>>
>> as I'm doing this in spark-notebook, I'd been clicking the execute button >> and moving on because I wasn't seeing an error. what I forgot was that >> notebook is going to do what spark-shell will do when you leave off a >> closing ')' -- *it will wait forever for you to add it*. so the error was >> the result of the 'setConnectorInfo' method never getting executed. >> >> unfortunately, I'm still unable to shove the accumulo table data into an >> RDD that's usable to me. when I execute
>>
>> rddX.count
>>
>> I get back
>>
>> res15: Long = 10000
>>
>> which is the correct response - there are 10,000 rows of data in the table >> I pointed to. however, when I try to grab the first element of data thusly:
>>
>> rddX.first
>>
>> I get the following error:
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.accumulo.core.data.Key
>>
>> any thoughts on where to go from here? >> DAVID HOLIDAY >> Software Engineer >> 760 607 3300 | Office >> 312 758 8385 | Mobile >> dav...@annaisystems.com <broo...@annaisystems.com> >> >> >> >> www.AnnaiSystems.com >> >> On Mar 26, 2015, at 8:35 AM, David Holiday <dav...@annaisystems.com> >> wrote: >> >> hi Nick >> >> Unfortunately the Accumulo docs are woefully inadequate, and in some >> places, flat wrong. I'm not sure if this is a case where the docs are 'flat >> wrong', or if there's some wrinkle with spark-notebook in the mix that's >> messing everything up.
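To make the 'extractScalaType' suggestion above concrete, here is a rough, untested sketch of the kind of conversion I mean. The helper name and the choice of fields are purely illustrative, and 'hadoopJob' is the configured Job from Russ's snippet - adjust to whatever parts of the Key/Value you actually need:

import org.apache.accumulo.core.data.{Key, Value}

// Copy the interesting pieces of each Key/Value into plain Strings and a Long,
// all of which serialize fine, so Spark no longer chokes on
// org.apache.accumulo.core.data.Key when shipping results back to the driver.
def extractScalaType(key: Key, value: Value): (String, String, String, Long, String) =
  (key.getRow.toString,
   key.getColumnFamily.toString,
   key.getColumnQualifier.toString,
   key.getTimestamp,
   new String(value.get(), "UTF-8"))

val rdd = sc.newAPIHadoopRDD(
  hadoopJob.getConfiguration(),
  classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
  classOf[Key],
  classOf[Value]
).map { case (key, value) => extractScalaType(key, value) }

rdd.first  // should now succeed: the tuple contains only serializable types

Copying the data out like this also sidesteps the usual Hadoop-RDD caveat about the record reader reusing the same Key/Value objects between records. And if you go the WholeRowIterator route Russ mentions, I believe WholeRowIterator.decodeRow(key, value) will hand you back the row's SortedMap[Key, Value], which you can then flatten into Scala types the same way.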
I've been working with some people on stack overflow >> on this same issue (including one of the people from the spark-notebook >> team): >> >> >> http://stackoverflow.com/questions/29244530/how-do-i-create-a-spark-rdd-from-accumulo-1-6-in-spark-notebook?noredirect=1#comment46755938_29244530 >> >> if you click the link you can see the entire thread of code, responses >> from notebook, etc. I'm going to try invoking the same techniques both from >> within a stand-alone scala program and from the shell itself to see if I >> can get some traction. I'll report back when I have more data. >> >> cheers (and thx!) >> >> >> >> DAVID HOLIDAY >> Software Engineer >> 760 607 3300 | Office >> 312 758 8385 | Mobile >> dav...@annaisystems.com <broo...@annaisystems.com> >> >> >> <GetFileAttachment.jpg> >> www.AnnaiSystems.com <http://www.annaisystems.com/> >> >> On Mar 25, 2015, at 11:43 PM, Nick Pentreath <nick.pentre...@gmail.com> >> wrote: >> >> From a quick look at this link - >> http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it >> seems you need to call some static methods on AccumuloInputFormat in order >> to set the auth, table, and range settings. Try setting these config >> options first and then call newAPIHadoopRDD? >> >> On Thu, Mar 26, 2015 at 2:34 AM, David Holiday <dav...@annaisystems.com> >> wrote: >> >>> hi Irfan, >>> >>> thanks for getting back to me - i'll try the accumulo list to be sure. >>> what is the normal use case for spark though? I'm surprised that hooking it >>> into something as common and popular as accumulo isn't more of an everyday >>> task. >>> >>> DAVID HOLIDAY >>> Software Engineer >>> 760 607 3300 | Office >>> 312 758 8385 | Mobile >>> dav...@annaisystems.com <broo...@annaisystems.com> >>> >>> >>> <GetFileAttachment.jpg> >>> www.AnnaiSystems.com <http://www.annaisystems.com/> >>> >>> On Mar 25, 2015, at 5:27 PM, Irfan Ahmad <ir...@cloudphysics.com> >>> wrote: >>> >>> Hmmm.... this seems very accumulo-specific, doesn't it? Not sure how >>> to help with that. >>> >>> >>> *Irfan Ahmad* >>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/> >>> Best of VMworld Finalist >>> Best Cloud Management Award >>> NetworkWorld 10 Startups to Watch >>> EMA Most Notable Vendor >>> >>> On Tue, Mar 24, 2015 at 4:09 PM, David Holiday <dav...@annaisystems.com >>> > wrote: >>> >>>> hi all, >>>> >>>> got a vagrant image with spark notebook, spark, accumulo, and hadoop >>>> all running. from notebook I can manually create a scanner and pull test >>>> data from a table I created using one of the accumulo examples:
>>>>
>>>> val instanceNameS = "accumulo"
>>>> val zooServersS = "localhost:2181"
>>>> val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
>>>> val connector: Connector = instance.getConnector("root", new PasswordToken("password"))
>>>> val auths = new Authorizations("exampleVis")
>>>> val scanner = connector.createScanner("batchtest1", auths)
>>>>
>>>> scanner.setRange(new Range("row_0000000000", "row_0000000010"))
>>>> for (entry: Entry[Key, Value] <- scanner) {
>>>>   println(entry.getKey + " is " + entry.getValue)
>>>> }
>>>>
>>>> will give the first ten rows of table data.
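This is basically Nick's suggestion spelled out, and I think it's also the fix for the "Input info has not been set" error that shows up just below: the Configuration handed to newAPIHadoopRDD has to have the connector, instance, table, and auth info set on it first. A rough, untested sketch against Accumulo 1.6, reusing the instance/table/auths from the scanner snippet above (swap in your own values):

import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance()
val clientConf = ClientConfiguration.loadDefault().withInstance("accumulo").withZkHosts("localhost:2181")

// tell the input format which instance/user/table/auths to scan
AbstractInputFormat.setZooKeeperInstance(job, clientConf)
AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
AbstractInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))
InputFormatBase.setInputTableName(job, "batchtest1")

// hand Spark the *configured* job's Configuration rather than a bare new Configuration()
val rdd = sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[AccumuloInputFormat],
  classOf[Key],
  classOf[Value]
)

I'm calling the setters on AbstractInputFormat and InputFormatBase, where they are declared, because as far as I know Scala won't resolve Java statics that AccumuloInputFormat only inherits - presumably why the setConnectorInfo call further up in the thread goes through AbstractInputFormat as well. If you only want the ten rows from the scanner example, InputFormatBase.setRanges can narrow the scan the same way scanner.setRange does.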
when I try to create the RDD >>>> thusly:
>>>>
>>>> val rdd2 = sparkContext.newAPIHadoopRDD(
>>>>   new Configuration(),
>>>>   classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
>>>>   classOf[org.apache.accumulo.core.data.Key],
>>>>   classOf[org.apache.accumulo.core.data.Value]
>>>> )
>>>>
>>>> I get an RDD returned to me that I can't do much with due to the >>>> following error:
>>>>
>>>> java.io.IOException: Input info has not been set.
>>>>   at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
>>>>   at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
>>>>   at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
>>>>   at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
>>>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>>>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>>>>   at scala.Option.getOrElse(Option.scala:120)
>>>>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
>>>>   at org.apache.spark.rdd.RDD.count(RDD.scala:927)
>>>>
>>>> which totally makes sense in light of the fact that I haven't specified >>>> any parameters as to which table to connect with, what the auths are, etc. >>>> >>>> so my question is: what do I need to do from here to get those first ten >>>> rows of table data into my RDD? >>>> >>>> >>>> >>>> DAVID HOLIDAY >>>> Software Engineer >>>> 760 607 3300 | Office >>>> 312 758 8385 | Mobile >>>> dav...@annaisystems.com <broo...@annaisystems.com> >>>> >>>> >>>> <GetFileAttachment.jpg> >>>> www.AnnaiSystems.com <http://www.annaisystems.com/> >>>> >>>> On Mar 19, 2015, at 11:25 AM, David Holiday <dav...@annaisystems.com> >>>> wrote: >>>> >>>> kk - I'll put something together and get back to you with more :-) >>>> >>>> DAVID HOLIDAY >>>> Software Engineer >>>> 760 607 3300 | Office >>>> 312 758 8385 | Mobile >>>> dav...@annaisystems.com <broo...@annaisystems.com> >>>> >>>> >>>> <GetFileAttachment.jpg> >>>> www.AnnaiSystems.com <http://www.annaisystems.com/> >>>> >>>> On Mar 19, 2015, at 10:59 AM, Irfan Ahmad <ir...@cloudphysics.com> >>>> wrote: >>>> >>>> Once you set up spark-notebook, it'll handle the submits for >>>> interactive work. Non-interactive work is not handled by it; for that, >>>> spark-kernel could be used. >>>> >>>> Give it a shot ... it only takes 5 minutes to get it running in >>>> local mode. >>>> >>>> >>>> *Irfan Ahmad* >>>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/> >>>> Best of VMworld Finalist >>>> Best Cloud Management Award >>>> NetworkWorld 10 Startups to Watch >>>> EMA Most Notable Vendor >>>> >>>> On Thu, Mar 19, 2015 at 9:51 AM, David Holiday <dav...@annaisystems.com> >>>> wrote: >>>> >>>>> hi all - thx for the alacritous replies! so regarding how to get things >>>>> from notebook to spark and back, am I correct that spark-submit is the way >>>>> to go? >>>>> >>>>> DAVID HOLIDAY >>>>> Software Engineer >>>>> 760 607 3300 | Office >>>>> 312 758 8385 | Mobile >>>>> dav...@annaisystems.com <broo...@annaisystems.com> >>>>> >>>>> >>>>> <GetFileAttachment.jpg> >>>>> www.AnnaiSystems.com <http://www.annaisystems.com/> >>>>> >>>>> On Mar 19, 2015, at 1:14 AM, Paolo Platter <paolo.plat...@agilelab.it> >>>>> wrote: >>>>> >>>>> Yes, I would suggest spark-notebook too.
>>>>> It's very simple to set up and it's growing pretty fast. >>>>> >>>>> Paolo >>>>> >>>>> Sent from my Windows Phone >>>>> ------------------------------ >>>>> From: Irfan Ahmad <ir...@cloudphysics.com> >>>>> Sent: 19/03/2015 04:05 >>>>> To: davidh <dav...@annaisystems.com> >>>>> Cc: user@spark.apache.org >>>>> Subject: Re: iPython Notebook + Spark + Accumulo -- best practice? >>>>> >>>>> I forgot to mention that there is also Zeppelin and jove-notebook but >>>>> I haven't got any experience with those yet. >>>>> >>>>> >>>>> *Irfan Ahmad* >>>>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/> >>>>> Best of VMworld Finalist >>>>> Best Cloud Management Award >>>>> NetworkWorld 10 Startups to Watch >>>>> EMA Most Notable Vendor >>>>> >>>>> On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad <ir...@cloudphysics.com> >>>>> wrote: >>>>> >>>>>> Hi David, >>>>>> >>>>>> W00t indeed and great questions. On the notebook front, there are >>>>>> two options depending on what you are looking for. You can either go with >>>>>> iPython 3 with Spark-kernel as a backend or you can use spark-notebook. >>>>>> Both have interesting tradeoffs. >>>>>> >>>>>> If you are looking for a single notebook platform for your data >>>>>> scientists that has R and Python as well as a Spark Shell, you'll likely >>>>>> want >>>>>> to go with iPython + Spark-kernel. Downsides with the spark-kernel >>>>>> project >>>>>> are that data visualization isn't quite there yet, and it's early days for >>>>>> documentation and blogs/etc. Upside is that R and Python work beautifully >>>>>> and that the ipython committers are super-helpful. >>>>>> >>>>>> If you are OK with a primarily spark/scala experience, then I >>>>>> suggest you go with spark-notebook. Upsides are that the project is a little >>>>>> further along, visualization support is better than spark-kernel's (though >>>>>> not as good as iPython with Python) and the committer is awesome with >>>>>> help. >>>>>> Downside is that you won't get R and Python. >>>>>> >>>>>> FWIW: I'm using both at the moment! >>>>>> >>>>>> Hope that helps. >>>>>> >>>>>> >>>>>> *Irfan Ahmad* >>>>>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/> >>>>>> Best of VMworld Finalist >>>>>> Best Cloud Management Award >>>>>> NetworkWorld 10 Startups to Watch >>>>>> EMA Most Notable Vendor >>>>>> >>>>>> On Wed, Mar 18, 2015 at 5:45 PM, davidh <dav...@annaisystems.com> >>>>>> wrote: >>>>>> >>>>>>> hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and >>>>>>> scanning through this archive with only moderate success. in other >>>>>>> words -- >>>>>>> my way of saying sorry if this is answered somewhere obvious and I >>>>>>> missed it >>>>>>> :-) >>>>>>> >>>>>>> i've been tasked with figuring out how to connect Notebook, Spark, and >>>>>>> Accumulo together. The end user will do her work via notebook. thus >>>>>>> far, >>>>>>> I've successfully set up a Vagrant image containing Spark, Accumulo, >>>>>>> and >>>>>>> Hadoop. I was able to use some of the Accumulo example code to create >>>>>>> a >>>>>>> table populated with data, and a simple program in scala that, >>>>>>> when fired >>>>>>> off to Spark via spark-submit, connects to accumulo and prints the >>>>>>> first ten >>>>>>> rows of data in the table. so w00t on that - but now I'm left with >>>>>>> more >>>>>>> questions: >>>>>>> >>>>>>> 1) I'm still stuck on what's considered 'best practice' in terms of >>>>>>> hooking >>>>>>> all this together.
Let's say Sally, a user, wants to do some >>>>>>> analytic work >>>>>>> on her data. She pecks the appropriate commands into notebook and >>>>>>> fires them >>>>>>> off. how does this get wired together on the back end? Do I, from >>>>>>> notebook, >>>>>>> use spark-submit to send a job to spark and let spark worry about >>>>>>> hooking >>>>>>> into accumulo or is it preferable to create some kind of open stream >>>>>>> between >>>>>>> the two? >>>>>>> >>>>>>> 2) if I want to extend spark's api, do I need to first submit an >>>>>>> endless job >>>>>>> via spark-submit that does something like what this gentleman >>>>>>> describes >>>>>>> <http://blog.madhukaraphatak.com/extending-spark-api> ? is there an >>>>>>> alternative (other than refactoring spark's source) that doesn't >>>>>>> involve >>>>>>> extending the api via a job submission? >>>>>>> >>>>>>> ultimately what I'm looking for is help locating docs, blogs, etc. that >>>>>>> may shed >>>>>>> some light on this. >>>>>>> >>>>>>> t/y in advance! >>>>>>> >>>>>>> d >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> View this message in context: >>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/iPython-Notebook-Spark-Accumulo-best-practice-tp22137.html >>>>>>> Sent from the Apache Spark User List mailing list archive at >>>>>>> Nabble.com <http://nabble.com/>. >>>>>>> >>>>>>> --------------------------------------------------------------------- >>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>>>>> For additional commands, e-mail: user-h...@spark.apache.org >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>> >>>> >>>> >>> >>> >> >> >>
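One last note on question 2) above, since it's buried at the bottom of the thread: I don't think you need a long-running job just to add operations. If I remember right, the post David links boils down to enriching RDDs via implicit conversions, and that works the same whether the code runs from a notebook cell, the shell, or a spark-submit'ed jar. A rough sketch with hypothetical names (the countDistinctRows method and the (rowId, value) tuple shape are just for illustration):

import org.apache.spark.rdd.RDD

// Adds a custom operation to any RDD of (rowId, value) pairs via an implicit
// class - no changes to Spark's source and no special job submission needed.
object AccumuloRddFunctions {
  implicit class RichKvRdd(rdd: RDD[(String, String)]) {
    def countDistinctRows(): Long = rdd.map(_._1).distinct().count()
  }
}

// usage, e.g. from a notebook cell:
//   import AccumuloRddFunctions._
//   val n = myKvRdd.countDistinctRows()

The only requirement is that the object holding the implicits is on the classpath of whatever is driving the job - the notebook, spark-shell, or your packaged jar.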