Spark uses a SerializableWritable [1] to Java-serialize Writable objects.
I've noticed (at least in Spark 1.2.1) that it breaks down with some
objects when Kryo is used instead of regular Java serialization. We have
Accumulo working to load data from a table into Spark SQL [2], though it
wraps the actual AccumuloInputFormat (another example of something you may
want to do in the future). The way Spark uses the InputFormat is very
straightforward.

[1]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala
[2]
https://github.com/calrissian/accumulo-recipes/blob/master/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStoreCatalyst.scala#L76
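
For illustration, here is the kind of wrapping that class does, applied by
hand to Accumulo's Key/Value (a sketch only: 'rdd' stands for an
RDD[(Key, Value)] returned by newAPIHadoopRDD, and whether this wrapper is
appropriate for your Spark version is worth checking):

import org.apache.accumulo.core.data.{Key, Value}
import org.apache.spark.SerializableWritable

// Wrap each Writable so it survives Java serialization between stages and
// on collect; as noted above, this can still misbehave when Kryo is enabled.
val wrapped = rdd.map { case (k: Key, v: Value) =>
  (new SerializableWritable(k), new SerializableWritable(v))
}
// wrapped.first._1.value  // unwraps back to the original Key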

On Thu, Mar 26, 2015 at 3:06 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> I'm guessing the Accumulo Key and Value classes are not serializable, so
> you would need to do something like
>
> val rdd = sc.newAPIHadoopRDD(...).map { case (key, value) =>
> (extractScalaType(key), extractScalaType(value)) }
>
> Where 'extractScalaType' converts the Key or Value to a standard Scala type
> or case class or whatever - basically it extracts the data from the Key or
> Value in a form usable in Scala.
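>
> For example, a minimal sketch of that idea (the Cell case class and the
> fields pulled out here are purely illustrative):
>
> import org.apache.accumulo.core.data.{Key, Value}
>
> case class Cell(row: String, family: String, qualifier: String, value: String)
>
> // Pull plain Strings out of the Writables so the resulting RDD elements
> // are serializable under both Java serialization and Kryo.
> def extractScalaType(key: Key, value: Value): Cell = Cell(
>   key.getRow.toString,
>   key.getColumnFamily.toString,
>   key.getColumnQualifier.toString,
>   new String(value.get, "UTF-8"))
>
> val rdd = sc.newAPIHadoopRDD(...).map { case (key, value) =>
>   extractScalaType(key, value) }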
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Thu, Mar 26, 2015 at 8:59 PM, Russ Weeks <rwe...@newbrightidea.com>
> wrote:
>
>> Hi, David,
>>
>> This is the code that I use to create a JavaPairRDD from an Accumulo
>> table:
>>
>> JavaSparkContext sc = new JavaSparkContext(conf);
>> Job hadoopJob = Job.getInstance(conf, "TestSparkJob");
>> hadoopJob.setInputFormatClass(AccumuloInputFormat.class);
>> AccumuloInputFormat.setZooKeeperInstance(hadoopJob,
>>     conf.get(ZOOKEEPER_INSTANCE_NAME),
>>     conf.get(ZOOKEEPER_HOSTS)
>> );
>> AccumuloInputFormat.setConnectorInfo(hadoopJob,
>>     conf.get(ACCUMULO_AGILE_USERNAME),
>>     new PasswordToken(conf.get(ACCUMULO_AGILE_PASSWORD))
>> );
>> AccumuloInputFormat.setInputTableName(hadoopJob, conf.get(ACCUMULO_TABLE_NAME));
>> AccumuloInputFormat.setScanAuthorizations(hadoopJob, auths);
>> JavaPairRDD<Key, Value> values = sc.newAPIHadoopRDD(
>>     hadoopJob.getConfiguration(), AccumuloInputFormat.class,
>>     Key.class, Value.class);
>>
>> Key.class and Value.class are from org.apache.accumulo.core.data. I use a
>> WholeRowIterator so that the Value is actually an encoded representation of
>> an entire logical row; it's a useful convenience if you can be sure that
>> your rows always fit in memory.
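>>
>> In case it helps, here is roughly how one of those encoded Values can be
>> unpacked again (an untested sketch; mapping qualifiers to UTF-8 strings is
>> just one way to shape the result):
>>
>> import scala.collection.JavaConverters._
>> import org.apache.accumulo.core.data.{Key, Value}
>> import org.apache.accumulo.core.iterators.user.WholeRowIterator
>>
>> // decodeRow reverses the WholeRowIterator encoding, giving back the
>> // individual cells of the logical row keyed by their full Keys.
>> def unpackRow(rowKey: Key, encoded: Value): Map[String, String] =
>>   WholeRowIterator.decodeRow(rowKey, encoded).asScala.map { case (k, v) =>
>>     k.getColumnQualifier.toString -> new String(v.get, "UTF-8")
>>   }.toMap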
>>
>> I haven't tested it since Spark 1.0.1 but I doubt anything important has
>> changed.
>>
>> Regards,
>> -Russ
>>
>>
>> On Thu, Mar 26, 2015 at 11:41 AM, David Holiday <dav...@annaisystems.com>
>> wrote:
>>
>>>  * progress!*
>>>
>>> i was able to figure out why the 'input INFO not set' error was
>>> occurring. the eagle-eyed among you will no doubt see the following code is
>>> missing a closing ')'
>>>
>>> AbstractInputFormat.setConnectorInfo(jobConf, "root", new 
>>> PasswordToken("password")
>>>
>>> as I'm doing this in spark-notebook, I'd been clicking the execute
>>> button and moving on because I wasn't seeing an error. what I forgot was
>>> that notebook is going to do what spark-shell will do when you leave off a
>>> closing ')' -- *it will wait forever for you to add it*. so the error
>>> was the result of the 'setConnectorInfo' method never getting executed.
>>>
>>> unfortunately, I'm still unable to shove the accumulo table data into an
>>> RDD that's useable to me. when I execute
>>>
>>> rddX.count
>>>
>>> I get back
>>>
>>> res15: Long = 10000
>>>
>>> which is the correct response - there are 10,000 rows of data in the
>>> table I pointed to. however, when I try to grab the first element of data
>>> thusly:
>>>
>>> rddX.first
>>>
>>> I get the following error:
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 0.0 in stage 0.0 (TID 0) had a not serializable result:
>>> org.apache.accumulo.core.data.Key
>>>
>>> any thoughts on where to go from here?
>>>   DAVID HOLIDAY
>>>  Software Engineer
>>>  760 607 3300 | Office
>>>  312 758 8385 | Mobile
>>>  dav...@annaisystems.com <broo...@annaisystems.com>
>>>
>>>
>>>
>>> www.AnnaiSystems.com
>>>
>>>  On Mar 26, 2015, at 8:35 AM, David Holiday <dav...@annaisystems.com>
>>> wrote:
>>>
>>>  hi Nick
>>>
>>> Unfortunately the Accumulo docs are woefully inadequate, and in some
>>> places, flat wrong. I'm not sure if this is a case where the docs are 'flat
>>> wrong', or if there's some wrinkle with spark-notebook in the mix that's
>>> messing everything up. I've been working with some people on stack overflow
>>> on this same issue (including one of the people from the spark-notebook
>>> team):
>>>
>>>
>>> http://stackoverflow.com/questions/29244530/how-do-i-create-a-spark-rdd-from-accumulo-1-6-in-spark-notebook?noredirect=1#comment46755938_29244530
>>>
>>> if you click the link you can see the entire thread of code, responses
>>> from notebook, etc. I'm going to try invoking the same techniques both from
>>> within a stand-alone scala program and from the shell itself to see if I
>>> can get some traction. I'll report back when I have more data.
>>>
>>> cheers (and thx!)
>>>
>>>
>>>
>>>
>>>  On Mar 25, 2015, at 11:43 PM, Nick Pentreath <nick.pentre...@gmail.com>
>>> wrote:
>>>
>>> From a quick look at this link -
>>> http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce -
>>> it seems you need to call some static methods on AccumuloInputFormat in
>>> order to set the auth, table, and range settings. Try setting these config
>>> options first and then call newAPIHadoopRDD?
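>>>
>>> Untested, but against the Accumulo 1.6 API (and reusing the instance,
>>> table, and auth values from the scanner snippet below) that would look
>>> roughly like:
>>>
>>> import org.apache.accumulo.core.client.ClientConfiguration
>>> import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
>>> import org.apache.accumulo.core.client.security.tokens.PasswordToken
>>> import org.apache.accumulo.core.data.{Key, Range, Value}
>>> import org.apache.accumulo.core.security.Authorizations
>>> import org.apache.hadoop.mapreduce.Job
>>> import scala.collection.JavaConverters._
>>>
>>> val job = Job.getInstance()
>>> AccumuloInputFormat.setZooKeeperInstance(job, ClientConfiguration.loadDefault()
>>>   .withInstance("accumulo").withZkHosts("localhost:2181"))
>>> AccumuloInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
>>> AccumuloInputFormat.setInputTableName(job, "batchtest1")
>>> AccumuloInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))
>>> AccumuloInputFormat.setRanges(job,
>>>   Seq(new Range("row_0000000000", "row_0000000010")).asJava)
>>>
>>> val rdd2 = sparkContext.newAPIHadoopRDD(job.getConfiguration,
>>>   classOf[AccumuloInputFormat], classOf[Key], classOf[Value])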
>>>
>>> On Thu, Mar 26, 2015 at 2:34 AM, David Holiday <dav...@annaisystems.com>
>>> wrote:
>>>
>>>> hi Irfan,
>>>>
>>>> thanks for getting back to me - i'll try the accumulo list to be sure.
>>>> what is the normal use case for spark though? I'm surprised that hooking it
>>>> into something as common and popular as accumulo isn't more of an every-day
>>>> task.
>>>>
>>>>
>>>>   On Mar 25, 2015, at 5:27 PM, Irfan Ahmad <ir...@cloudphysics.com>
>>>> wrote:
>>>>
>>>>   Hmmm.... this seems very accumulo-specific, doesn't it? Not sure how
>>>> to help with that.
>>>>
>>>>
>>>> *Irfan Ahmad*
>>>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/>
>>>>  Best of VMworld Finalist
>>>>  Best Cloud Management Award
>>>> NetworkWorld 10 Startups to Watch
>>>> EMA Most Notable Vendor
>>>>
>>>>   On Tue, Mar 24, 2015 at 4:09 PM, David Holiday <
>>>> dav...@annaisystems.com> wrote:
>>>>
>>>>>  hi all,
>>>>>
>>>>> got a vagrant image with spark notebook, spark, accumulo, and hadoop
>>>>> all running. from notebook I can manually create a scanner and pull test
>>>>> data from a table I created using one of the accumulo examples:
>>>>>
>>>>> val instanceNameS = "accumulo"
>>>>> val zooServersS = "localhost:2181"
>>>>> val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
>>>>> val connector: Connector = instance.getConnector("root", new PasswordToken("password"))
>>>>> val auths = new Authorizations("exampleVis")
>>>>> val scanner = connector.createScanner("batchtest1", auths)
>>>>>
>>>>> scanner.setRange(new Range("row_0000000000", "row_0000000010"))
>>>>> for (entry: Entry[Key, Value] <- scanner) {
>>>>>   println(entry.getKey + " is " + entry.getValue)
>>>>> }
>>>>>
>>>>> will give the first ten rows of table data. when I try to create the
>>>>> RDD thusly:
>>>>>
>>>>> val rdd2 =
>>>>>   sparkContext.newAPIHadoopRDD (
>>>>>     new Configuration(),
>>>>>     
>>>>> classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
>>>>>     classOf[org.apache.accumulo.core.data.Key],
>>>>>     classOf[org.apache.accumulo.core.data.Value]
>>>>>   )
>>>>>
>>>>> I get an RDD returned to me that I can't do much with due to the
>>>>> following error:
>>>>>
>>>>> java.io.IOException: Input info has not been set.
>>>>>     at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
>>>>>     at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
>>>>>     at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
>>>>>     at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
>>>>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>>>>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>>>>>     at scala.Option.getOrElse(Option.scala:120)
>>>>>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>>>>>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
>>>>>     at org.apache.spark.rdd.RDD.count(RDD.scala:927)
>>>>>
>>>>> which totally makes sense in light of the fact that I haven't
>>>>> specified any parameters as to which table to connect with, what the auths
>>>>> are, etc.
>>>>>
>>>>> so my question is: what do I need to do from here to get those first
>>>>> ten rows of table data into my RDD?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>    On Mar 19, 2015, at 11:25 AM, David Holiday <
>>>>> dav...@annaisystems.com> wrote:
>>>>>
>>>>>  kk - I'll put something together and get back to you with more :-)
>>>>>
>>>>>
>>>>>  On Mar 19, 2015, at 10:59 AM, Irfan Ahmad <ir...@cloudphysics.com>
>>>>> wrote:
>>>>>
>>>>> Once you set up spark-notebook, it'll handle the submits for
>>>>> interactive work. Non-interactive work is not handled by it; for that,
>>>>> spark-kernel could be used.
>>>>>
>>>>> Give it a shot ... it only takes 5 minutes to get it running in
>>>>> local-mode.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 19, 2015 at 9:51 AM, David Holiday <
>>>>> dav...@annaisystems.com> wrote:
>>>>>
>>>>>> hi all - thx for the alacritous replies! so regarding how to get
>>>>>> things from notebook to spark and back, am I correct that spark-submit is
>>>>>> the way to go?
>>>>>>
>>>>>>
>>>>>>  On Mar 19, 2015, at 1:14 AM, Paolo Platter <
>>>>>> paolo.plat...@agilelab.it> wrote:
>>>>>>
>>>>>>  Yes, I would suggest spark-notebook too.
>>>>>> It's very simple to set up and it's growing pretty fast.
>>>>>>
>>>>>> Paolo
>>>>>>
>>>>>> Sent from my Windows Phone
>>>>>>  ------------------------------
>>>>>> From: Irfan Ahmad <ir...@cloudphysics.com>
>>>>>> Sent: 19/03/2015 04:05
>>>>>> To: davidh <dav...@annaisystems.com>
>>>>>> Cc: user@spark.apache.org
>>>>>> Subject: Re: iPython Notebook + Spark + Accumulo -- best practice?
>>>>>>
>>>>>>  I forgot to mention that there is also Zeppelin and jove-notebook
>>>>>> but I haven't got any experience with those yet.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad <ir...@cloudphysics.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> W00t indeed and great questions. On the notebook front, there are
>>>>>>> two options depending on what you are looking for. You can either go 
>>>>>>> with
>>>>>>> iPython 3 with Spark-kernel as a backend or you can use spark-notebook.
>>>>>>> Both have interesting tradeoffs.
>>>>>>>
>>>>>>> If you are looking for a single notebook platform for your data
>>>>>>> scientists that has R and Python as well as a Spark shell, you'll
>>>>>>> likely want to go with iPython + spark-kernel. Downsides with the
>>>>>>> spark-kernel project are that data visualization isn't quite there yet
>>>>>>> and that it's early days for documentation, blogs, etc. Upside is that
>>>>>>> R and Python work beautifully and that the iPython committers are
>>>>>>> super-helpful.
>>>>>>>
>>>>>>> If you are OK with a primarily Spark/Scala experience, then I suggest
>>>>>>> you go with spark-notebook. Upsides are that the project is a little
>>>>>>> further along, visualization support is better than spark-kernel's
>>>>>>> (though not as good as iPython with Python), and the committer is
>>>>>>> awesome with help. Downside is that you won't get R and Python.
>>>>>>>
>>>>>>> FWIW: I'm using both at the moment!
>>>>>>>
>>>>>>> Hope that helps.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 18, 2015 at 5:45 PM, davidh <dav...@annaisystems.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing,
>>>>>>>> and
>>>>>>>> scanning through this archive with only moderate success. in other
>>>>>>>> words --
>>>>>>>> my way of saying sorry if this is answered somewhere obvious and I
>>>>>>>> missed it
>>>>>>>> :-)
>>>>>>>>
>>>>>>>> i've been tasked with figuring out how to connect Notebook, Spark,
>>>>>>>> and
>>>>>>>> Accumulo together. The end user will do her work via notebook. thus
>>>>>>>> far,
>>>>>>>> I've successfully setup a Vagrant image containing Spark, Accumulo,
>>>>>>>> and
>>>>>>>> Hadoop. I was able to use some of the Accumulo example code to
>>>>>>>> create a
>>>>>>>> table populated with data, create a simple program in scala that,
>>>>>>>> when fired
>>>>>>>> off to Spark via spark-submit, connects to accumulo and prints the
>>>>>>>> first ten
>>>>>>>> rows of data in the table. so w00t on that - but now I'm left with
>>>>>>>> more
>>>>>>>> questions:
>>>>>>>>
>>>>>>>> 1) I'm still stuck on what's considered 'best practice' in terms of
>>>>>>>> hooking
>>>>>>>> all this together. Let's say Sally, a user, wants to do some
>>>>>>>> analytic work
>>>>>>>> on her data. She pecks the appropriate commands into notebook and
>>>>>>>> fires them
>>>>>>>> off. how does this get wired together on the back end? Do I, from
>>>>>>>> notebook,
>>>>>>>> use spark-submit to send a job to spark and let spark worry about
>>>>>>>> hooking
>>>>>>>> into accumulo or is it preferable to create some kind of open
>>>>>>>> stream between
>>>>>>>> the two?
>>>>>>>>
>>>>>>>> 2) if I want to extend spark's api, do I need to first submit an
>>>>>>>> endless job
>>>>>>>> via spark-submit that does something like what this gentleman
>>>>>>>> describes
>>>>>>>> <http://blog.madhukaraphatak.com/extending-spark-api>  ? is there
>>>>>>>> an
>>>>>>>> alternative (other than refactoring spark's source) that doesn't
>>>>>>>> involve
>>>>>>>> extending the api via a job submission?
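>>>>>>>>
>>>>>>>> (if I understand that post right, the pattern it describes is roughly
>>>>>>>> this kind of implicit enrichment, which just lives in your own jar; the
>>>>>>>> names below are made up:)
>>>>>>>>
>>>>>>>> import org.apache.spark.rdd.RDD
>>>>>>>>
>>>>>>>> object CustomFunctions {
>>>>>>>>   // hypothetical example: adds a method to RDD[Int] purely from user code
>>>>>>>>   implicit class RichIntRDD(rdd: RDD[Int]) {
>>>>>>>>     def addOne: RDD[Int] = rdd.map(_ + 1)
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> // usage, after 'import CustomFunctions._':  myRdd.addOne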
>>>>>>>>
>>>>>>>> ultimately what I'm looking for is help locating docs, blogs, etc. that
>>>>>>>> may shed some light on this.
>>>>>>>>
>>>>>>>> t/y in advance!
>>>>>>>>
>>>>>>>> d
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/iPython-Notebook-Spark-Accumulo-best-practice-tp22137.html
>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>> Nabble.com <http://nabble.com/>.
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
