That's purely awesome! Don't hesitate to contribute your notebook back to the spark-notebook repo, even if it's rough; I'll help clean it up if needed.
The Vagrant image is also appealing :-) Congrats!

On Thu, Mar 26, 2015 at 22:22, David Holiday <dav...@annaisystems.com> wrote:

w00000000000000000t! That did it -- t/y so much!

I'm going to put together a pastebin or something that has all the code in one place, so if anyone else runs into this issue they'll have some working code to help them figure out what's going on.

On Mar 26, 2015, at 12:24 PM, Corey Nolet <cjno...@gmail.com> wrote:

Spark uses a SerializableWritable [1] to Java-serialize Writable objects. I've noticed (at least in Spark 1.2.1) that it breaks down with some objects when Kryo is used instead of regular Java serialization. Though it wraps the actual AccumuloInputFormat (another example of something you may want to do in the future), we have Accumulo working to load data from a table into Spark SQL [2]. The way Spark uses the InputFormat is very straightforward.

[1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala
[2] https://github.com/calrissian/accumulo-recipes/blob/master/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStoreCatalyst.scala#L76
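If you do want to stay on Kryo rather than map the Accumulo types away, a minimal sketch of what registering them might look like is below. Treat it as an experiment rather than a confirmed fix: whether Kryo round-trips Key and Value cleanly is exactly the caveat Corey raises above.

    import org.apache.spark.SparkConf

    // Sketch: switch the job to Kryo and register the Accumulo classes up front.
    // registerKryoClasses is available from Spark 1.2 onward.
    val conf = new SparkConf()
      .setAppName("accumulo-kryo-test")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.registerKryoClasses(Array(
      classOf[org.apache.accumulo.core.data.Key],
      classOf[org.apache.accumulo.core.data.Value]))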
On Thu, Mar 26, 2015 at 3:06 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

I'm guessing the Accumulo Key and Value classes are not serializable, so you would need to do something like

    val rdd = sc.newAPIHadoopRDD(...).map { case (key, value) =>
      (extractScalaType(key), extractScalaType(value)) }

where extractScalaType converts the Key or Value to a standard Scala type or case class or whatever -- basically it extracts the data from the Key or Value in a form usable in Scala.
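As a concrete illustration of Nick's idea, here is one way such an extraction could look, assuming you only need the row, column family, column qualifier, and value as plain Strings; the AccumuloRow case class and the helper signature are made up for the example.

    import org.apache.accumulo.core.data.{Key, Value}

    // Plain Strings are serializable, so the resulting RDD no longer carries
    // any Accumulo classes across the wire.
    case class AccumuloRow(row: String, cf: String, cq: String, value: String)

    def extractScalaType(key: Key, value: Value): AccumuloRow =
      AccumuloRow(
        key.getRow.toString,
        key.getColumnFamily.toString,
        key.getColumnQualifier.toString,
        new String(value.get, "UTF-8"))

    // usage: val rows = rdd.map { case (k, v) => extractScalaType(k, v) }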
On Thu, Mar 26, 2015 at 8:59 PM, Russ Weeks <rwe...@newbrightidea.com> wrote:

Hi, David,

This is the code that I use to create a JavaPairRDD from an Accumulo table:

    JavaSparkContext sc = new JavaSparkContext(conf);
    Job hadoopJob = Job.getInstance(conf, "TestSparkJob");
    hadoopJob.setInputFormatClass(AccumuloInputFormat.class);
    AccumuloInputFormat.setZooKeeperInstance(hadoopJob,
        conf.get(ZOOKEEPER_INSTANCE_NAME),
        conf.get(ZOOKEEPER_HOSTS));
    AccumuloInputFormat.setConnectorInfo(hadoopJob,
        conf.get(ACCUMULO_AGILE_USERNAME),
        new PasswordToken(conf.get(ACCUMULO_AGILE_PASSWORD)));
    AccumuloInputFormat.setInputTableName(hadoopJob, conf.get(ACCUMULO_TABLE_NAME));
    AccumuloInputFormat.setScanAuthorizations(hadoopJob, auths);
    JavaPairRDD<Key, Value> values = sc.newAPIHadoopRDD(
        hadoopJob.getConfiguration(), AccumuloInputFormat.class,
        Key.class, Value.class);

Key.class and Value.class are from org.apache.accumulo.core.data. I use a WholeRowIterator so that the Value is actually an encoded representation of an entire logical row; it's a useful convenience if you can be sure that your rows always fit in memory.

I haven't tested it since Spark 1.0.1, but I doubt anything important has changed.

Regards,
-Russ
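For anyone curious about the WholeRowIterator setup Russ mentions, a rough Scala sketch (untested, assuming Accumulo 1.6; the iterator priority and name below are arbitrary) would be to attach the iterator to the job before building the RDD and decode each packed Value afterwards:

    import org.apache.accumulo.core.client.IteratorSetting
    import org.apache.accumulo.core.client.mapreduce.InputFormatBase
    import org.apache.accumulo.core.iterators.user.WholeRowIterator

    // Attach the iterator to the scan configuration held by the Hadoop Job.
    InputFormatBase.addIterator(hadoopJob,
      new IteratorSetting(25, "wholeRows", classOf[WholeRowIterator]))

    // Later, inside a map over the RDD, unpack each encoded row back into its
    // individual cells (a java.util.SortedMap[Key, Value]):
    // val cells = WholeRowIterator.decodeRow(key, value)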
On Thu, Mar 26, 2015 at 11:41 AM, David Holiday <dav...@annaisystems.com> wrote:

*progress!*

I was able to figure out why the 'input INFO not set' error was occurring. The eagle-eyed among you will no doubt see that the following code is missing a closing ')':

    AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password")

As I'm doing this in spark-notebook, I'd been clicking the execute button and moving on because I wasn't seeing an error. What I forgot was that the notebook will do what spark-shell does when you leave off a closing ')': *it will wait forever for you to add it*. So the error was the result of the setConnectorInfo method never getting executed.

Unfortunately, I'm still unable to shove the Accumulo table data into an RDD that's usable to me. When I execute

    rddX.count

I get back

    res15: Long = 10000

which is the correct response -- there are 10,000 rows of data in the table I pointed to. However, when I try to grab the first element of data thusly:

    rddX.first

I get the following error:

    org.apache.spark.SparkException: Job aborted due to stage failure:
    Task 0.0 in stage 0.0 (TID 0) had a not serializable result:
    org.apache.accumulo.core.data.Key

Any thoughts on where to go from here?
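In the spirit of Nick's suggestion above, one quick way to sidestep that NotSerializableException is to convert each pair to plain Scala values before anything has to be shipped back to the driver. A small, hypothetical sketch:

    // Keys and Values stay on the executors; only serializable Strings travel
    // back to the driver, so first/take/collect no longer choke on
    // org.apache.accumulo.core.data.Key.
    val firstRow = rddX
      .map { case (key, value) => (key.getRow.toString, value.toString) }
      .first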
On Mar 26, 2015, at 8:35 AM, David Holiday <dav...@annaisystems.com> wrote:

hi Nick,

Unfortunately the Accumulo docs are woefully inadequate and, in some places, flat wrong. I'm not sure if this is a case where the docs are flat wrong, or if there's some wrinkle with spark-notebook in the mix that's messing everything up. I've been working with some people on Stack Overflow on this same issue (including one of the people from the spark-notebook team):

http://stackoverflow.com/questions/29244530/how-do-i-create-a-spark-rdd-from-accumulo-1-6-in-spark-notebook?noredirect=1#comment46755938_29244530

If you click the link you can see the entire thread of code, responses from the notebook, etc. I'm going to try invoking the same techniques both from within a stand-alone Scala program and from the shell itself to see if I can get some traction. I'll report back when I have more data.

cheers (and thx!)

On Mar 25, 2015, at 11:43 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

From a quick look at this link -- http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce -- it seems you need to call some static methods on AccumuloInputFormat in order to set the auth, table, and range settings. Try setting those config options first and then call newAPIHadoopRDD?
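Pulling Nick's pointer together with the connection details from the scanner example further down, a sketch of that configuration might look like the following. This is an unverified outline against the Accumulo 1.6 mapreduce API (the static setters live on AbstractInputFormat and InputFormatBase); the instance, ZooKeeper, table, and auth values are just the ones from the example below.

    import org.apache.accumulo.core.client.ClientConfiguration
    import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.security.Authorizations
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance()
    AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
    AbstractInputFormat.setZooKeeperInstance(job,
      new ClientConfiguration().withInstance("accumulo").withZkHosts("localhost:2181"))
    AbstractInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))
    InputFormatBase.setInputTableName(job, "batchtest1")

    val rddX = sparkContext.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[AccumuloInputFormat],
      classOf[org.apache.accumulo.core.data.Key],
      classOf[org.apache.accumulo.core.data.Value])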
On Thu, Mar 26, 2015 at 2:34 AM, David Holiday <dav...@annaisystems.com> wrote:

hi Irfan,

Thanks for getting back to me; I'll try the Accumulo list to be sure. What is the normal use case for Spark, though? I'm surprised that hooking it into something as common and popular as Accumulo isn't more of an everyday task.

On Mar 25, 2015, at 5:27 PM, Irfan Ahmad <ir...@cloudphysics.com> wrote:

Hmmm... this seems very Accumulo-specific, doesn't it? Not sure how to help with that.

On Tue, Mar 24, 2015 at 4:09 PM, David Holiday <dav...@annaisystems.com> wrote:

hi all,

I've got a Vagrant image with spark-notebook, Spark, Accumulo, and Hadoop all running. From the notebook I can manually create a scanner and pull test data from a table I created using one of the Accumulo examples:

    val instanceNameS = "accumulo"
    val zooServersS = "localhost:2181"
    val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
    val connector: Connector = instance.getConnector("root", new PasswordToken("password"))
    val auths = new Authorizations("exampleVis")
    val scanner = connector.createScanner("batchtest1", auths)

    scanner.setRange(new Range("row_0000000000", "row_0000000010"))
    for (entry: Entry[Key, Value] <- scanner) {
      println(entry.getKey + " is " + entry.getValue)
    }

This will print the first ten rows of table data. When I try to create the RDD thusly:

    val rdd2 = sparkContext.newAPIHadoopRDD(
      new Configuration(),
      classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
      classOf[org.apache.accumulo.core.data.Key],
      classOf[org.apache.accumulo.core.data.Value]
    )

I get an RDD returned to me that I can't do much with, due to the following error:

    java.io.IOException: Input info has not been set.
      at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
      at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
      at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
      at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
      at org.apache.spark.rdd.RDD.count(RDD.scala:927)

which totally makes sense, given that I haven't specified any parameters as to which table to connect to, what the auths are, etc.

So my question is: what do I need to do from here to get those first ten rows of table data into my RDD?

On Mar 19, 2015, at 11:25 AM, David Holiday <dav...@annaisystems.com> wrote:

kk -- I'll put something together and get back to you with more :-)

On Mar 19, 2015, at 10:59 AM, Irfan Ahmad <ir...@cloudphysics.com> wrote:

Once you set up spark-notebook, it'll handle the submits for interactive work. Non-interactive work isn't handled by it; for that, spark-kernel could be used.

Give it a shot... it only takes 5 minutes to get it running in local mode.

On Thu, Mar 19, 2015 at 9:51 AM, David Holiday <dav...@annaisystems.com> wrote:

hi all -- thanks for the alacritous replies! So regarding how to get things from the notebook to Spark and back, am I correct that spark-submit is the way to go?

On Mar 19, 2015, at 1:14 AM, Paolo Platter <paolo.plat...@agilelab.it> wrote:

Yes, I would suggest spark-notebook too. It's very simple to set up and it's growing pretty fast.

Paolo

Sent from my Windows Phone
From: Irfan Ahmad <ir...@cloudphysics.com>
Sent: 19/03/2015 04:05
To: davidh <dav...@annaisystems.com>
Cc: user@spark.apache.org
Subject: Re: iPython Notebook + Spark + Accumulo -- best practice?

I forgot to mention that there are also Zeppelin and jove-notebook, but I haven't got any experience with those yet.

On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad <ir...@cloudphysics.com> wrote:

Hi David,

W00t indeed, and great questions. On the notebook front, there are two options, depending on what you are looking for. You can either go with iPython 3 with spark-kernel as a backend, or you can use spark-notebook. Both have interesting tradeoffs.

If you're looking for a single notebook platform for your data scientists that has R and Python as well as a Spark shell, you'll likely want to go with iPython + spark-kernel. Downsides with the spark-kernel project are that data visualization isn't quite there yet, and it's early days for documentation, blogs, etc. The upside is that R and Python work beautifully and that the iPython committers are super-helpful.

If you are OK with a primarily Spark/Scala experience, then I suggest you go with spark-notebook. Upsides are that the project is a little further along, visualization support is better than spark-kernel's (though not as good as iPython with Python), and the committer is awesome with help. The downside is that you won't get R and Python.

FWIW: I'm using both at the moment!

Hope that helps.

On Wed, Mar 18, 2015 at 5:45 PM, davidh <dav...@annaisystems.com> wrote:

hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and scanning through this archive with only moderate success. In other words -- my way of saying sorry if this is answered somewhere obvious and I missed it :-)

I've been tasked with figuring out how to connect Notebook, Spark, and Accumulo together. The end user will do her work via the notebook. Thus far, I've successfully set up a Vagrant image containing Spark, Accumulo, and Hadoop. I was able to use some of the Accumulo example code to create a table populated with data, and to create a simple Scala program that, when fired off to Spark via spark-submit, connects to Accumulo and prints the first ten rows of data in the table. So w00t on that -- but now I'm left with more questions:

1) I'm still stuck on what's considered 'best practice' in terms of hooking all this together. Let's say Sally, a user, wants to do some analytic work on her data. She pecks the appropriate commands into the notebook and fires them off. How does this get wired together on the back end? Do I, from the notebook, use spark-submit to send a job to Spark and let Spark worry about hooking into Accumulo, or is it preferable to create some kind of open stream between the two?

2) If I want to extend Spark's API, do I need to first submit an endless job via spark-submit that does something like what this gentleman describes <http://blog.madhukaraphatak.com/extending-spark-api>? Is there an alternative (other than refactoring Spark's source) that doesn't involve extending the API via a job submission?

Ultimately, what I'm looking for is help locating docs, blogs, etc. that may shed some light on this.

t/y in advance!

d

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/iPython-Notebook-Spark-Accumulo-best-practice-tp22137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.