I'm guessing the Accumulo Key and Value classes are not serializable, so you would need to do something like
val rdd = sc.newAPIHadoopRDD(...).map { case (key, value) => (extractScalaType(key), extractScalaType(value)) }
where 'extractScalaType' converts the Key or Value to a standard Scala type or case class or whatever - basically extracts the data from the Key or Value in a form usable in Scala (rough sketch inline below). — Sent from Mailbox On Thu, Mar 26, 2015 at 8:59 PM, Russ Weeks <rwe...@newbrightidea.com> wrote: > Hi, David, > This is the code that I use to create a JavaPairRDD from an Accumulo table:
> JavaSparkContext sc = new JavaSparkContext(conf);
> Job hadoopJob = Job.getInstance(conf, "TestSparkJob");
> hadoopJob.setInputFormatClass(AccumuloInputFormat.class);
> AccumuloInputFormat.setZooKeeperInstance(hadoopJob,
>     conf.get(ZOOKEEPER_INSTANCE_NAME),
>     conf.get(ZOOKEEPER_HOSTS)
> );
> AccumuloInputFormat.setConnectorInfo(hadoopJob,
>     conf.get(ACCUMULO_AGILE_USERNAME),
>     new PasswordToken(conf.get(ACCUMULO_AGILE_PASSWORD))
> );
> AccumuloInputFormat.setInputTableName(hadoopJob, conf.get(ACCUMULO_TABLE_NAME));
> AccumuloInputFormat.setScanAuthorizations(hadoopJob, auths);
> JavaPairRDD<Key, Value> values =
>     sc.newAPIHadoopRDD(hadoopJob.getConfiguration(), AccumuloInputFormat.class,
>         Key.class, Value.class);
> Key.class and Value.class are from org.apache.accumulo.core.data. I use a > WholeRowIterator so that the Value is actually an encoded representation of > an entire logical row; it's a useful convenience if you can be sure that > your rows always fit in memory. > I haven't tested it since Spark 1.0.1 but I doubt anything important has > changed. > Regards, > -Russ > On Thu, Mar 26, 2015 at 11:41 AM, David Holiday <dav...@annaisystems.com> > wrote: >> *progress!* >> >> i was able to figure out why the 'input INFO not set' error was occurring. >> the eagle-eyed among you will no doubt see the following code is missing a >> closing ')'
>>
>> AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password")
>>
>> as I'm doing this in spark-notebook, I'd been clicking the execute button >> and moving on because I wasn't seeing an error. what I forgot was that >> notebook is going to do what spark-shell will do when you leave off a >> closing ')' -- *it will wait forever for you to add it*. so the error was >> the result of the 'setConnectorInfo' method never getting executed. >> >> unfortunately, I'm still unable to shove the accumulo table data into an >> RDD that's usable to me. when I execute
>>
>> rddX.count
>>
>> I get back
>>
>> res15: Long = 10000
>>
>> which is the correct response - there are 10,000 rows of data in the table >> I pointed to. however, when I try to grab the first element of data thusly:
>>
>> rddX.first
>>
>> I get the following error:
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.accumulo.core.data.Key
>>
>> any thoughts on where to go from here? >> DAVID HOLIDAY >> Software Engineer >> 760 607 3300 | Office >> 312 758 8385 | Mobile >> dav...@annaisystems.com <broo...@annaisystems.com> >> >> >> >> www.AnnaiSystems.com >> >> On Mar 26, 2015, at 8:35 AM, David Holiday <dav...@annaisystems.com> >> wrote: >> >> hi Nick >> >> Unfortunately the Accumulo docs are woefully inadequate, and in some >> places, flat wrong. I'm not sure if this is a case where the docs are 'flat >> wrong', or if there's some wrinkle with spark-notebook in the mix that's >> messing everything up.
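To make the 'extractScalaType' suggestion above concrete, here is a rough, untested sketch of the kind of conversion I mean. The helper name and the choice of fields are purely illustrative, and 'hadoopJob' is the configured Job from Russ's snippet - adjust to whatever parts of the Key/Value you actually need:

import org.apache.accumulo.core.data.{Key, Value}

// Copy the interesting pieces of each Key/Value into plain Strings and a Long,
// all of which serialize fine, so Spark no longer chokes on
// org.apache.accumulo.core.data.Key when shipping results back to the driver.
def extractScalaType(key: Key, value: Value): (String, String, String, Long, String) =
  (key.getRow.toString,
   key.getColumnFamily.toString,
   key.getColumnQualifier.toString,
   key.getTimestamp,
   new String(value.get(), "UTF-8"))

val rdd = sc.newAPIHadoopRDD(
  hadoopJob.getConfiguration(),
  classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
  classOf[Key],
  classOf[Value]
).map { case (key, value) => extractScalaType(key, value) }

rdd.first  // should now succeed: the tuple contains only serializable types

Copying the data out like this also sidesteps the usual Hadoop-RDD caveat about the record reader reusing the same Key/Value objects between records. And if you go the WholeRowIterator route Russ mentions, I believe WholeRowIterator.decodeRow(key, value) will hand you back the row's SortedMap[Key, Value], which you can then flatten into Scala types the same way.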
I've been working with some people on stack overflow >> on this same issue (including one of the people from the spark-notebook >> team): >> >> >> http://stackoverflow.com/questions/29244530/how-do-i-create-a-spark-rdd-from-accumulo-1-6-in-spark-notebook?noredirect=1#comment46755938_29244530 >> >> if you click the link you can see the entire thread of code, responses >> from notebook, etc. I'm going to try invoking the same techniques both from >> within a stand-alone scala program and from the shell itself to see if I >> can get some traction. I'll report back when I have more data. >> >> cheers (and thx!) >> >> >> >> DAVID HOLIDAY >> Software Engineer >> 760 607 3300 | Office >> 312 758 8385 | Mobile >> dav...@annaisystems.com <broo...@annaisystems.com> >> >> >> <GetFileAttachment.jpg> >> www.AnnaiSystems.com <http://www.annaisystems.com/> >> >> On Mar 25, 2015, at 11:43 PM, Nick Pentreath <nick.pentre...@gmail.com> >> wrote: >> >> From a quick look at this link - >> http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it >> seems you need to call some static methods on AccumuloInputFormat in order >> to set the auth, table, and range settings. Try setting these config >> options first and then call newAPIHadoopRDD? >> >> On Thu, Mar 26, 2015 at 2:34 AM, David Holiday <dav...@annaisystems.com> >> wrote: >> >>> hi Irfan, >>> >>> thanks for getting back to me - i'll try the accumulo list to be sure. >>> what is the normal use case for spark though? I'm surprised that hooking it >>> into something as common and popular as accumulo isn't more of an everyday >>> task. >>> >>> DAVID HOLIDAY >>> Software Engineer >>> 760 607 3300 | Office >>> 312 758 8385 | Mobile >>> dav...@annaisystems.com <broo...@annaisystems.com> >>> >>> >>> <GetFileAttachment.jpg> >>> www.AnnaiSystems.com <http://www.annaisystems.com/> >>> >>> On Mar 25, 2015, at 5:27 PM, Irfan Ahmad <ir...@cloudphysics.com> >>> wrote: >>> >>> Hmmm.... this seems very accumulo-specific, doesn't it? Not sure how >>> to help with that. >>> >>> >>> *Irfan Ahmad* >>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/> >>> Best of VMworld Finalist >>> Best Cloud Management Award >>> NetworkWorld 10 Startups to Watch >>> EMA Most Notable Vendor >>> >>> On Tue, Mar 24, 2015 at 4:09 PM, David Holiday <dav...@annaisystems.com >>> > wrote: >>> >>>> hi all, >>>> >>>> got a vagrant image with spark notebook, spark, accumulo, and hadoop >>>> all running. from notebook I can manually create a scanner and pull test >>>> data from a table I created using one of the accumulo examples:
>>>>
>>>> val instanceNameS = "accumulo"
>>>> val zooServersS = "localhost:2181"
>>>> val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
>>>> val connector: Connector = instance.getConnector("root", new PasswordToken("password"))
>>>> val auths = new Authorizations("exampleVis")
>>>> val scanner = connector.createScanner("batchtest1", auths)
>>>>
>>>> scanner.setRange(new Range("row_0000000000", "row_0000000010"))
>>>> for (entry: Entry[Key, Value] <- scanner) {
>>>>   println(entry.getKey + " is " + entry.getValue)
>>>> }
>>>>
>>>> will give the first ten rows of table data.
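This is basically Nick's suggestion spelled out, and I think it's also the fix for the "Input info has not been set" error that shows up just below: the Configuration handed to newAPIHadoopRDD has to have the connector, instance, table, and auth info set on it first. A rough, untested sketch against Accumulo 1.6, reusing the instance/table/auths from the scanner snippet above (swap in your own values):

import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance()
val clientConf = ClientConfiguration.loadDefault().withInstance("accumulo").withZkHosts("localhost:2181")

// tell the input format which instance/user/table/auths to scan
AbstractInputFormat.setZooKeeperInstance(job, clientConf)
AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
AbstractInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))
InputFormatBase.setInputTableName(job, "batchtest1")

// hand Spark the *configured* job's Configuration rather than a bare new Configuration()
val rdd = sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[AccumuloInputFormat],
  classOf[Key],
  classOf[Value]
)

I'm calling the setters on AbstractInputFormat and InputFormatBase, where they are declared, because as far as I know Scala won't resolve Java statics that AccumuloInputFormat only inherits - presumably why the setConnectorInfo call further up in the thread goes through AbstractInputFormat as well. If you only want the ten rows from the scanner example, InputFormatBase.setRanges can narrow the scan the same way scanner.setRange does.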
when I try to create the RDD >>>> thusly:
>>>>
>>>> val rdd2 = sparkContext.newAPIHadoopRDD(
>>>>   new Configuration(),
>>>>   classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
>>>>   classOf[org.apache.accumulo.core.data.Key],
>>>>   classOf[org.apache.accumulo.core.data.Value]
>>>> )
>>>>
>>>> I get an RDD returned to me that I can't do much with due to the >>>> following error:
>>>>
>>>> java.io.IOException: Input info has not been set.
>>>>   at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
>>>>   at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
>>>>   at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
>>>>   at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
>>>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>>>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>>>>   at scala.Option.getOrElse(Option.scala:120)
>>>>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
>>>>   at org.apache.spark.rdd.RDD.count(RDD.scala:927)
>>>>
>>>> which totally makes sense in light of the fact that I haven't specified >>>> any parameters as to which table to connect with, what the auths are, etc. >>>> >>>> so my question is: what do I need to do from here to get those first ten >>>> rows of table data into my RDD? >>>> >>>> >>>> >>>> DAVID HOLIDAY >>>> Software Engineer >>>> 760 607 3300 | Office >>>> 312 758 8385 | Mobile >>>> dav...@annaisystems.com <broo...@annaisystems.com> >>>> >>>> >>>> <GetFileAttachment.jpg> >>>> www.AnnaiSystems.com <http://www.annaisystems.com/> >>>> >>>> On Mar 19, 2015, at 11:25 AM, David Holiday <dav...@annaisystems.com> >>>> wrote: >>>> >>>> kk - I'll put something together and get back to you with more :-) >>>> >>>> DAVID HOLIDAY >>>> Software Engineer >>>> 760 607 3300 | Office >>>> 312 758 8385 | Mobile >>>> dav...@annaisystems.com <broo...@annaisystems.com> >>>> >>>> >>>> <GetFileAttachment.jpg> >>>> www.AnnaiSystems.com <http://www.annaisystems.com/> >>>> >>>> On Mar 19, 2015, at 10:59 AM, Irfan Ahmad <ir...@cloudphysics.com> >>>> wrote: >>>> >>>> Once you set up spark-notebook, it'll handle the submits for >>>> interactive work. Non-interactive work is not handled by it; for that, >>>> spark-kernel could be used. >>>> >>>> Give it a shot ... it only takes 5 minutes to get it running in >>>> local mode. >>>> >>>> >>>> *Irfan Ahmad* >>>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/> >>>> Best of VMworld Finalist >>>> Best Cloud Management Award >>>> NetworkWorld 10 Startups to Watch >>>> EMA Most Notable Vendor >>>> >>>> On Thu, Mar 19, 2015 at 9:51 AM, David Holiday <dav...@annaisystems.com> >>>> wrote: >>>> >>>>> hi all - thx for the alacritous replies! so regarding how to get things >>>>> from notebook to spark and back, am I correct that spark-submit is the way >>>>> to go? >>>>> >>>>> DAVID HOLIDAY >>>>> Software Engineer >>>>> 760 607 3300 | Office >>>>> 312 758 8385 | Mobile >>>>> dav...@annaisystems.com <broo...@annaisystems.com> >>>>> >>>>> >>>>> <GetFileAttachment.jpg> >>>>> www.AnnaiSystems.com <http://www.annaisystems.com/> >>>>> >>>>> On Mar 19, 2015, at 1:14 AM, Paolo Platter <paolo.plat...@agilelab.it> >>>>> wrote: >>>>> >>>>> Yes, I would suggest spark-notebook too.
>>>>> It's very simple to set up and it's growing pretty fast. >>>>> >>>>> Paolo >>>>> >>>>> Sent from my Windows Phone >>>>> ------------------------------ >>>>> From: Irfan Ahmad <ir...@cloudphysics.com> >>>>> Sent: 19/03/2015 04:05 >>>>> To: davidh <dav...@annaisystems.com> >>>>> Cc: user@spark.apache.org >>>>> Subject: Re: iPython Notebook + Spark + Accumulo -- best practice? >>>>> >>>>> I forgot to mention that there is also Zeppelin and jove-notebook but >>>>> I haven't got any experience with those yet. >>>>> >>>>> >>>>> *Irfan Ahmad* >>>>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/> >>>>> Best of VMworld Finalist >>>>> Best Cloud Management Award >>>>> NetworkWorld 10 Startups to Watch >>>>> EMA Most Notable Vendor >>>>> >>>>> On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad <ir...@cloudphysics.com> >>>>> wrote: >>>>> >>>>>> Hi David, >>>>>> >>>>>> W00t indeed and great questions. On the notebook front, there are >>>>>> two options depending on what you are looking for. You can either go with >>>>>> iPython 3 with Spark-kernel as a backend or you can use spark-notebook. >>>>>> Both have interesting tradeoffs. >>>>>> >>>>>> If you are looking for a single notebook platform for your data >>>>>> scientists that has R and Python as well as a Spark Shell, you'll likely >>>>>> want >>>>>> to go with iPython + Spark-kernel. Downsides with the spark-kernel >>>>>> project >>>>>> are that data visualization isn't quite there yet, and it's early days for >>>>>> documentation and blogs/etc. Upside is that R and Python work beautifully >>>>>> and that the ipython committers are super-helpful. >>>>>> >>>>>> If you are OK with a primarily spark/scala experience, then I >>>>>> suggest you go with spark-notebook. Upsides are that the project is a little >>>>>> further along, visualization support is better than spark-kernel's (though >>>>>> not as good as iPython with Python) and the committer is awesome with >>>>>> help. >>>>>> Downside is that you won't get R and Python. >>>>>> >>>>>> FWIW: I'm using both at the moment! >>>>>> >>>>>> Hope that helps. >>>>>> >>>>>> >>>>>> *Irfan Ahmad* >>>>>> CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com/> >>>>>> Best of VMworld Finalist >>>>>> Best Cloud Management Award >>>>>> NetworkWorld 10 Startups to Watch >>>>>> EMA Most Notable Vendor >>>>>> >>>>>> On Wed, Mar 18, 2015 at 5:45 PM, davidh <dav...@annaisystems.com> >>>>>> wrote: >>>>>> >>>>>>> hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and >>>>>>> scanning through this archive with only moderate success. in other >>>>>>> words -- >>>>>>> my way of saying sorry if this is answered somewhere obvious and I >>>>>>> missed it >>>>>>> :-) >>>>>>> >>>>>>> i've been tasked with figuring out how to connect Notebook, Spark, and >>>>>>> Accumulo together. The end user will do her work via notebook. thus >>>>>>> far, >>>>>>> I've successfully set up a Vagrant image containing Spark, Accumulo, >>>>>>> and >>>>>>> Hadoop. I was able to use some of the Accumulo example code to create >>>>>>> a >>>>>>> table populated with data, and a simple program in scala that, >>>>>>> when fired >>>>>>> off to Spark via spark-submit, connects to accumulo and prints the >>>>>>> first ten >>>>>>> rows of data in the table. so w00t on that - but now I'm left with >>>>>>> more >>>>>>> questions: >>>>>>> >>>>>>> 1) I'm still stuck on what's considered 'best practice' in terms of >>>>>>> hooking >>>>>>> all this together.
Let's say Sally, a user, wants to do some >>>>>>> analytic work >>>>>>> on her data. She pecks the appropriate commands into notebook and >>>>>>> fires them >>>>>>> off. how does this get wired together on the back end? Do I, from >>>>>>> notebook, >>>>>>> use spark-submit to send a job to spark and let spark worry about >>>>>>> hooking >>>>>>> into accumulo or is it preferable to create some kind of open stream >>>>>>> between >>>>>>> the two? >>>>>>> >>>>>>> 2) if I want to extend spark's api, do I need to first submit an >>>>>>> endless job >>>>>>> via spark-submit that does something like what this gentleman >>>>>>> describes >>>>>>> <http://blog.madhukaraphatak.com/extending-spark-api> ? is there an >>>>>>> alternative (other than refactoring spark's source) that doesn't >>>>>>> involve >>>>>>> extending the api via a job submission? >>>>>>> >>>>>>> ultimately what I'm looking for is help locating docs, blogs, etc. that >>>>>>> may shed >>>>>>> some light on this. >>>>>>> >>>>>>> t/y in advance! >>>>>>> >>>>>>> d >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> View this message in context: >>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/iPython-Notebook-Spark-Accumulo-best-practice-tp22137.html >>>>>>> Sent from the Apache Spark User List mailing list archive at >>>>>>> Nabble.com <http://nabble.com/>. >>>>>>> >>>>>>> --------------------------------------------------------------------- >>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>>>>> For additional commands, e-mail: user-h...@spark.apache.org >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>> >>>> >>>> >>> >>> >> >> >>
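One last note on question 2) above, since it's buried at the bottom of the thread: I don't think you need a long-running job just to add operations. If I remember right, the post David links boils down to enriching RDDs via implicit conversions, and that works the same whether the code runs from a notebook cell, the shell, or a spark-submit'ed jar. A rough sketch with hypothetical names (the countDistinctRows method and the (rowId, value) tuple shape are just for illustration):

import org.apache.spark.rdd.RDD

// Adds a custom operation to any RDD of (rowId, value) pairs via an implicit
// class - no changes to Spark's source and no special job submission needed.
object AccumuloRddFunctions {
  implicit class RichKvRdd(rdd: RDD[(String, String)]) {
    def countDistinctRows(): Long = rdd.map(_._1).distinct().count()
  }
}

// usage, e.g. from a notebook cell:
//   import AccumuloRddFunctions._
//   val n = myKvRdd.countDistinctRows()

The only requirement is that the object holding the implicits is on the classpath of whatever is driving the job - the notebook, spark-shell, or your packaged jar.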