Hmmm.... this seems very accumulo-specific, doesn't it? Not sure how to help with that.
*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics* <http://www.cloudphysics.com>
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Tue, Mar 24, 2015 at 4:09 PM, David Holiday <dav...@annaisystems.com> wrote:

> hi all,
>
> got a vagrant image with spark notebook, spark, accumulo, and hadoop all
> running. from notebook I can manually create a scanner and pull test data
> from a table I created using one of the accumulo examples:
>
> val instanceNameS = "accumulo"
> val zooServersS = "localhost:2181"
> val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
> val connector: Connector = instance.getConnector("root", new PasswordToken("password"))
> val auths = new Authorizations("exampleVis")
> val scanner = connector.createScanner("batchtest1", auths)
>
> scanner.setRange(new Range("row_0000000000", "row_0000000010"))
>
> for (entry: Entry[Key, Value] <- scanner) {
>   println(entry.getKey + " is " + entry.getValue)
> }
>
> will give the first ten rows of table data. when I try to create the RDD
> thusly:
>
> val rdd2 = sparkContext.newAPIHadoopRDD(
>   new Configuration(),
>   classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
>   classOf[org.apache.accumulo.core.data.Key],
>   classOf[org.apache.accumulo.core.data.Value]
> )
>
> I get an RDD returned to me that I can't do much with due to the following
> error:
>
> java.io.IOException: Input info has not been set.
>   at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
>   at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
>   at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
>   at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:927)
>
> which totally makes sense, given that I haven't specified any parameters
> for which table to connect to, what the auths are, etc.
>
> so my question is: what do I need to do from here to get those first ten
> rows of table data into my RDD?
>
> DAVID HOLIDAY
> Software Engineer
> 760 607 3300 | Office
> 312 758 8385 | Mobile
> dav...@annaisystems.com
> www.AnnaiSystems.com
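"Input info has not been set" is AccumuloInputFormat complaining that the fresh Configuration handed to newAPIHadoopRDD never had the connector, instance, table, or scan authorizations set on it. A minimal sketch of the missing wiring, assuming the Accumulo 1.6 mapreduce API and reusing the instance name, table, and credentials from the scanner example above (verify the exact setter signatures against the javadoc for the version on the Vagrant image):

import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job

// The Job object serves only as a carrier for the input-format settings.
val job = Job.getInstance()
AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
AbstractInputFormat.setZooKeeperInstance(job,
  ClientConfiguration.loadDefault().withInstance("accumulo").withZkHosts("localhost:2181"))
AbstractInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))
InputFormatBase.setInputTableName(job, "batchtest1")

// Hand the populated configuration (not a fresh one) to Spark.
val rdd2 = sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[AccumuloInputFormat],
  classOf[Key],
  classOf[Value])

With that in place, rdd2.count() should get past the validateOptions check.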
> On Mar 19, 2015, at 11:25 AM, David Holiday <dav...@annaisystems.com> wrote:
>
> kk - I'll put something together and get back to you with more :-)
>
> On Mar 19, 2015, at 10:59 AM, Irfan Ahmad <ir...@cloudphysics.com> wrote:
>
> Once you set up spark-notebook, it'll handle the submits for interactive
> work. Non-interactive work isn't handled by it; for that, spark-kernel
> could be used.
>
> Give it a shot ... it only takes 5 minutes to get it running in local-mode.
>
> On Thu, Mar 19, 2015 at 9:51 AM, David Holiday <dav...@annaisystems.com> wrote:
>
>> hi all - thx for the alacritous replies! so regarding how to get things
>> from notebook to spark and back, am I correct that spark-submit is the
>> way to go?
>>
>> On Mar 19, 2015, at 1:14 AM, Paolo Platter <paolo.plat...@agilelab.it> wrote:
>>
>> Yes, I would suggest spark-notebook too.
>> It's very simple to set up and it's growing pretty fast.
>>
>> Paolo
>>
>> Sent from my Windows Phone
>> ------------------------------
>> From: Irfan Ahmad <ir...@cloudphysics.com>
>> Sent: 19/03/2015 04:05
>> To: davidh <dav...@annaisystems.com>
>> Cc: user@spark.apache.org
>> Subject: Re: iPython Notebook + Spark + Accumulo -- best practice?
>>
>> I forgot to mention that there are also Zeppelin and jove-notebook, but
>> I haven't got any experience with those yet.
>>
>> On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad <ir...@cloudphysics.com> wrote:
>>
>>> Hi David,
>>>
>>> W00t indeed, and great questions. On the notebook front, there are two
>>> options depending on what you are looking for. You can either go with
>>> iPython 3 with spark-kernel as a backend, or you can use spark-notebook.
>>> Both have interesting tradeoffs.
>>>
>>> If you're looking for a single notebook platform for your data
>>> scientists that has R and Python as well as a Spark shell, you'll likely
>>> want to go with iPython + spark-kernel. Downsides of the spark-kernel
>>> project are that data visualization isn't quite there yet and it's early
>>> days for documentation, blogs, etc. The upside is that R and Python work
>>> beautifully and the iPython committers are super-helpful.
>>>
>>> If you are OK with a primarily Spark/Scala experience, then I suggest
>>> you go with spark-notebook. Upsides are that the project is a little
>>> further along, visualization support is better than spark-kernel's
>>> (though not as good as iPython with Python), and the committer is
>>> awesome with help. The downside is that you won't get R and Python.
>>>
>>> FWIW: I'm using both at the moment!
>>>
>>> Hope that helps.
>>> On Wed, Mar 18, 2015 at 5:45 PM, davidh <dav...@annaisystems.com> wrote:
>>>
>>>> hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and
>>>> scanning through this archive with only moderate success. in other
>>>> words -- my way of saying sorry if this is answered somewhere obvious
>>>> and I missed it :-)
>>>>
>>>> i've been tasked with figuring out how to connect Notebook, Spark, and
>>>> Accumulo together. The end user will do her work via notebook. thus
>>>> far, I've successfully set up a Vagrant image containing Spark,
>>>> Accumulo, and Hadoop. I was able to use some of the Accumulo example
>>>> code to create a table populated with data, and to create a simple
>>>> program in scala that, when fired off to Spark via spark-submit,
>>>> connects to accumulo and prints the first ten rows of data in the
>>>> table. so w00t on that - but now I'm left with more questions:
>>>>
>>>> 1) I'm still stuck on what's considered 'best practice' in terms of
>>>> hooking all this together. Let's say Sally, a user, wants to do some
>>>> analytic work on her data. She pecks the appropriate commands into
>>>> notebook and fires them off. how does this get wired together on the
>>>> back end? Do I, from notebook, use spark-submit to send a job to spark
>>>> and let spark worry about hooking into accumulo, or is it preferable
>>>> to create some kind of open stream between the two?
>>>>
>>>> 2) if I want to extend spark's api, do I need to first submit an
>>>> endless job via spark-submit that does something like what this
>>>> gentleman describes <http://blog.madhukaraphatak.com/extending-spark-api>?
>>>> is there an alternative (other than refactoring spark's source) that
>>>> doesn't involve extending the api via a job submission? (see the
>>>> sketch after this message)
>>>>
>>>> ultimately, what I'm looking for is help locating docs, blogs, etc
>>>> that may shed some light on this.
>>>>
>>>> t/y in advance!
>>>>
>>>> d
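On question 2: the approach in the linked post is ordinary Scala implicit enrichment, so it needs neither an endless job nor changes to Spark's source; it compiles into your own jar (or can be pasted straight into a notebook cell). A minimal sketch, assuming Scala 2.10+, with a hypothetical dropEmpty operation invented purely for illustration:

import org.apache.spark.rdd.RDD

object RDDExtensions {
  // Importing RDDExtensions._ makes dropEmpty available on any
  // RDD[String] at compile time; no job submission is involved.
  implicit class RichStringRDD(val rdd: RDD[String]) extends AnyVal {
    def dropEmpty(): RDD[String] = rdd.filter(_.trim.nonEmpty)
  }
}

// Usage from a notebook cell or any driver program:
//   import RDDExtensions._
//   val nonBlank = sc.textFile("some.txt").dropEmpty().count()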