Re: z.load() Must be used before SparkInterpreter (%spark) initialized?

2017-07-26 Thread Rick Moritz
Please allow me to offer an opinion on that subject:

To me, there are two options: either you run the Spark interpreter in
isolated mode, or you have dedicated Spark interpreter groups per
organizational unit, so you can manage dependencies independently.
Obviously, there's no way around restarting the interpreter when you need
to make the classloader aware of additional jars, never mind distributing
those jars across the cluster without calling spark-submit. Since an
interpreter represents an actual running JVM, you need to treat it as
such. I assume that is also the reason why z.load has been superseded by
dependency configuration in the interpreter settings.
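
For reference, the %dep route still works when it runs before the Spark
interpreter starts, i.e. as the very first paragraph of the note (the
artifact coordinates below are placeholders):

    %dep
    // must run before any %spark paragraph, otherwise Zeppelin raises
    // "Must be used before SparkInterpreter (%spark) initialized"
    z.reset()
    z.load("org.example:example-lib:1.0.0")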

A good way to manage dependencies is to collate all the dependencies per
unit into a fat-jar, and manage that via an external build. That way you
get testable dependencies and a curated experience where everything just
works -- as long as someone puts the effort in. With a collaborative tool,
that's better than everyone adding their favorite lib and causing each
interpreter start to pull in half the Internet in transitive dependencies,
with potential conflicts to boot. Zeppelin will be sluggish if every
interpreter start begins by uploading a gigabyte of dependencies into the
cluster.
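
As an illustration, a minimal sbt-assembly build for such a curated
per-unit fat-jar could look like the following (names and versions are
placeholders; sbt-assembly is just one of several tools that can do this):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

    // build.sbt -- one curated, pinned dependency set per organizational unit
    name := "unit-deps"
    scalaVersion := "2.11.12"
    libraryDependencies ++= Seq(
      // pin exact versions so every notebook sees the same classpath
      "joda-time" % "joda-time" % "2.10.14"
    )
    assembly / assemblyJarName := "unit-deps-fat.jar"

The resulting jar can then be registered once in the interpreter settings,
instead of each user pulling artifacts ad hoc.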

In an ad hoc, almost-single-user environment, you can work well with
Zeppelin's built-in dependency management, but I don't really see it
scaling to the enterprise level -- and I don't think it should either.
There's no point in investing resources into something that external tools
can already easily provide.

I wouldn't deploy Zeppelin as enterprise infrastructure either -- deploy
one Zeppelin per project, and manage segregation there via separate
interpreters. This also helps with finer resource management.

I hope this helps your understanding, as well as giving you some pointers
on how to manage Zeppelin in such a way that there are fewer conflicts
between users.

On Wed, Jul 26, 2017 at 2:30 PM, Davidson, Jonathan <
jonathan.david...@optum.com> wrote:

> We’ve also found it frustrating that extra jars can't be loaded without
> restarting the interpreter. Is the best way to mitigate this to run in
> isolated mode (per note or per user), so that other users are less
> affected? Is there any development in progress to allow loading without a
> restart?
>
>
>
> Thanks!
>
>
>
> *From:* Jeff Zhang [mailto:zjf...@gmail.com]
> *Sent:* Tuesday, July 25, 2017 8:31 PM
> *To:* Users 
> *Subject:* Re: z.load() Must be used before SparkInterpreter (%spark)
> initialized?
>
>
>
>
>
> You don't need to restart Zeppelin; you just need to restart the Spark
> interpreter.
>
>
>
>
>
> Richard Xin wrote on Wed, Jul 26, 2017 at 12:53 AM:
>
> I used %dep
>
> z.load("path/to/jar")
>
> I got following error:
>
> Must be used before SparkInterpreter (%spark) initialized
>
> Hint: put this paragraph before any Spark code and restart
> Zeppelin/Interpreter
>
>
>
> Restarting Zeppelin did make it work; it seems to be expected behavior,
> but I don't understand the reason behind it. If I have to restart
> Zeppelin every time before I can dynamically add an external jar, then
> this feature is useless to most people.
>
>
>
> Richard Xin


Re: Can't download moderately large data or number of rows to csv

2017-05-03 Thread Rick Moritz
I think whether this is an issue depends a lot on how you use Zeppelin,
and what tools you need to integrate with. Sadly, Excel is still around as
a data processing tool, and many people I introduce to Zeppelin are quite
proficient with it, hence the desire to export to CSV in a trivial manner
-- or the mere presence of the "download CSV" button leads them to expect
it to work for reasonably sized data (i.e. up to around 10^6 rows).

I do prefer Ruslan's idea, but I think Zeppelin should include something
similar out of the box. The key requirement should be that the data
doesn't have to travel through the notebook interface, but rather is made
available in a temporary folder and then served via a download link. The
downside to this approach is that ideally you'd want this kind of
operation to be interpreter-agnostic; in that case every interpreter would
need to offer an interface for collecting the data into a temporary folder
local to Zeppelin.
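
For the Spark case in particular, here is a sketch of that collection step
as it can be done from a notebook today -- assuming Spark 2.x with its
built-in CSV writer, that df is the DataFrame behind the table, and that
the output path is visible to both Spark and the Zeppelin host (e.g. local
mode or a shared mount):

    // %spark paragraph: materialize the FULL result set, not just the
    // rows rendered in the notebook, into a folder Zeppelin could serve
    val out = "/tmp/zeppelin-exports/report-" + System.currentTimeMillis
    df.coalesce(1)                  // single CSV part file
      .write
      .option("header", "true")
      .csv(out)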

Nonetheless, to turn Zeppelin into the serve-it-all solution that it could
be, I do believe that "fixing" the csv-export is important. I'd definitely
vote for a Jira advancing this issue.

On Tue, May 2, 2017 at 9:33 PM, Kevin Niemann wrote:

> We came across this issue as well, Zeppelin csv export is using the data
> URI scheme which is base64 encoding all the rows into a single string,
> Chrome seems to crash with over a few thousand rows, but Firefox has been
> able to handle over 100k for me. However, the Zeppelin notebook itself
> becomes slow at that point. I would also like better support for the
> ability to export a large set of rows; perhaps another tool is better
> suited?
>
> On Tue, May 2, 2017 at 10:00 AM, Ruslan Dautkhanov wrote:
>
>> Good idea to introduce in Zeppelin a way to download full datasets
>> without actually visualizing them.
>>
>> Not sure if this helps, but we taught our users to use
>> %sh hadoop fs -getmerge /hadoop/path/dir/ /some/nfs/mount/
>> for large files (they sometimes have to download datasets with millions
>> of records). They run Zeppelin on edge nodes that have NFS mounts to a
>> drop zone.
>>
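>> For those who'd rather stay in Scala, a rough equivalent of that
>> getmerge step from a %spark paragraph could be (paths are illustrative;
>> FileUtil.copyMerge is the Hadoop 2.x API and was removed in Hadoop 3):
>>
>>   import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
>>
>>   val conf  = sc.hadoopConfiguration
>>   val hdfs  = FileSystem.get(conf)
>>   val local = FileSystem.getLocal(conf)
>>   // merge all part files into a single local file on the edge node
>>   FileUtil.copyMerge(hdfs, new Path("/hadoop/path/dir"),
>>     local, new Path("/some/nfs/mount/dataset.csv"),
>>     false, conf, null)
>>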
>> P.S. Hue has a limit too, by default 100k rows:
>> https://github.com/cloudera/hue/blob/release-3.12.0/desktop/conf.dist/hue.ini#L905
>> Not sure how much it scales up.
>>
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>> On Tue, May 2, 2017 at 10:41 AM, Paul Brenner wrote:
>>
>>> There are limits to how much data the download to csv button will
>>> download (1.5MB? 3500 rows?) which limit zeppelin’s usefulness for our BI
>>> teams. This limit comes up far before we run into issues with showing too
>>> many rows of data in zeppelin.
>>>
>>> Unfortunately (fortunately?) Hue is the other tool the BI team has been
>>> using and there they have no problem downloading much larger datasets to
>>> csv. This is definitely not a requirement I’ve ever run into in the way I
>>> use zeppelin since I would just use spark to write the data out. However,
>>> the BI team is not allowed to run spark jobs (they use hive via jdbc) so
>>> that download to csv button is pretty important to them.
>>>
>>> Would it be possible to significantly increase the limit? Even better
>>> would it be possible to download more data than is shown? I assume this is
>>> the type of thing I would need to open a ticket for, but I wanted to ask
>>> here first.
>>>
>>> Paul Brenner
>>> DATA SCIENTIST
>>> (217) 390-3033

Re: multiple instances of the same interpreter type

2017-04-07 Thread Rick Moritz
Which version of Zeppelin are you running?
There was a bug where killing Zeppelin wouldn't actually kill the
interpreters, because the kill signal wasn't passed through, but that was
fixed in 0.7.1: https://issues.apache.org/jira/browse/ZEPPELIN-2258

On Fri, Apr 7, 2017 at 5:31 PM, Ruslan Dautkhanov wrote:

> We have each user running their own Zeppelin instances,
> so everyone has Spark interpreter group defined as
>
>   "option": {
> ..
> "perNote": "shared",
> "perUser": "shared",
> ..
>   }
>
> which translates to "interpreter will be instantiated Globally in shared
> process."
>
>
>
> --
> Ruslan Dautkhanov
>
> On Thu, Apr 6, 2017 at 6:34 PM, Jeff Zhang wrote:
>
>>
>> What mode do you use?
>>
>>
>>
>> Ruslan Dautkhanov wrote on Fri, Apr 7, 2017 at 12:49 AM:
>>
>>> A user managed somehow to launch multiple instances of spark interpreter
>>> under the same Zeppelin server.
>>>
>>> See a snippet of `pstree` output:
>>>
>>>   |-java,6360,wabramov -Dfile.encoding=UTF-8 -Xms1024m -Xmx2048m
>>>   |     -XX:MaxPermSize=512m -Dlog4j.configuration=file:///home/wabramov/
>>>   |   |-interpreter.sh,4510 /opt/zeppelin/zeppelin-active/bin/interpreter.sh
>>>   |   |     -d /opt/zeppelin/zeppelin-active/interpreter/spark -p 45986 -l /opt/zeppe
>>>   |   |   `-interpreter.sh,4523 /opt/zeppelin/zeppelin-active/bin/interpreter.sh
>>>   |   |       -d /opt/zeppelin/zeppelin-active/interpreter/spark -p 45986 -l /opt/zeppe
>>>   |   |     `-java,4524 -cp /etc/hive/conf/:/opt/zeppelin/zeppelin-active/
>>>   |   |         interpreter/spark/*:/opt/zeppelin/zeppelin-active/zeppeli
>>>   |   |-interpreter.sh,5097 /opt/zeppelin/zeppelin-active/bin/interpreter.sh
>>>   |   |     -d /opt/zeppelin/zeppelin-active/interpreter/spark -p 39752 -l /opt/zeppe
>>>   |   |   `-interpreter.sh,5110 /opt/zeppelin/zeppelin-active/bin/interpreter.sh
>>>   |   |       -d /opt/zeppelin/zeppelin-active/interpreter/spark -p 39752 -l /opt/zeppe
>>>   |   |     `-java,5111 -cp /etc/hive/conf/:/opt/zeppelin/zeppelin-active/
>>>   |   |         interpreter/spark/*:/opt/zeppelin/zeppelin-active/zeppeli
>>>
>>>
>>> I see another user has three (3) instances running of %sh interpreter.
>>>
>>> Is this a known issue?
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>
>


Re: Other paragraphs do not wait for %sh paragraphs to finish.

2017-04-06 Thread Rick Moritz
This actually calls for a dependency definition for the paragraphs within
a notebook, so the scheduler can decide which tasks to run simultaneously.
I suggest a simple counter of dependency levels, which by default
increases with every new paragraph and can be decremented to allow
paragraphs to run simultaneously. Run-all then submits each level to that
level's target interpreters, awaits termination of all its paragraphs, and
then starts the next level.
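
A minimal sketch of that level-based run-all, assuming a hypothetical
Paragraph type with a user-assignable level (none of these names come from
Zeppelin's actual scheduler API):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    case class Paragraph(id: String, level: Int, run: () => Unit)

    // Run each dependency level in parallel, but wait for the whole
    // level to finish before submitting the next one.
    def runAll(paragraphs: Seq[Paragraph]): Unit =
      paragraphs.groupBy(_.level).toSeq.sortBy(_._1).foreach {
        case (_, level) =>
          Await.result(Future.sequence(level.map(p => Future(p.run()))),
            Duration.Inf)
      }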


On Thu, Apr 6, 2017 at 12:57 AM, moon soo Lee  wrote:

> Hi,
>
> That's expected behavior at the moment. The reason is that each
> interpreter has its own scheduler (either FIFO or parallel), and run-all
> just submits all paragraphs into the target interpreter's scheduler.
>
> I think we could add a feature such as run-all-sequentially.
> Would you mind filing a JIRA issue?
>
> Thanks,
> moon
>
> On Thu, Apr 6, 2017 at 5:35 AM  wrote:
>
>> I often have notebooks that have a %sh as the 1st paragraph. This scps
>> some file from another server, and then a number of spark or sparksql
>> paragraphs follow it.
>>
>> If I click run-all paragraphs at the top of the notebook, the 1st %sh
>> paragraph kicks off as expected, but the 2nd (%spark) paragraph starts
>> at the same time. The others go into pending state and then start once
>> the spark one has completed.
>>
>> Is this a bug? Or am I doing something wrong?
>>
>> Thanks
>>
>>


Re: why not provide 'test' function in the interpreter

2017-02-06 Thread Rick Moritz
Having gone through configuring Spark 1.6 for Zeppelin 0.6.2 without being
able to use the installer, and using "provided" Spark and Hadoop, I do
understand the appeal of a test functionality for an interpreter.

The challenge of scoping the test functionality is evident, but I think not
insurmountable.

In particular with the Spark interpreter, I mostly wanted to know whether
I would be able to instantiate a SparkContext at all. That kind of test is
probably applicable to all interpreter types, be it a connection test for
JDBC or some other basic functionality check. This should be part of the
interpreter API and - perhaps most importantly - be triggered right when
you restart an interpreter. The current procedure of switching back and
forth between a failing notebook and the interpreter settings is quite
clunky.

In the end, each interpreter should implement its own testing logic and
focus on core functionality. As an example for Spark: provide a
SparkContext, and a HiveContext if requested/enabled. This checks whether
we can fit the interpreter into YARN, and verifies the basic
classpath/dependency requirements.
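
To make that concrete, here is a rough sketch of such a hook -- all names
are hypothetical and not part of Zeppelin's actual Interpreter interface:

    // Hypothetical self-test hook for the interpreter API.
    trait InterpreterSelfTest {
      /** Right(()) on success, Left(message) describing the failure. */
      def selfTest(): Either[String, Unit]
    }

    class SparkInterpreterSelfTest(newContext: () => org.apache.spark.SparkContext)
        extends InterpreterSelfTest {
      def selfTest(): Either[String, Unit] =
        try {
          val sc = newContext()           // fails fast if YARN can't satisfy the request
          sc.parallelize(1 to 10).count() // smoke-test a trivial job
          sc.stop()
          Right(())
        } catch {
          case e: Exception => Left(s"SparkContext test failed: ${e.getMessage}")
        }
    }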

For a start, that kind of functionality would be sufficient; further
requirements can be added later.

This would also make the per-notebook interpreter settings more concise,
since non-functioning interpreters could be hidden. A next step would be
per-user testing: for example, if a resource manager with quotas is used,
large Spark interpreters should not appear for users who can't allocate
that many resources.

Having diagnostic features inside Zeppelin should be part of the next push
to make it more end-user friendly, and could even be offered as a service
to a monitoring tool: for example, the Ambari page could show currently
failing interpreters as a warning/error.

Best regards,

Rick


On 7 Feb 2017 06:47, "Jeff Zhang" wrote:

It is hard to figure out what users want to test.
Do they want to test whether the interpreter works, or whether a changed
interpreter setting takes effect? That makes the test function hard to
implement.


Windy Qin wrote on Tue, Feb 7, 2017 at 1:22 PM:

> Hi,
>   Why not provide a 'test' function in the interpreter?
>   After I configure the interpreter, I want to test whether it is OK, and
> I can't find where to test it on the interpreter page.
>   How about adding a function to test the interpreter on the interpreter
> configuration page?
>