Re: Parallel Execution of Spark Jobs

2018-07-26 Thread Ankit Jain
Thanks for further clarification Jeff.

> On Jul 26, 2018, at 8:11 PM, Jeff Zhang  wrote:
> 
> Let me rephrase it.  In scoped mode, there are multiple Interpreter Groups 
> (personally I prefer to call them multiple sessions) in one JVM (for the Spark 
> interpreter, there are multiple SparkInterpreter instances). 
> And there is one SparkContext in this JVM which is shared by all the 
> SparkInterpreter instances. Regarding the Scheduler, there are multiple Schedulers 
> in scoped mode in this JVM; each SparkInterpreter instance owns its own 
> scheduler. Let me know if you have any other questions.
> 
> 
> 
> On Wed, Jul 25, 2018 at 10:27 PM, Ankit Jain wrote:
>> Jeff, what you said seems to be in conflict with what is detailed here - 
>> https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555
>> 
>> "In Scoped mode, Zeppelin still runs single interpreter JVM process but 
>> multiple Interpreter Group serve each Note."
>> 
>> In practice as well we see one Interpreter process for scoped mode.
>> 
>> Can you please clarify?
>> 
>> Adding Moon too.
>> 
>> Thanks
>> Ankit
>> 
>>> On Tue, Jul 24, 2018 at 11:09 PM, Ankit Jain  
>>> wrote:
>>> Aah that makes sense - so only all jobs from one user will block in 
>>> FIFOScheduler.
>>> 
>>> By moving to ParallelScheduler, only gain achieved is jobs from same user 
>>> can also be run in parallel but may have dependency resolution issues.
>>> 
>>> Just to confirm I have it right - If "Run all" notebook is not a 
>>> requirement and users run one paragraph at a time from different notebooks, 
>>> ParallelScheduler should be ok?
>>> 
>>> Thanks
>>> Ankit
>>> 
>>>> On Tue, Jul 24, 2018 at 10:38 PM, Jeff Zhang  wrote:
>>>> 
>>>> 1. Zeppelin-3563 force FAIR scheduling and just allow to specify the pool
>>>> 2. scheduler can not to figure out the dependencies between paragraphs. 
>>>> That's why SparkInterpreter use FIFOScheduler. 
>>>> If you use per user scoped mode. SparkContext is shared between users but 
>>>> SparkInterpreter is not shared. That means there's multiple 
>>>> SparkInterpreter instances that share the same SparkContext but they 
>>>> doesn't share the same FIFOScheduler, each SparkInterpreter use its own 
>>>> FIFOScheduler. 
>>>> 
>>>> On Wed, Jul 25, 2018 at 12:58 PM, Ankit Jain wrote:
>>>>> Thanks for the quick feedback Jeff.
>>>>> 
>>>>> Re:1 - I did see Zeppelin-3563 but we are not on .8 yet and also we may 
>>>>> want to force FAIR execution instead of letting user control it.
>>>>> 
>>>>> Re:2 - Is there an architecture issue here or we just need better thread 
>>>>> safety? Ideally scheduler should be able to figure out the dependencies 
>>>>> and run whatever can be parallel.
>>>>> 
>>>>> Re:Interpreter mode, I may not have been clear but we are running per 
>>>>> user scoped mode - so Spark context is shared among all users. 
>>>>> 
>>>>> Doesn't that mean all jobs from different users go to one FIFOScheduler 
>>>>> forcing all small jobs to block on a big one? That is specifically we are 
>>>>> trying to avoid.
>>>>> 
>>>>> Thanks
>>>>> Ankit
>>>>> 
>>>>>> On Tue, Jul 24, 2018 at 5:40 PM, Jeff Zhang  wrote:
>>>>>> Regarding 1.  ZEPPELIN-3563 should be helpful. See 
>>>>>> https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
>>>>>> for more details. 
>>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563
>>>>>> 
>>>>>> Regarding 2. If you use ParallelScheduler for SparkInterpreter, you may 
>>>>>> hit weird issues if your paragraph has dependency between each other. 
>>>>>> e.g. paragraph 1 will use variable v1 which is defined in paragraph p2. 
>>>>>> Then the order of paragraph execution matters here, and 
>>>>>> ParallelScheduler can not guarantee the order of execution.
>>>>>> That's why we use FIFOScheduler for SparkInterpreter. 
>>>>>> 
>>>>>> In your scenario where multiple users share the same sparkcontext, I 
>>>>>> would suggest you to use scoped per user mode. Then each user will share 
>>>>>> the same sparkcontext which means you can save resources, and also they are 
>>>>>> in each FIFOScheduler which is isolated from each other.

Re: Parallel Execution of Spark Jobs

2018-07-25 Thread Ankit Jain
Jeff, what you said seems to be in conflict with what is detailed here -
https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555

"In *Scoped* mode, Zeppelin still runs single interpreter JVM process but
multiple *Interpreter Group* serve each Note."

In practice as well we see one Interpreter process for scoped mode.

Can you please clarify?

Adding Moon too.

Thanks
Ankit

On Tue, Jul 24, 2018 at 11:09 PM, Ankit Jain 
wrote:

> Aah that makes sense - so only all jobs from one user will block in
> FIFOScheduler.
>
> By moving to ParallelScheduler, only gain achieved is jobs from same user
> can also be run in parallel but may have dependency resolution issues.
>
> Just to confirm I have it right - If "Run all" notebook is not a
> requirement and users run one paragraph at a time from different notebooks, 
> ParallelScheduler
> should be ok?
>
> Thanks
> Ankit
>
> On Tue, Jul 24, 2018 at 10:38 PM, Jeff Zhang  wrote:
>
>>
>> 1. Zeppelin-3563 force FAIR scheduling and just allow to specify the pool
>> 2. scheduler can not to figure out the dependencies between paragraphs.
>> That's why SparkInterpreter use FIFOScheduler.
>> If you use per user scoped mode. SparkContext is shared between users but
>> SparkInterpreter is not shared. That means there's multiple
>> SparkInterpreter instances that share the same SparkContext but they
>> doesn't share the same FIFOScheduler, each SparkInterpreter use its own
>> FIFOScheduler.
>>
>> On Wed, Jul 25, 2018 at 12:58 PM, Ankit Jain wrote:
>>
>>> Thanks for the quick feedback Jeff.
>>>
>>> Re:1 - I did see Zeppelin-3563 but we are not on .8 yet and also we may
>>> want to force FAIR execution instead of letting user control it.
>>>
>>> Re:2 - Is there an architecture issue here or we just need better thread
>>> safety? Ideally scheduler should be able to figure out the dependencies and
>>> run whatever can be parallel.
>>>
>>> Re:Interpreter mode, I may not have been clear but we are running per
>>> user scoped mode - so Spark context is shared among all users.
>>>
>>> Doesn't that mean all jobs from different users go to one FIFOScheduler
>>> forcing all small jobs to block on a big one? That is specifically we are
>>> trying to avoid.
>>>
>>> Thanks
>>> Ankit
>>>
>>> On Tue, Jul 24, 2018 at 5:40 PM, Jeff Zhang  wrote:
>>>
>>>> Regarding 1.  ZEPPELIN-3563 should be helpful. See
>>>> https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
>>>> for more details.
>>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563
>>>>
>>>> Regarding 2. If you use ParallelScheduler for SparkInterpreter, you may
>>>> hit weird issues if your paragraph has dependency between each other. e.g.
>>>> paragraph 1 will use variable v1 which is defined in paragraph p2. Then the
>>>> order of paragraph execution matters here, and ParallelScheduler can
>>>> not guarantee the order of execution.
>>>> That's why we use FIFOScheduler for SparkInterpreter.
>>>>
>>>> In your scenario where multiple users share the same sparkcontext, I
>>>> would suggest you to use scoped per user mode. Then each user will share
>>>> the same sparkcontext which means you can save resources, and also they are
>>>> in each FIFOScheduler which is isolated from each other.
>>>>
>>>> On Wed, Jul 25, 2018 at 8:14 AM, Ankit Jain wrote:
>>>>
>>>>> Forgot to mention this is for shared scoped mode, so same Spark
>>>>> application and context for all users on a single Zeppelin instance.
>>>>>
>>>>> Thanks
>>>>> Ankit
>>>>>
>>>>> On Jul 24, 2018, at 4:12 PM, Ankit Jain 
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>> I am playing around with execution policy of Spark jobs(and all
>>>>> Zeppelin paragraphs actually).
>>>>>
>>>>> Looks like there are couple of control points-
>>>>> 1) Spark scheduling - FIFO vs Fair as documented in
>>>>> https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
>>>>>
>>>>> Since we are still on .7 version and don't have
>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcefully
>>>>> doing sc.setLocalProperty("s

Re: Parallel Execution of Spark Jobs

2018-07-24 Thread Ankit Jain
Aah that makes sense - so only all jobs from one user will block in
FIFOScheduler.

By moving to ParallelScheduler, the only gain achieved is that jobs from the same
user can also run in parallel, but they may have dependency resolution issues.

Just to confirm I have it right - If "Run all" notebook is not a
requirement and users run one paragraph at a time from different
notebooks, ParallelScheduler
should be ok?

Thanks
Ankit

On Tue, Jul 24, 2018 at 10:38 PM, Jeff Zhang  wrote:

>
> 1. Zeppelin-3563 force FAIR scheduling and just allow to specify the pool
> 2. scheduler can not to figure out the dependencies between paragraphs.
> That's why SparkInterpreter use FIFOScheduler.
> If you use per user scoped mode. SparkContext is shared between users but
> SparkInterpreter is not shared. That means there's multiple
> SparkInterpreter instances that share the same SparkContext but they
> doesn't share the same FIFOScheduler, each SparkInterpreter use its own
> FIFOScheduler.
>
> On Wed, Jul 25, 2018 at 12:58 PM, Ankit Jain wrote:
>
>> Thanks for the quick feedback Jeff.
>>
>> Re:1 - I did see Zeppelin-3563 but we are not on .8 yet and also we may
>> want to force FAIR execution instead of letting user control it.
>>
>> Re:2 - Is there an architecture issue here or we just need better thread
>> safety? Ideally scheduler should be able to figure out the dependencies and
>> run whatever can be parallel.
>>
>> Re:Interpreter mode, I may not have been clear but we are running per
>> user scoped mode - so Spark context is shared among all users.
>>
>> Doesn't that mean all jobs from different users go to one FIFOScheduler
>> forcing all small jobs to block on a big one? That is specifically we are
>> trying to avoid.
>>
>> Thanks
>> Ankit
>>
>> On Tue, Jul 24, 2018 at 5:40 PM, Jeff Zhang  wrote:
>>
>>> Regarding 1.  ZEPPELIN-3563 should be helpful. See
>>> https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
>>> for more details.
>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563
>>>
>>> Regarding 2. If you use ParallelScheduler for SparkInterpreter, you may
>>> hit weird issues if your paragraph has dependency between each other. e.g.
>>> paragraph 1 will use variable v1 which is defined in paragraph p2. Then the
>>> order of paragraph execution matters here, and ParallelScheduler can
>>> not guarantee the order of execution.
>>> That's why we use FIFOScheduler for SparkInterpreter.
>>>
>>> In your scenario where multiple users share the same sparkcontext, I
>>> would suggest you to use scoped per user mode. Then each user will share
>>> the same sparkcontext which means you can save resources, and also they are
>>> in each FIFOScheduler which is isolated from each other.
>>>
>>> On Wed, Jul 25, 2018 at 8:14 AM, Ankit Jain wrote:
>>>
>>>> Forgot to mention this is for shared scoped mode, so same Spark
>>>> application and context for all users on a single Zeppelin instance.
>>>>
>>>> Thanks
>>>> Ankit
>>>>
>>>> On Jul 24, 2018, at 4:12 PM, Ankit Jain 
>>>> wrote:
>>>>
>>>> Hi,
>>>> I am playing around with execution policy of Spark jobs(and all
>>>> Zeppelin paragraphs actually).
>>>>
>>>> Looks like there are couple of control points-
>>>> 1) Spark scheduling - FIFO vs Fair as documented in
>>>> https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
>>>>
>>>> Since we are still on .7 version and don't have
>>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcefully doing
>>>> sc.setLocalProperty("spark.scheduler.pool", "fair");
>>>> in both SparkInterpreter.java and SparkSqlInterpreter.java.
>>>>
>>>> Also because we are exposing Zeppelin to multiple users we may not
>>>> actually want users to hog the cluster and always use FAIR.
>>>>
>>>> This may complicate our merge to .8 though.
>>>>
>>>> 2. On top of Spark scheduling, each Zeppelin Interpreter itself seems
>>>> to have a scheduler queue. Each task is submitted to a FIFOScheduler except
>>>> SparkSqlInterpreter which creates a ParallelScheduler if the concurrentsql flag
>>>> is turned on.
>>>>
>>>> I am changing SparkInterpreter.java to use ParallelScheduler too and
>>>> that seems to do the trick.
>>>>
>>>> Now multiple notebooks are able to run in parallel.
>>>>
>>>> My question is if other people have tested SparkInterpreter with 
>>>> ParallelScheduler?
>>>> Also ideally this should be configurable. The user should be able to specify fifo or
>>>> parallel.
>>>>
>>>> Executing all paragraphs does add more complication and maybe
>>>>
>>>> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep
>>>> the execution order sane.
>>>>
>>>>
>>>> Thoughts?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Ankit.
>>>>
>>>>
>>
>>
>> --
>> Thanks & Regards,
>> Ankit.
>>
>


-- 
Thanks & Regards,
Ankit.


Re: Parallel Execution of Spark Jobs

2018-07-24 Thread Ankit Jain
Thanks for the quick feedback Jeff.

Re:1 - I did see Zeppelin-3563 but we are not on .8 yet and also we may
want to force FAIR execution instead of letting user control it.

Re:2 - Is there an architecture issue here or we just need better thread
safety? Ideally scheduler should be able to figure out the dependencies and
run whatever can be parallel.

Re:Interpreter mode, I may not have been clear but we are running per user
scoped mode - so Spark context is shared among all users.

Doesn't that mean all jobs from different users go to one FIFOScheduler
forcing all small jobs to block on a big one? That is specifically we are
trying to avoid.

Thanks
Ankit

On Tue, Jul 24, 2018 at 5:40 PM, Jeff Zhang  wrote:

> Regarding 1.  ZEPPELIN-3563 should be helpful. See
> https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
> for more details.
> https://issues.apache.org/jira/browse/ZEPPELIN-3563
>
> Regarding 2. If you use ParallelScheduler for SparkInterpreter, you may
> hit weird issues if your paragraph has dependency between each other. e.g.
> paragraph 1 will use variable v1 which is defined in paragraph p2. Then the
> order of paragraph execution matters here, and ParallelScheduler can
> not guarantee the order of execution.
> That's why we use FIFOScheduler for SparkInterpreter.
>
> In your scenario where multiple users share the same sparkcontext, I would
> suggest you to use scoped per user mode. Then each user will share the same
> sparkcontext which means you can save resources, and also they are in each
> FIFOScheduler which is isolated from each other.
>
> On Wed, Jul 25, 2018 at 8:14 AM, Ankit Jain wrote:
>
>> Forgot to mention this is for shared scoped mode, so same Spark
>> application and context for all users on a single Zeppelin instance.
>>
>> Thanks
>> Ankit
>>
>> On Jul 24, 2018, at 4:12 PM, Ankit Jain  wrote:
>>
>> Hi,
>> I am playing around with execution policy of Spark jobs(and all Zeppelin
>> paragraphs actually).
>>
>> Looks like there are couple of control points-
>> 1) Spark scheduling - FIFO vs Fair as documented in
>> https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
>>
>> Since we are still on .7 version and don't have
>> https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcefully doing
>> sc.setLocalProperty("spark.scheduler.pool", "fair");
>> in both SparkInterpreter.java and SparkSqlInterpreter.java.
>>
>> Also because we are exposing Zeppelin to multiple users we may not
>> actually want users to hog the cluster and always use FAIR.
>>
>> This may complicate our merge to .8 though.
>>
>> 2. On top of Spark scheduling, each Zeppelin Interpreter itself seems to
>> have a scheduler queue. Each task is submitted to a FIFOScheduler except
>> SparkSqlInterpreter which creates a ParallelScheduler if the concurrentsql flag
>> is turned on.
>>
>> I am changing SparkInterpreter.java to use ParallelScheduler too and
>> that seems to do the trick.
>>
>> Now multiple notebooks are able to run in parallel.
>>
>> My question is if other people have tested SparkInterpreter with 
>> ParallelScheduler?
>> Also ideally this should be configurable. The user should be able to specify fifo or
>> parallel.
>>
>> Executing all paragraphs does add more complication and maybe
>>
>> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep
>> the execution order sane.
>>
>>
>> Thoughts?
>>
>> --
>> Thanks & Regards,
>> Ankit.
>>
>>


-- 
Thanks & Regards,
Ankit.


Re: Parallel Execution of Spark Jobs

2018-07-24 Thread Ankit Jain
Forgot to mention this is for shared scoped mode, so same Spark application and 
context for all users on a single Zeppelin instance.

Thanks
Ankit

> On Jul 24, 2018, at 4:12 PM, Ankit Jain  wrote:
> 
> Hi,
> I am playing around with execution policy of Spark jobs(and all Zeppelin 
> paragraphs actually).
> 
> Looks like there are couple of control points-
> 1) Spark scheduling - FIFO vs Fair as documented in 
> https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
> 
> Since we are still on .7 version and don't have 
> https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcefully doing 
> sc.setLocalProperty("spark.scheduler.pool", "fair");
> in both SparkInterpreter.java and SparkSqlInterpreter.java.
> 
> Also because we are exposing Zeppelin to multiple users we may not actually 
> want users to hog the cluster and always use FAIR.
> 
> This may complicate our merge to .8 though.
> 
> 2. On top of Spark scheduling, each Zeppelin Interpreter itself seems to have 
> a scheduler queue. Each task is submitted to a FIFOScheduler except 
> SparkSqlInterpreter which creates a ParallelScheduler if the concurrentsql flag 
> is turned on.
> 
> I am changing SparkInterpreter.java to use ParallelScheduler too and that 
> seems to do the trick.
> 
> Now multiple notebooks are able to run in parallel.
> 
> My question is if other people have tested SparkInterpreter with 
> ParallelScheduler? Also ideally this should be configurable. The user should be 
> able to specify fifo or parallel.
> 
> Executing all paragraphs does add more complication and maybe
> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep the 
> execution order sane.
> 
> Thoughts?
> 
> -- 
> Thanks & Regards,
> Ankit.


Parallel Execution of Spark Jobs

2018-07-24 Thread Ankit Jain
Hi,
I am playing around with execution policy of Spark jobs(and all Zeppelin
paragraphs actually).

Looks like there are couple of control points-
1) Spark scheduling - FIFO vs Fair as documented in
https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.

Since we are still on .7 version and don't have
https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcefully doing
sc.setLocalProperty("spark.scheduler.pool", "fair");
in both SparkInterpreter.java and SparkSqlInterpreter.java.
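
Roughly, a sketch of where that forced pool assignment sits - method and field names
are approximated from the 0.7 interpreter code, so treat this as illustrative rather
than the exact diff:

public InterpreterResult interpret(String line, InterpreterContext context) {
  // tag the thread running this paragraph so its Spark jobs land in the "fair" pool;
  // this assumes spark.scheduler.mode=FAIR and a "fair" pool defined in fairscheduler.xml
  sc.setLocalProperty("spark.scheduler.pool", "fair");
  return interpretInput(line, context);  // approximate delegate to the normal interpret path
}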

Also because we are exposing Zeppelin to multiple users we may not actually
want users to hog the cluster and always use FAIR.

This may complicate our merge to .8 though.

2. On top of Spark scheduling, each Zeppelin Interpreter itself seems to
have a scheduler queue. Each task is submitted to a FIFOScheduler except
SparkSqlInterpreter which creates a ParallelScheduler if the concurrentsql flag
is turned on.

I am changing SparkInterpreter.java to use ParallelScheduler too and that
seems to do the trick.

Now multiple notebooks are able to run in parallel.

My question is if other people have tested SparkInterpreter with
ParallelScheduler?
Also ideally this should be configurable. The user should be able to specify fifo
or parallel.
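
One possible shape for that, sketched against Zeppelin's SchedulerFactory; the
property name below is made up for illustration and is not an existing setting:

public Scheduler getScheduler() {
  // pick the queue implementation from an interpreter property instead of
  // hard-coding FIFO; "parallel" lets paragraphs of this interpreter run concurrently
  String mode = getProperty("zeppelin.spark.scheduler");  // hypothetical property
  String name = SparkInterpreter.class.getName() + this.hashCode();
  if ("parallel".equalsIgnoreCase(mode)) {
    return SchedulerFactory.singleton().createOrGetParallelScheduler(name, 10);
  }
  return SchedulerFactory.singleton().createOrGetFIFOScheduler(name);
}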

Executing all paragraphs does add more complication and maybe

https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep the
execution order sane.


Thoughts?

-- 
Thanks & Regards,
Ankit.


Re: Another silly question from me...

2018-05-30 Thread ankit jain
Add the library to interpreter settings?
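
For example, either add the artifact under the interpreter's Dependencies section
(e.g. org.apache.commons:commons-compress:1.18 - the version is only an example),
or load it from a %dep paragraph before the first %spark paragraph runs:

%dep
z.reset()
z.load("org.apache.commons:commons-compress:1.18")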

On Wed, May 30, 2018 at 11:07 AM, Michael Segel 
wrote:

> Hi,
>
> Ok… I wanted to include the Apache commons compress libraries for use in
> my spark/scala note.
>
> I know I can include it in the first note by telling the interpreter to
> load… but I did some checking…
>
> There’s a local repo.
> ./zeppelin/local-repo/ … that actually has two older jars for
> commons-compress-1.4.1.jar and commons-compress-1.9.jar
>
> Is it possible to place the jar here and would it automatically be on the
> Class Path?
>
> Thx
>
> -Mike
>
>
>
>


-- 
Thanks & Regards,
Ankit.


Re: Is it possible to run Zeppelin on cluster

2018-04-30 Thread ankit jain
Yeah, the LB will need to make sure requests for a notebook always go to the same
machine - then it will be like a single-machine deployment, with multiple users
using the notebook at the same time.

On Mon, Apr 30, 2018 at 3:19 PM, Ruslan Dautkhanov 
wrote:

> It probably would work for an active-passive Zeppelin cluster if all files
> are in a shared NFS directory for example, not as active-active.
> One simple example - they would keep overwriting configuration files /
> notebook files etc.
>
>
>
> --
> Ruslan Dautkhanov
>
> On Mon, Apr 30, 2018 at 4:04 PM, ankit jain 
> wrote:
>
>> You can probably deploy Zeppelin on n machines and manage behind a
>> LoadBalancer?
>>
>> Thanks
>> Ankit
>>
>> On Mon, Apr 30, 2018 at 6:42 AM, Michael Segel > > wrote:
>>
>>> Ok..
>>> The answer is no.
>>>
>>> You have a web interface. It runs on a web server.  Does the web server
>>> run in a cluster?
>>>
>>>
>>> On Apr 30, 2018, at 12:24 AM, Soheil Pourbafrani 
>>> wrote:
>>>
>>> Thanks, I meant the Zeppelin itself, not it's jobs.
>>>
>>> On Sun, Apr 29, 2018 at 11:51 PM, Michael Segel <
>>> msegel_had...@hotmail.com> wrote:
>>>
>>>> Yes if you mean to run the spark jobs on a cluster.
>>>>
>>>>
>>>> On Apr 29, 2018, at 7:25 AM, Soheil Pourbafrani 
>>>> wrote:
>>>>
>>>> I mean to configure Zeppelin in multimode.
>>>>
>>>> On Sun, Apr 29, 2018 at 4:49 PM, Soheil Pourbafrani <
>>>> soheil.i...@gmail.com> wrote:
>>>>
>>>>> Something like Kafka or Hadoop cluster?
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Ankit.
>>
>
>


-- 
Thanks & Regards,
Ankit.


Re: Is it possible to run Zeppelin on cluster

2018-04-30 Thread ankit jain
You can probably deploy Zeppelin on n machines and manage behind a
LoadBalancer?

Thanks
Ankit

On Mon, Apr 30, 2018 at 6:42 AM, Michael Segel 
wrote:

> Ok..
> The answer is no.
>
> You have a web interface. It runs on a web server.  Does the web server
> run in a cluster?
>
>
> On Apr 30, 2018, at 12:24 AM, Soheil Pourbafrani 
> wrote:
>
> Thanks, I meant the Zeppelin itself, not it's jobs.
>
> On Sun, Apr 29, 2018 at 11:51 PM, Michael Segel  > wrote:
>
>> Yes if you mean to run the spark jobs on a cluster.
>>
>>
>> On Apr 29, 2018, at 7:25 AM, Soheil Pourbafrani 
>> wrote:
>>
>> I mean to configure Zeppelin in multimode.
>>
>> On Sun, Apr 29, 2018 at 4:49 PM, Soheil Pourbafrani <
>> soheil.i...@gmail.com> wrote:
>>
>>> Something like Kafka or Hadoop cluster?
>>>
>>
>>
>>
>
>


-- 
Thanks & Regards,
Ankit.


Re: multiple users sharing single Spark context

2018-03-14 Thread ankit jain
We are seeing the same PENDING behavior despite running Spark Interpreter
in "Isolated per User" - we expected one SparkContext to be created per
user and indeed did see multiple SparkSubmit processes spun up on Zeppelin
pod.

But why go to PENDING if there are multiple contexts that can run in
parallel? Is the assumption of multiple SparkSubmit = multiple SparkContext
correct?
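
For context, the per-user pool idea discussed below (ZEPPELIN-3334) boils down to
tagging the submitting thread before the paragraph's Spark jobs are created. A
minimal sketch, assuming access to the shared SparkContext and to the authenticated
user from the paragraph's InterpreterContext (names are illustrative):

// run on the thread executing the user's paragraph, before any Spark action
String user = context.getAuthenticationInfo().getUser();  // authenticated Zeppelin user
sc.setLocalProperty("spark.scheduler.pool", user);         // per-user FAIR pool
// with spark.scheduler.mode=FAIR, jobs from different users then land in different
// pools instead of queuing behind one another in the default pool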

Thanks
Ankit

On Wed, Mar 14, 2018 at 4:12 PM, Ruslan Dautkhanov 
wrote:

> Looked at the code.. the only place Zeppelin handles spark.scheduler.pool
> is here -
>
> https://github.com/apache/zeppelin/blob/d762b5288536201d8a2964891c556efaa1bae867/spark/interpreter/src/main/java/org/apache/zeppelin/spark/SparkSqlInterpreter.java#L103
>
> I don't think it matches Spark documentation description that would allow
> multiple concurrent users to submit jobs independently.
> (each user's *thread* has to have different value for  *spark.scheduler.pool
> *)
>
> Filed https://issues.apache.org/jira/browse/ZEPPELIN-3334 to set
> *spark.scheduler.pool* to an authenticated user name.
>
> Other ideas?
>
>
>
>
> --
> Ruslan Dautkhanov
>
> On Wed, Mar 14, 2018 at 4:57 PM, Ruslan Dautkhanov 
> wrote:
>
>> Let's say we have a Spark interpreter set up as
>> " The interpreter will be instantiated *Globally *in *shared *process"
>>
>> When one user is using Spark interpreter,
>> another users that are trying to use the same interpreter,
>> getting PENDING until another user's code completes.
>>
>> Per Spark documentation, https://spark.apache.org/docs/latest/job-scheduling.html
>>
>> " *within* each Spark application, multiple “jobs” (Spark actions) may
>>> be running concurrently if they were submitted by different threads
>>> ... /skip/
>>> threads. By “job”, in this section, we mean a Spark action (e.g. save,
>>> collect) and any tasks that need to run to evaluate that action.
>>> Spark’s scheduler is fully thread-safe and supports this use case to enable
>>> applications that serve multiple requests (e.g. queries for multiple users).
>>> ... /skip/
>>> Without any intervention, newly submitted jobs go into a *default pool*,
>>> but jobs’ pools can be set by adding the *spark.scheduler.pool* “local
>>> property” to the SparkContext in the thread that’s submitting them."
>>
>>
>> So Spark allows multiple users to use the same shared spark context..
>>
>> Two quick questions:
>> 1. Why concurrent users are getting PENDING in Zeppelin?
>> 2. Does Zeppelin set *spark.scheduler.pool* accordingly as described
>> above?
>>
>> PS.
>> We have set following Spark interpreter settings:
>> - zeppelin.spark.concurrentSQL= true
>> - spark.scheduler.mode = FAIR
>>
>>
>> Thank you,
>> Ruslan Dautkhanov
>>
>>
>


-- 
Thanks & Regards,
Ankit.


Re: Zeppelin - Spark Driver location

2018-03-14 Thread ankit jain
Also spark standalone cluster mode should work even before this new
release, right?

On Wed, Mar 14, 2018 at 8:43 AM, ankit jain  wrote:

> Hi Jhang,
> Not clear on that - I thought spark-submit was done when we run a
> paragraph, how does the .sh file come into play?
>
> Thanks
> Ankit
>
> On Tue, Mar 13, 2018 at 5:43 PM, Jeff Zhang  wrote:
>
>>
>> spark-submit is called in bin/interpreter.sh,  I didn't try standalone
>> cluster mode. It is expected to run the driver on a separate host, but it is not
>> guaranteed that zeppelin supports this.
>>
>> On Wed, Mar 14, 2018 at 8:34 AM, Ankit Jain wrote:
>>
>>> Hi Jhang,
>>> What is the expected behavior with standalone cluster mode? Should we
>>> see separate driver processes in the cluster(one per user) or multiple
>>> SparkSubmit processes?
>>>
>>> I was trying to dig in Zeppelin code & didn’t see where Zeppelin does
>>> the Spark-submit to the cluster? Can you please point to it?
>>>
>>> Thanks
>>> Ankit
>>>
>>> On Mar 13, 2018, at 5:25 PM, Jeff Zhang  wrote:
>>>
>>>
>>> ZEPPELIN-2898 <https://issues.apache.org/jira/browse/ZEPPELIN-2898> is
>>> for yarn cluster mode.  And Zeppelin has an integration test for yarn mode,
>>> so it is guaranteed to work. But we don't have a test for standalone, so we are
>>> not sure about the behavior of standalone mode.
>>>
>>>
>>> On Wed, Mar 14, 2018 at 8:06 AM, Ruslan Dautkhanov wrote:
>>>
>>>> https://github.com/apache/zeppelin/pull/2577 pronounces yarn-cluster
>>>> in it's title so I assume it's only yarn-cluster.
>>>> Never used standalone-cluster myself.
>>>>
>>>> Which distro of Hadoop do you use?
>>>> Cloudera desupported standalone in CDH 5.5 and will remove in CDH 6.
>>>> https://www.cloudera.com/documentation/enterprise/release-notes/topics/rg_deprecated.html
>>>>
>>>>
>>>>
>>>> --
>>>> Ruslan Dautkhanov
>>>>
>>>> On Tue, Mar 13, 2018 at 5:45 PM, Jhon Anderson Cardenas Diaz <
>>>> jhonderson2...@gmail.com> wrote:
>>>>
>>>>> Does this new feature work only for yarn-cluster ?. Or for spark
>>>>> standalone too ?
>>>>>
>>>>> On Tue, Mar 13, 2018 at 6:34 PM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:
>>>>>
>>>> > Zeppelin version: 0.8.0 (merged at September 2017 version)
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-2898 was merged end
>>>>>> of September so not sure if you have that.
>>>>>>
>>>>>> Check out https://medium.com/@zjffdu/zeppelin-0-8-0-new-features-ea53e8810235 how to set this up.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ruslan Dautkhanov
>>>>>>
>>>>>> On Tue, Mar 13, 2018 at 5:24 PM, Jhon Anderson Cardenas Diaz <
>>>>>> jhonderson2...@gmail.com> wrote:
>>>>>>
>>>>> Hi zeppelin users !
>>>>>>>
>>>>>>> I am working with zeppelin pointing to a spark in standalone. I am
>>>>>>> trying to figure out a way to make zeppelin runs the spark driver 
>>>>>>> outside
>>>>>>> of client process that submits the application.
>>>>>>>
>>>>>>> According with the documentation
>>>>>>> (http://spark.apache.org/docs/2.1.1/spark-standalone.html):
>>>>>>>
>>>>>>> *For standalone clusters, Spark currently supports two deploy modes.
>>>>>>> In client mode, the driver is launched in the same process as the client
>>>>>>> that submits the application. In cluster mode, however, the driver is
>>>>>>> launched from one of the Worker processes inside the cluster, and the
>>>>>>> client process exits as soon as it fulfills its responsibility of
>>>>>>> submitting the application without waiting for the application to 
>>>>>>> finish.*
>>>>>>>
>>>>>>> The problem is that, even when I set the properties for
>>>>>>> spark-standalone cluster and deploy mode in cluster, the driver still 
>>>>>>> run
>>>>>>> inside zeppelin machine (according with spark UI/executors page). These 
>>>>>>> are
>>>>>>> properties that I am setting for the spark interpreter:
>>>>>>>
>>>>>>> master: spark://:7077
>>>>>>> spark.submit.deployMode: cluster
>>>>>>> spark.executor.memory: 16g
>>>>>>>
>>>>>>> Any ideas would be appreciated.
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> Details:
>>>>>>> Spark version: 2.1.1
>>>>>>> Zeppelin version: 0.8.0 (merged at September 2017 version)
>>>>>>>
>>>>>>
>
>
> --
> Thanks & Regards,
> Ankit.
>



-- 
Thanks & Regards,
Ankit.


Re: Zeppelin - Spark Driver location

2018-03-14 Thread ankit jain
Hi Jhang,
Not clear on that - I thought spark-submit was done when we run a
paragraph, how does the .sh file come into play?

Thanks
Ankit

On Tue, Mar 13, 2018 at 5:43 PM, Jeff Zhang  wrote:

>
> spark-submit is called in bin/interpreter.sh,  I didn't try standalone
> cluster mode. It is expected to run the driver on a separate host, but it is not
> guaranteed that zeppelin supports this.
>
> On Wed, Mar 14, 2018 at 8:34 AM, Ankit Jain wrote:
>
>> Hi Jhang,
>> What is the expected behavior with standalone cluster mode? Should we see
>> separate driver processes in the cluster(one per user) or multiple
>> SparkSubmit processes?
>>
>> I was trying to dig in Zeppelin code & didn’t see where Zeppelin does the
>> Spark-submit to the cluster? Can you please point to it?
>>
>> Thanks
>> Ankit
>>
>> On Mar 13, 2018, at 5:25 PM, Jeff Zhang  wrote:
>>
>>
>> ZEPPELIN-2898 <https://issues.apache.org/jira/browse/ZEPPELIN-2898> is
>> for yarn cluster mode.  And Zeppelin has an integration test for yarn mode,
>> so it is guaranteed to work. But we don't have a test for standalone, so we are
>> not sure about the behavior of standalone mode.
>>
>>
>> On Wed, Mar 14, 2018 at 8:06 AM, Ruslan Dautkhanov wrote:
>>
>>> https://github.com/apache/zeppelin/pull/2577 pronounces yarn-cluster in
>>> it's title so I assume it's only yarn-cluster.
>>> Never used standalone-cluster myself.
>>>
>>> Which distro of Hadoop do you use?
>>> Cloudera desupported standalone in CDH 5.5 and will remove in CDH 6.
>>> https://www.cloudera.com/documentation/enterprise/release-notes/topics/rg_deprecated.html
>>>
>>>
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>> On Tue, Mar 13, 2018 at 5:45 PM, Jhon Anderson Cardenas Diaz <
>>> jhonderson2...@gmail.com> wrote:
>>>
>>>> Does this new feature work only for yarn-cluster ?. Or for spark
>>>> standalone too ?
>>>>
>>> On Tue, Mar 13, 2018 at 6:34 PM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:
>>>>
>>> > Zeppelin version: 0.8.0 (merged at September 2017 version)
>>>>>
>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-2898 was merged end of
>>>>> September so not sure if you have that.
>>>>>
>>>>> Check out https://medium.com/@zjffdu/zeppelin-0-8-0-new-features-ea53e8810235 how to set this up.
>>>>>
>>>>>
>>>>> --
>>>>> Ruslan Dautkhanov
>>>>>
>>>>> On Tue, Mar 13, 2018 at 5:24 PM, Jhon Anderson Cardenas Diaz <
>>>>> jhonderson2...@gmail.com> wrote:
>>>>>
>>>> Hi zeppelin users !
>>>>>>
>>>>>> I am working with zeppelin pointing to a spark in standalone. I am
>>>>>> trying to figure out a way to make zeppelin runs the spark driver outside
>>>>>> of client process that submits the application.
>>>>>>
>>>>>> According with the documentation
>>>>>> (http://spark.apache.org/docs/2.1.1/spark-standalone.html):
>>>>>>
>>>>>> *For standalone clusters, Spark currently supports two deploy modes.
>>>>>> In client mode, the driver is launched in the same process as the client
>>>>>> that submits the application. In cluster mode, however, the driver is
>>>>>> launched from one of the Worker processes inside the cluster, and the
>>>>>> client process exits as soon as it fulfills its responsibility of
>>>>>> submitting the application without waiting for the application to 
>>>>>> finish.*
>>>>>>
>>>>>> The problem is that, even when I set the properties for
>>>>>> spark-standalone cluster and deploy mode in cluster, the driver still run
>>>>>> inside zeppelin machine (according with spark UI/executors page). These 
>>>>>> are
>>>>>> properties that I am setting for the spark interpreter:
>>>>>>
>>>>>> master: spark://:7077
>>>>>> spark.submit.deployMode: cluster
>>>>>> spark.executor.memory: 16g
>>>>>>
>>>>>> Any ideas would be appreciated.
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>> Details:
>>>>>> Spark version: 2.1.1
>>>>>> Zeppelin version: 0.8.0 (merged at September 2017 version)
>>>>>>
>>>>>


-- 
Thanks & Regards,
Ankit.


Re: Zeppelin - Spark Driver location

2018-03-13 Thread Ankit Jain
Hi Jhang,
What is the expected behavior with standalone cluster mode? Should we see 
separate driver processes in the cluster(one per user) or multiple SparkSubmit 
processes?

I was trying to dig in Zeppelin code & didn’t see where Zeppelin does the 
Spark-submit to the cluster? Can you please point to it?

Thanks
Ankit

> On Mar 13, 2018, at 5:25 PM, Jeff Zhang  wrote:
> 
> 
> ZEPPELIN-2898 is for yarn cluster mode.  And Zeppelin has an integration test 
> for yarn mode, so it is guaranteed to work. But we don't have a test for 
> standalone, so we are not sure about the behavior of standalone mode. 
> 
> 
> On Wed, Mar 14, 2018 at 8:06 AM, Ruslan Dautkhanov wrote:
>> https://github.com/apache/zeppelin/pull/2577 pronounces yarn-cluster in it's 
>> title so I assume it's only yarn-cluster.
>> Never used standalone-cluster myself. 
>> 
>> Which distro of Hadoop do you use?
>> Cloudera desupported standalone in CDH 5.5 and will remove in CDH 6.
>> https://www.cloudera.com/documentation/enterprise/release-notes/topics/rg_deprecated.html
>> 
>> 
>> 
>> -- 
>> Ruslan Dautkhanov
>> 
>>> On Tue, Mar 13, 2018 at 5:45 PM, Jhon Anderson Cardenas Diaz 
>>>  wrote:
>> 
>>> Does this new feature work only for yarn-cluster ?. Or for spark standalone 
>>> too ?
>> 
>>> On Tue, Mar 13, 2018 at 6:34 PM, Ruslan Dautkhanov wrote:
>> 
 > Zeppelin version: 0.8.0 (merged at September 2017 version)
 
 https://issues.apache.org/jira/browse/ZEPPELIN-2898 was merged end of 
 September so not sure if you have that.
 
 Check out 
 https://medium.com/@zjffdu/zeppelin-0-8-0-new-features-ea53e8810235 how to 
 set this up.
 
>> 
 
 -- 
 Ruslan Dautkhanov
 
>> 
> On Tue, Mar 13, 2018 at 5:24 PM, Jhon Anderson Cardenas Diaz 
>  wrote:
>> 
> Hi zeppelin users !
> 
> I am working with zeppelin pointing to a spark in standalone. I am trying 
> to figure out a way to make zeppelin runs the spark driver outside of 
> client process that submits the application.
> 
> According with the documentation 
> (http://spark.apache.org/docs/2.1.1/spark-standalone.html):
> 
> For standalone clusters, Spark currently supports two deploy modes. In 
> client mode, the driver is launched in the same process as the client 
> that submits the application. In cluster mode, however, the driver is 
> launched from one of the Worker processes inside the cluster, and the 
> client process exits as soon as it fulfills its responsibility of 
> submitting the application without waiting for the application to finish.
> 
> The problem is that, even when I set the properties for spark-standalone 
> cluster and deploy mode in cluster, the driver still run inside zeppelin 
> machine (according with spark UI/executors page). These are properties 
> that I am setting for the spark interpreter:
> 
> master: spark://:7077
> spark.submit.deployMode: cluster
> spark.executor.memory: 16g
> 
> Any ideas would be appreciated.
> 
> Thank you
> 
> Details:
> Spark version: 2.1.1
> Zeppelin version: 0.8.0 (merged at September 2017 version)


Re: How to pass a note to Zeppelin on invocation

2018-02-13 Thread ankit jain
You can save the notebooks in something like S3 and then copy those to
Zeppelin during restart.
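
For the S3 route, Zeppelin can also point its notebook storage directly at S3 instead
of copying on restart; a minimal zeppelin-env.sh sketch (bucket and user values are
placeholders):

export ZEPPELIN_NOTEBOOK_STORAGE="org.apache.zeppelin.notebook.repo.S3NotebookRepo"
export ZEPPELIN_NOTEBOOK_S3_BUCKET="my-zeppelin-notebooks"
export ZEPPELIN_NOTEBOOK_S3_USER="zeppelin"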

On Tue, Feb 13, 2018 at 9:28 AM, moon soo Lee  wrote:

> Hi,
>
> Currently we don't have a way I think. But it will be really nice to have.
> Especially with https://issues.apache.org/jira/browse/ZEPPELIN-2619,
> it'll be even better i think.
>
> Thanks,
> moon
>
> On Tue, Feb 6, 2018 at 8:10 AM Leon Katsnelson  wrote:
>
>> When we use Jupyter and JupyterLab we are able to pass the notebook as a
>> parameter so that when the Jupyter starts the notebook is already loaded.
>> We would like to do the same with Zeppelin 0.7.3.
>>
>> Is it possible and if so, how?
>>
>


-- 
Thanks & Regards,
Ankit.


Re: Last stable version

2018-02-02 Thread ankit jain
Any rough estimate Jeff - within next week or so or by end of Feb?

Thanks
Ankit

On Fri, Feb 2, 2018 at 6:35 AM, Jeff Zhang  wrote:

>
> It would be pretty soon, I am preparing the release of 0.8.0
>
>
> On Fri, Feb 2, 2018 at 10:08 PM, wilsonr guevara arevalo wrote:
>
>> Hi,
>>
>> I'm currently working with a zeppelin version 0.8.0-SNAPSHOT but I want
>> to be clear about what's the last stable version and when is there going to
>> be a new one?
>>
>> I understand that the current stable is 0.7.3. But, from the current
>> zeppelin code I'm seeing that there's already a version number 0.9.0 in
>> progress. When would be released the 0.8.0 stable version and how could I
>> know which changes would or wouldn't be in that version?
>>
>> Thanks in advance!
>>
>> *Wilson René Guevara Arévalo* | *Software Developer*
>> Skype: wilsonr990
>> wilsonr...@hotmail.com | wilsonr...@gmail.com
>>
>>
>>


-- 
Thanks & Regards,
Ankit.


Re: Extending SparkInterpreter functionality

2018-02-01 Thread Ankit Jain
Hi Jeff,
#3 is not about different spark versions. 

Same spark versions but multiple clusters. So based on logged in user, we may 
want to route to different spark cluster or even let user choose the spark he 
wants to connect to.

Will work with Jhon to create tickets on other #2.
What is the turn-around time for such tasks usually?
Are you okay with us working on those tickets?

Maybe we can setup a meeting early Monday to discuss our proposals in detail?

Thanks
Ankit

> On Feb 1, 2018, at 10:15 PM, Jeff Zhang  wrote:
> 
> 
> 1) Spark UI which works differently on EMR than standalone, so that logic 
> will be in an interpreter specific to emr.
>Could you create a ticket for that, and please add details of that ? I 
> don't know exactly what the difference between EMR and standalone, we can 
> expose api to allow customization if necessary. 
> 
> 
> 2) We want to add more metrics & logs in the interpreter, say number of 
> requests coming to the interpreter.
>Could you create a ticket for that as well ? I think it is not difficult 
> to do that. 
> 
> 3) Ideally we will like to connect to different spark clusters in 
> spark-submit and not tie to one which happens on Zeppelin startup right now.
> It is possible now already, you can create different spark interpreter for 
> different spark clusters. e.g. you can create spark_16 for spark 1.6 and 
> spark_22 for spark 2.2, and what you need to do is just setting SPARK_HOME 
> properly in their interpreter setting.
> 
> 
> On Fri, Feb 2, 2018 at 1:36 PM, Ankit Jain wrote:
>> This is exactly what we want Jeff! A hook to plug in our own interpreters.
>> (I am on same team as Jhon btw)
>> 
>> Right now there are too many concrete references and injecting stuff is not 
>> possible. 
>> 
>> Eg of customizations - 
>> 1) Spark UI which works differently on EMR than standalone, so that logic 
>> will be in an interpreter specific to emr.
>> 2) We want to add more metrics & logs in the interpreter, say number of 
>> requests coming to the interpreter.
>> 3) Ideally we will like to connect to different spark clusters in 
>> spark-submit and not tie to one which happens on Zeppelin startup right now.
>> 
>> Basically we want to add lot more flexibility.
>> 
>> We are building a platform to cater to multiple clients. So, multiple 
>> Zeppelin instances, multiple spark clusters, multiple Spark UIs and on top 
>> of that maintaining the security and privacy in a shared multi-tenant env 
>> will need all the flexibility we can get!
>> 
>> Thanks
>> Ankit
>> 
>>> On Feb 1, 2018, at 7:51 PM, Jeff Zhang  wrote:
>>> 
>>> 
>>> Hi Jhon,
>>> 
>>> Do you mind to share what kind of custom function you want to add to spark 
>>> interpreter ? One idea in my mind is that we could add extension point to 
>>> the existing SparkInterpreter, and user can enhance SparkInterpreter via 
>>> these extension point. That means we just open some interfaces and users 
>>> can implement those interfaces, and just add their jars to spark 
>>> interpreter folder.
>>> 
>>> 
>>> 
>>> On Fri, Feb 2, 2018 at 5:30 AM, Jhon Anderson Cardenas Diaz wrote:
>>>> Hello!
>>>> 
>>>> I'm a software developer and as part of a project I require to extend the 
>>>> functionality of SparkInterpreter without modifying it. I need instead 
>>>> create a new interpreter that extends it or wrap its functionality.
>>>> 
>>>> I also need the spark sub-interpreters to use my new custom interpreter, 
>>>> but the problem comes here, because the spark sub-interpreters has a 
>>>> direct dependency to spark interpreter as they use the class name of spark 
>>>> interpreter to obtain its instance:
>>>> 
>>>> 
>>>> private SparkInterpreter getSparkInterpreter() {
>>>> ...
>>>> Interpreter p = 
>>>> getInterpreterInTheSameSessionByClassName(SparkInterpreter.class.getName());
>>>> }
>>>> 
>>>> 
>>>> Approach without modify apache zeppelin
>>>> 
>>>> My current approach to solve is to create a SparkCustomInterpreter that 
>>>> override the getClassName method as follows:
>>>> 
>>>> public class SparkCustomInterpreter extends SparkInterpreter {
>>>> ...
>>>> 
>>>> @Override
>>>> public String getClassName() {
>>>> return SparkInterpreter.class.getName();
>>>> }

Re: Extending SparkInterpreter functionality

2018-02-01 Thread Ankit Jain
This is exactly what we want Jeff! A hook to plug in our own interpreters.
(I am on same team as Jhon btw)

Right now there are too many concrete references and injecting stuff is not 
possible. 

Eg of customizations - 
1) Spark UI which works differently on EMR than standalone, so that logic will 
be in an interpreter specific to emr.
2) We want to add more metrics & logs in the interpreter, say number of 
requests coming to the interpreter.
3) Ideally we would like to connect to different spark clusters in spark-submit 
and not be tied to the one chosen at Zeppelin startup right now.

Basically we want to add a lot more flexibility.
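
For item 2 above, a rough sketch of the wrapping idea - a delegating class that counts
requests before handing off to the stock SparkInterpreter. The class name and metric
handling are illustrative only, and this is still subject to the class-name lookup
problem Jhon describes below:

import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.zeppelin.interpreter.InterpreterContext;
import org.apache.zeppelin.interpreter.InterpreterResult;
import org.apache.zeppelin.spark.SparkInterpreter;

public class MeteredSparkInterpreter extends SparkInterpreter {
  private static final Logger LOG = LoggerFactory.getLogger(MeteredSparkInterpreter.class);
  private static final AtomicLong requests = new AtomicLong();

  public MeteredSparkInterpreter(Properties property) {
    super(property);
  }

  @Override
  public InterpreterResult interpret(String st, InterpreterContext context) {
    // count every paragraph submission; a real version would publish this to
    // whatever metrics backend the platform uses
    LOG.info("spark interpret request #{}", requests.incrementAndGet());
    return super.interpret(st, context);
  }
}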

We are building a platform to cater to multiple clients. So, multiple Zeppelin 
instances, multiple spark clusters, multiple Spark UIs and on top of that 
maintaining the security and privacy in a shared multi-tenant env will need all 
the flexibility we can get!

Thanks
Ankit

> On Feb 1, 2018, at 7:51 PM, Jeff Zhang  wrote:
> 
> 
> Hi Jhon,
> 
> Do you mind to share what kind of custom function you want to add to spark 
> interpreter ? One idea in my mind is that we could add extension point to the 
> existing SparkInterpreter, and user can enhance SparkInterpreter via these 
> extension point. That means we just open some interfaces and users can 
> implement those interfaces, and just add their jars to spark interpreter 
> folder.
> 
> 
> 
> On Fri, Feb 2, 2018 at 5:30 AM, Jhon Anderson Cardenas Diaz wrote:
>> Hello!
>> 
>> I'm a software developer and as part of a project I require to extend the 
>> functionality of SparkInterpreter without modifying it. I need instead 
>> create a new interpreter that extends it or wrap its functionality.
>> 
>> I also need the spark sub-interpreters to use my new custom interpreter, but 
>> the problem comes here, because the spark sub-interpreters has a direct 
>> dependency to spark interpreter as they use the class name of spark 
>> interpreter to obtain its instance:
>> 
>> 
>> private SparkInterpreter getSparkInterpreter() {
>> ...
>> Interpreter p = 
>> getInterpreterInTheSameSessionByClassName(SparkInterpreter.class.getName());
>> }
>> 
>> 
>> Approach without modify apache zeppelin
>> 
>> My current approach to solve is to create a SparkCustomInterpreter that 
>> override the getClassName method as follows:
>> 
>> public class SparkCustomInterpreter extends SparkInterpreter {
>> ...
>> 
>> @Override
>> public String getClassName() {
>> return SparkInterpreter.class.getName();
>> }
>> }
>> 
>> and put the new class name in the interpreter-setting.json file of spark:
>> 
>> [
>>   {
>> "group": "spark",
>> "name": "spark",
>> "className": "org.apache.zeppelin.spark.SparkCustomInterpreter",
>> ...
>> "properties": {...}
>>   }, ...
>> ]
>> 
>> The problem with this approach is that when I run a paragraph it fails. In 
>> general it fails because zeppelin uses both the class name of the instance 
>> and the getClassName() method to access the instance, and that causes many 
>> problems.
>> 
>> Approaches modifying apache zeppelin
>> 
>> There are two possible solutions related with the way in which the 
>> sub-interpreters get the SparkInterpreter instance class, one is getting the 
>> class name from a property:
>> 
>> 
>> private SparkInterpreter getSparkInterpreter() {
>> ...
>> Interpreter p = 
>> getInterpreterInTheSameSessionByClassName(property.getProperty("zeppelin.spark.mainClass",
>>  SparkInterpreter.class.getName()) );
>> }
>> And the other possibility is to modify the method 
>> Interpreter.getInterpreterInTheSameSessionByClassName(String) in order to 
>> return the instance that whether has the same class name specified in the 
>> parameter or which super class has the same class name specified in the 
>> parameter:
>> 
>> 
>> @ZeppelinApi
>> public Interpreter getInterpreterInTheSameSessionByClassName(String 
>> className) {
>>   synchronized (interpreterGroup) {
>> for (List<Interpreter> interpreters : interpreterGroup.values()) {
>>   
>>   for (Interpreter intp : interpreters) {
>> if (intp.getClassName().equals(className) || 
>> intp.getClass().getSuperclass().getName().equals(className)) {
>>   interpreterFound = intp;
>> }
>> 
>> ...
>>   }
>> 
>>   ...
>> }
>>   }
>>   return null;
>> }
>> 
>> Either of the two solutions would involve the modification of apache 
>> zeppelin code; do you think the change could be contributed to the 
>> community?, or maybe do you realize some other approach to change the way in 
>> which sub-interpreters of spark get the instance of spark interpreter?
>> 
>> Any information about it will be appreciated.
>> 
>> Greetings
>> 
>> 
>> Jhon


Re: Implementation of NotebookRepo

2018-01-26 Thread ankit jain
We can handle it the same way the notebook-security jar is handled, Jhon.
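
Whichever way the jar is packaged, Zeppelin picks the repo implementation up through
configuration; a minimal zeppelin-site.xml sketch, where the class name stands in for
the custom implementation:

<property>
  <name>zeppelin.notebook.storage</name>
  <value>com.example.notebook.repo.VersionedS3NotebookRepo</value>
</property>

The jar containing that class then just needs to be on the Zeppelin server classpath,
for example dropped into $ZEPPELIN_HOME/lib, much like the notebook-security jar.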

On Fri, Jan 26, 2018 at 2:21 PM, Jhon Anderson Cardenas Diaz <
jhonderson2...@gmail.com> wrote:

> Hi fellow Zeppelin users,
>
> I would like to create another implementation of
> org.apache.zeppelin.notebook.repo.NotebookRepo interface in order to
> persist the notebooks from zeppelin in S3 but in a versioned way (like a
> Git on S3).
>
> How do you recommend that i can add my jar file with the custom
> implementation to my zeppelin docker deployment?, is there maybe an
> zeppelin folder where i can put custom libraries or do i have to extend
> zeppelin class path?.
>
> Thanks & Regards,
> Jhon
>



-- 
Thanks & Regards,
Ankit.


Re: Custom Spark Interpreter?

2018-01-25 Thread ankit jain
Don't think that works, it just loads a blank page.

On Wed, Jan 24, 2018 at 11:06 PM, Jeff Zhang  wrote:

>
> But if you don't set it in interpreter setting, it would get spark ui url
> dynamically.
>
>
>
> On Thu, Jan 25, 2018 at 3:03 PM, ankit jain wrote:
>
>> That method is just reading it from a config defined in interpreter
>> settings called "uiWebUrl" which makes it configurable but still static.
>>
>> On Wed, Jan 24, 2018 at 10:58 PM, Jeff Zhang  wrote:
>>
>>>
>>> IIRC, spark interpreter can get web ui url at runtime instead of static
>>> url.
>>>
>>> https://github.com/apache/zeppelin/blob/master/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java#L940
>>>
>>>
>>> On Thu, Jan 25, 2018 at 2:55 PM, ankit jain wrote:
>>>
>>>> Issue with Spark UI when running on AWS EMR is it requires ssh
>>>> tunneling to be setup which requires private aws keys.
>>>>
>>>> Our team is building a analytic platform on zeppelin for end-users who
>>>> we obviously can't hand out these keys.
>>>>
>>>> Another issue is setting up correct port - Zeppelin tries to use 4040
>>>> for spark but during an interpreter restart 4040 could be used by an old
>>>> still stuck paragraph. In that case Zeppelin simply tries the next port and
>>>> so on.
>>>>
>>>> Static url for Spark can't handle this and hence requires some dynamic
>>>> implementation.
>>>>
>>>> PS - As I write this a lightbulb goes on in my head. I guess we could
>>>> also modify Zeppelin restart script to kill those rogue processes and make
>>>> sure 4040 is always available?
>>>>
>>>> Thanks
>>>> Ankit
>>>>
>>>> On Wed, Jan 24, 2018 at 6:10 PM, Jeff Zhang  wrote:
>>>>
>>>>>
>>>>> If Spark interpreter didn't give you the correct spark UI, this should
>>>>> be a bug, you can file a ticket to fix it. Although you can make a custom
>>>>> interpreter by extending the current spark interpreter, it is not a 
>>>>> trivial
>>>>> work.
>>>>>
>>>>>
>>>>> On Thu, Jan 25, 2018 at 8:07 AM, ankit jain wrote:
>>>>>
>>>>>> Hi fellow Zeppelin users,
>>>>>> Has anyone tried to write a custom Spark Interpreter perhaps
>>>>>> extending from the one that ships currently with zeppelin -
>>>>>> spark/src/main/java/org/apache/zeppelin/spark/
>>>>>> *SparkInterpreter.java?*
>>>>>>
>>>>>> We are coming across cases where we need the interpreter to do
>>>>>> "more", eg change getSparkUIUrl() to directly load Yarn
>>>>>> ResourceManager/proxy/application_id123 rather than a fixed web ui.
>>>>>>
>>>>>> If we directly modify Zeppelin source code, upgrading to new zeppelin
>>>>>> versions will be a mess.
>>>>>>
>>>>>> Before we get too deep into it, wanted to get thoughts of the
>>>>>> community.
>>>>>>
>>>>>> What is a "clean" way to do such changes?
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Ankit.
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Ankit.
>>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Ankit.
>>
>


-- 
Thanks & Regards,
Ankit.


Re: Custom Spark Interpreter?

2018-01-24 Thread ankit jain
That method is just reading it from a config defined in interpreter
settings called "uiWebUrl" which makes it configurable but still static.

On Wed, Jan 24, 2018 at 10:58 PM, Jeff Zhang  wrote:

>
> IIRC, spark interpreter can get web ui url at runtime instead of static
> url.
>
> https://github.com/apache/zeppelin/blob/master/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java#L940
>
>
> On Thu, Jan 25, 2018 at 2:55 PM, ankit jain wrote:
>
>> Issue with Spark UI when running on AWS EMR is it requires ssh tunneling
>> to be setup which requires private aws keys.
>>
>> Our team is building a analytic platform on zeppelin for end-users who we
>> obviously can't hand out these keys.
>>
>> Another issue is setting up correct port - Zeppelin tries to use 4040 for
>> spark but during an interpreter restart 4040 could be used by an old still
>> stuck paragraph. In that case Zeppelin simply tries the next port and so on.
>>
>> Static url for Spark can't handle this and hence requires some dynamic
>> implementation.
>>
>> PS - As I write this a lightbulb goes on in my head. I guess we could
>> also modify Zeppelin restart script to kill those rogue processes and make
>> sure 4040 is always available?
>>
>> Thanks
>> Ankit
>>
>> On Wed, Jan 24, 2018 at 6:10 PM, Jeff Zhang  wrote:
>>
>>>
>>> If Spark interpreter didn't give you the correct spark UI, this should
>>> be a bug, you can file a ticket to fix it. Although you can make a custom
>>> interpreter by extending the current spark interpreter, it is not a trivial
>>> work.
>>>
>>>
>>> On Thu, Jan 25, 2018 at 8:07 AM, ankit jain wrote:
>>>
>>>> Hi fellow Zeppelin users,
>>>> Has anyone tried to write a custom Spark Interpreter perhaps extending
>>>> from the one that ships currently with zeppelin -
>>>> spark/src/main/java/org/apache/zeppelin/spark/*SparkInterpreter.java?*
>>>>
>>>> We are coming across cases where we need the interpreter to do "more",
>>>> eg change getSparkUIUrl() to directly load Yarn 
>>>> ResourceManager/proxy/application_id123
>>>> rather than a fixed web ui.
>>>>
>>>> If we directly modify Zeppelin source code, upgrading to new zeppelin
>>>> versions will be a mess.
>>>>
>>>> Before we get too deep into it, wanted to get thoughts of the community.
>>>>
>>>> What is a "clean" way to do such changes?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Ankit.
>>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Ankit.
>>
>


-- 
Thanks & Regards,
Ankit.


Re: Custom Spark Interpreter?

2018-01-24 Thread ankit jain
Issue with Spark UI when running on AWS EMR is it requires ssh tunneling to
be setup which requires private aws keys.

Our team is building an analytics platform on Zeppelin for end-users, to whom we
obviously can't hand out these keys.

Another issue is setting up correct port - Zeppelin tries to use 4040 for
spark but during an interpreter restart 4040 could be used by an old still
stuck paragraph. In that case Zeppelin simply tries the next port and so on.

Static url for Spark can't handle this and hence requires some dynamic
implementation.

PS - As I write this a lightbulb goes on in my head. I guess we could also
modify Zeppelin restart script to kill those rogue processes and make sure
4040 is always available?

Thanks
Ankit

On Wed, Jan 24, 2018 at 6:10 PM, Jeff Zhang  wrote:

>
> If Spark interpreter didn't give you the correct spark UI, this should be
> a bug, you can file a ticket to fix it. Although you can make a custom
> interpreter by extending the current spark interpreter, it is not a trivial
> work.
>
>
> On Thu, Jan 25, 2018 at 8:07 AM, ankit jain wrote:
>
>> Hi fellow Zeppelin users,
>> Has anyone tried to write a custom Spark Interpreter perhaps extending
>> from the one that ships currently with zeppelin -
>> spark/src/main/java/org/apache/zeppelin/spark/*SparkInterpreter.java?*
>>
>> We are coming across cases where we need the interpreter to do "more", eg
>> change getSparkUIUrl() to directly load Yarn 
>> ResourceManager/proxy/application_id123
>> rather than a fixed web ui.
>>
>> If we directly modify Zeppelin source code, upgrading to new zeppelin
>> versions will be a mess.
>>
>> Before we get too deep into it, wanted to get thoughts of the community.
>>
>> What is a "clean" way to do such changes?
>>
>> --
>> Thanks & Regards,
>> Ankit.
>>
>


-- 
Thanks & Regards,
Ankit.


Custom Spark Interpreter?

2018-01-24 Thread ankit jain
Hi fellow Zeppelin users,
Has anyone tried to write a custom Spark Interpreter perhaps extending from
the one that ships currently with zeppelin - spark/src/main/java/org/
apache/zeppelin/spark/*SparkInterpreter.java?*

We are coming across cases where we need the interpreter to do "more", eg
change getSparkUIUrl() to directly load Yarn
ResourceManager/proxy/application_id123 rather than a fixed web ui.
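
For instance, a rough sketch of the kind of override we have in mind, assuming
getSparkUIUrl() can be overridden in a subclass and that the YARN proxy address is
known to the deployment (the class and property names below are made up):

import java.util.Properties;
import org.apache.zeppelin.spark.SparkInterpreter;

public class EmrSparkInterpreter extends SparkInterpreter {

  public EmrSparkInterpreter(Properties property) {
    super(property);
  }

  @Override
  public String getSparkUIUrl() {
    // build the YARN ResourceManager proxy URL from the running application id
    // instead of returning a statically configured uiWebUrl
    String rmProxy = getProperty("zeppelin.spark.yarnProxyUrl");  // hypothetical property
    return rmProxy + "/proxy/" + getSparkContext().applicationId() + "/";
  }
}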

If we directly modify Zeppelin source code, upgrading to new zeppelin
versions will be a mess.

Before we get too deep into it, wanted to get thoughts of the community.

What is a "clean" way to do such changes?

-- 
Thanks & Regards,
Ankit.


Accessing Spark UI from Zeppelin

2017-12-14 Thread ankit jain
Hi Zeppelin users,

We are following https://issues.apache.org/jira/browse/ZEPPELIN-2949 to
launch spark ui.

Our Zeppelin instance is deployed on AWS EMR master node and setting
zeppelin.spark.uiWebUrl to a url which elb maps to https://masternode:4040.

When user clicks on spark url within Zeppelin it redirects him to Yarn RM(
something like http://masternode:20888/proxy/application_1511906080313_0023/)
which fails to load.

Usually to access EMR Web interfaces requires to setup a SSH tunnel and
change proxy settings in the browser -
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-history.html

Is there a way we can avoid users having to setup ssh tunnel and allow
direct access to Spark UI?

Ideally, we will implement a filter which does authentication on the user
and then redirects to the Spark UI – right now we are not sure what the redirect URL
should be.
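
A bare sketch of that filter idea using the standard servlet API - the authentication
check and the redirect target are placeholders, and as noted above the right redirect
URL is exactly the open question:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SparkUiRedirectFilter implements Filter {

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) res;
    if (request.getUserPrincipal() == null) {
      // placeholder check - reuse whatever authentication Zeppelin/Shiro already provides
      response.sendError(HttpServletResponse.SC_UNAUTHORIZED);
      return;
    }
    // placeholder target - the per-application proxy URL would have to be resolved here
    response.sendRedirect("https://masternode:4040/");
  }

  @Override public void init(FilterConfig filterConfig) { }
  @Override public void destroy() { }
}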


-- 
Thanks & Regards,
Ankit.