Re: [DISCUSS] Share Data in Zeppelin

Sanjay Dasgupta Thu, 12 Jul 2018 19:52:50 -0700

I prefer 2.b also. Could we use (save*Result*AsTable=people) instead?

There are a few typos in the example note shared:


1) The line val peopleDF = spark.read.format("zeppelin").load() should
mention the table name (possibly as argument to load?)
2) The python line val peopleDF = z.getTable("people").toPandas() should
not have the val


The z.getTable(<table-name>) method could be a very good tool to judge
which use-cases are important in the community. It is easy to implement for
the in-memory data case, and could be very useful for many situations where
a small amount of data is being transferred across interpreters (like the
jdbc -> matplotlib case mentioned).
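
A rough sketch of what the in-memory case could look like, in plain Python.
All names here (InMemoryResourcePool, Table, save_as_table, get_table,
to_records) are made up for illustration and are not Zeppelin's API. Rows are
held as plain tuples rather than as a pandas DataFrame, to keep the sketch
free of dependencies.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


@dataclass(frozen=True)
class Table:
    """Hypothetical in-memory table: column names plus row tuples."""
    columns: Sequence[str]
    rows: Sequence[Sequence[Any]]

    def to_records(self) -> List[Dict[str, Any]]:
        # One dict per row, keyed by column name.
        return [dict(zip(self.columns, row)) for row in self.rows]


class InMemoryResourcePool:
    """Hypothetical stand-in for a shared pool; not Zeppelin's ResourcePool."""

    def __init__(self) -> None:
        self._tables: Dict[str, Table] = {}

    def save_as_table(self, name: str, columns, rows) -> None:
        # Copy into tuples so later changes by the producer do not leak through.
        self._tables[name] = Table(tuple(columns), tuple(tuple(r) for r in rows))

    def get_table(self, name: str) -> Table:
        # Look up by a readable name instead of a noteId/paragraphId pair.
        if name not in self._tables:
            raise KeyError(f"no shared table named {name!r}")
        return self._tables[name]


pool = InMemoryResourcePool()
# Roughly what a %jdbc(saveAsTable=people) paragraph might register.
pool.save_as_table("people", ["name", "age"], [("Ann", 31), ("Bob", 27)])
# Roughly what a %python paragraph might read back by name.
people = pool.get_table("people").to_records()
```

Because the lookup key is a user-chosen name, a consumer written this way
would not depend on noteId or paragraphId values.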

Thanks,
Sanjay

On Fri, Jul 13, 2018 at 8:07 AM, Jongyoul Lee <[email protected]> wrote:

> Yes, it's similar to 2.b.
>
> Basically, my concern is handling all kinds of data. Your proposal looks
> like it focuses on table data. That's useful too, but it would be better
> to handle all data, including tables and plain text. WDYT?
>
> About storage, we could discuss it later.
>
> On Fri, Jul 13, 2018 at 11:25 AM, Jeff Zhang <[email protected]> wrote:
>
>>
>> I think your use case is the same as 2.b. Personally, I don't recommend
>> using z.get(noteId, paragraphId) to get the shared data, for 2 reasons:
>> 1. noteId and paragraphId are meaningless to a reader, so the code is hard
>>    to follow.
>> 2. The note will break if we clone it, because the noteId changes.
>> That's why I suggest using a paragraph property to save the paragraph's
>> result.
>>
>> Regarding the intermediate storage, I also thought about it and agree that
>> in the long term we should provide such a layer to support large data.
>> Currently we put the shared data in memory, which is not a scalable
>> solution. One candidate in my mind is Alluxio [1], and regarding the data
>> format I think Apache Arrow [2] is another good option for Zeppelin to
>> share table data across interpreter processes and different languages. But
>> these are all implementation details, and I think we can talk about them in
>> another thread. In this thread, I think we should focus on the user-facing
>> API.
>>
>>
>> [1] http://www.alluxio.org/
>> [2] https://arrow.apache.org/
>>
>>
>>
>> On Fri, Jul 13, 2018 at 10:11 AM, Jongyoul Lee <[email protected]> wrote:
>>
>>> I have a slightly different idea for sharing data.
>>>
>>> In my case,
>>>
>>> It would be very useful to get a paragraph's result as an input of other
>>> paragraphs.
>>>
>>> e.g.
>>>
>>> -- Paragraph 1
>>> %jdbc
>>> select * from some_table;
>>>
>>> -- Paragraph 2
>>> %spark
>>> val rdd = z.get("noteId", "paragraphId").parse.makeRddByMyself
>>> spark.read(table).select....
>>>
>>> If paragraph 1's result is too big to show on the FE, it would be saved on
>>> the Zeppelin server in a suitable way and passed to SparkInterpreter when
>>> Paragraph 2 is executed.
>>>
>>> Basically, I think we need intermediate storage to store paragraphs'
>>> results in order to share them. We can introduce another layer or extend
>>> NotebookRepo. In some cases, we might change notebook repos as well.
>>>
>>> JL
>>>
>>>
>>>
>>> On Fri, Jul 13, 2018 at 10:39 AM, Jeff Zhang <[email protected]> wrote:
>>>
>>>> Hi Folks,
>>>>
>>>> Recently, there have been several tickets [1][2][3] about sharing data in
>>>> Zeppelin. Zeppelin's goal is to be a unified data analytics platform that
>>>> integrates most of the big data tools and helps users switch between tools
>>>> and share data between them easily. So sharing data is a very critical and
>>>> killer feature of Zeppelin, IMHO.
>>>>
>>>> I am starting this thread to discuss the scenarios for sharing data and
>>>> how to do it. Although Zeppelin already provides tools and APIs to share
>>>> data, I don't think they are mature and stable enough. After seeing these
>>>> tickets, I think it might be a good time to talk about it in the community
>>>> and gather more feedback, so that we can provide a more stable and mature
>>>> approach.
>>>>
>>>> Currently, there are 3 approaches to sharing data between interpreters and
>>>> interpreter processes.
>>>> 1. Sharing data across interpreters in the same interpreter process, like
>>>> sharing data via the same SparkContext in %spark, %spark.pyspark and
>>>> %spark.r.
>>>> 2. Sharing data between frontend and backend via angularObject
>>>> 3. Sharing data across interpreter processes via Zeppelin's ResourcePool
>>>>
>>>> For this thread, I would like to talk about approach 3 (sharing data
>>>> via Zeppelin's ResourcePool).
>>>>
>>>> Here's my current thinking on sharing data.
>>>> 1. What kind of data would be shared?
>>>>    IMHO, users would share 2 kinds of data: primitive data (string,
>>>>    number) and table data.
>>>>
>>>> 2. How to write shared data?
>>>>     Users may want to share data via 2 approaches:
>>>>     a. Use ZeppelinContext (e.g. z.put).
>>>>     b. Share the paragraph result via paragraph properties. E.g. a user
>>>>     may want to read data from an Oracle database via the jdbc
>>>>     interpreter and then do plotting in the python interpreter. In such a
>>>>     scenario, they can save the jdbc result in the ResourcePool via a
>>>>     paragraph property and then read it via z.get. Here's one simple
>>>>     example (not implemented yet):
>>>>
>>>>         %jdbc(saveAsTable=people)
>>>>         select * from oracle_table
>>>>
>>>>         %python
>>>>         z.getTable("people").toPandas()
>>>>
>>>> 3. How to read shared data?
>>>>     Users also have 2 approaches to read the shared data.
>>>>     a. Via ZeppelinContext. (e.g.  z.get, z.getTable)
>>>>     b. Via variable substitution [1]
>>>>
>>>> Here's one sample note which illustrates the scenario of sharing data.
>>>> https://www.zepl.com/viewer/notebooks/bm90ZTovL3pqZmZkdS8zMzkxZjg3YmFhMjg0MDY3OGM1ZmYzODAwODAxMGJhNy9ub3RlLmpzb24
>>>>
>>>> This is just my current thinking on sharing data in Zeppelin. It
>>>> definitely doesn't cover all the scenarios, so I am raising this thread
>>>> for discussion in the community. Any feedback and comments are welcome.
>>>>
>>>>
>>>> [1]. https://issues.apache.org/jira/browse/ZEPPELIN-3377
>>>> [2]. https://issues.apache.org/jira/browse/ZEPPELIN-3596
>>>> [3]. https://issues.apache.org/jira/browse/ZEPPELIN-3617
>>>>
>>>
>>>
>>>
>>> --
>>> 이종열, Jongyoul Lee, 李宗烈
>>> http://madeng.net
>>>
>>
>
>
> --
> 이종열, Jongyoul Lee, 李宗烈
> http://madeng.net
>
