Hi Park,

Thanks for sharing, this is a very interesting and innovative idea. I
have several comments and concerns.

1. What does the resource registration mean?
   IIUC, it currently means the data would be cached in the interpreter
process. That could become a memory issue as more and more resources are
registered. Maybe we could introduce a resource retention mechanism, or
cache the data in other formats (just like the Spark table cache policy,
where the user can specify how to cache the data: memory, disk, etc.), as
in the sketch below.
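
Here is a minimal sketch of what such a registration API could look like
(CachedResourcePool, ResourceStorageLevel and the TTL parameter are all
hypothetical names, only to illustrate the retention/cache-format idea;
this is not the existing Zeppelin API):

    import java.time.Duration;

    // Hypothetical cache levels, analogous to Spark's StorageLevel.
    enum ResourceStorageLevel { MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY }

    interface CachedResourcePool {
        // Register a table result with an explicit storage level and a
        // retention period, so the interpreter process is not forced to
        // keep every registered resource on the heap forever.
        void register(String name, Object tableData,
                      ResourceStorageLevel level, Duration ttl);
    }

    // e.g. pool.register("sales_2017", result,
    //                    ResourceStorageLevel.MEMORY_AND_DISK,
    //                    Duration.ofHours(24));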

2. The scope of resource sharing
   For now, it seems resources are globally shared. But I think user-level
sharing might be more common, in which case we would need to create a
namespace for each user. That means the same resource name could exist in
different user namespaces.
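
For example, the pool key could be qualified by user (a hypothetical
scheme, just to show how names in different user namespaces would not
collide):

    // Hypothetical resource key: scope the resource name by user so the
    // same name can be registered by different users.
    final class ResourceKey {
        final String user;  // e.g. "alice"
        final String name;  // e.g. "sales_2017"

        ResourceKey(String user, String name) {
            this.user = user;
            this.name = name;
        }

        // "user:alice:sales_2017" vs "user:bob:sales_2017" do not collide.
        String qualifiedName() {
            return "user:" + user + ":" + name;
        }
    }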

3. The data route might cause performance issues.
   From the diagram, if the Spark interpreter needs to access a resource
from the JDBC interpreter, the data first needs to be sent to the Zeppelin
server, and then the Zeppelin server sends the data to the Spark
interpreter. This kind of data route introduces extra overhead, and the
Zeppelin server would become a bottleneck and require a large amount of
memory when there are many resources shared across users/interpreters. So
I would suggest the following approach: the Zeppelin server controls only
the metadata and ACL of resources, and the Spark interpreter fetches the
data from the JDBC interpreter directly instead of through the Zeppelin
server. Here's the sequence (a sketch of the RPC surface follows):
       1). SparkInterpreter asks for the metadata and token for the
resource.
       2). Zeppelin server checks whether this SparkInterpreter has
permission to access the resource; if yes, it sends the metadata and
token to SparkInterpreter. The metadata includes the RPC address of the
JdbcInterpreter, and the token is for security.
       3). SparkInterpreter asks JdbcInterpreter for the resource with the
token and metadata received in step 2.
       4). JdbcInterpreter verifies the token and sends the data to
SparkInterpreter.
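
To make the flow concrete, here is a rough sketch of the RPC surface it
would need (all names are hypothetical, not existing Zeppelin classes):

    // Steps 1/2: the caller asks the Zeppelin server for access; the
    // server checks the ACL and returns the owner's address plus a token.
    interface ZeppelinServerRpc {
        ResourceGrant requestAccess(String callerInterpreter,
                                    String resourceName);
    }

    final class ResourceGrant {
        final String rpcHost;  // e.g. the JdbcInterpreter's host
        final int rpcPort;
        final String token;    // short-lived, verified by owner in step 4

        ResourceGrant(String rpcHost, int rpcPort, String token) {
            this.rpcHost = rpcHost;
            this.rpcPort = rpcPort;
            this.token = token;
        }
    }

    // Steps 3/4: SparkInterpreter calls the owning interpreter directly;
    // JdbcInterpreter verifies the token before streaming the data back,
    // so no table data ever passes through the Zeppelin server.
    interface InterpreterRpc {
        byte[] fetchResource(String resourceName, String token);
    }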


On Tue, Jun 13, 2017 at 11:53 AM, Khalid Huseynov <kha...@apache.org> wrote:

> Thanks for the questions guys!
>
> @Jun Kim actually that feature was originally discussed and was put into
> the backlog, since the proposal was more about tables processed by
> interpreters and their sharing. However, having quick visualisation on the
> fly for not-so-large data indeed makes sense, and possibly could be done by
> importing the data into some interpreter by default (Spark, Python, etc.).
> So I believe it can be done once the initial basics for resource sharing
> are completed.
>
> @Andrea Santurbano there should be a listing of tables with schema info,
> but I'm not sure exactly what you mean by a drop-down feature between
> tables in the UI. Could you give a little more detail/an example on that,
> as well as the enhancements on the graph part you meant?
>
>
> On Mon, Jun 12, 2017 at 4:01 PM, Andrea Santurbano <sant...@gmail.com>
> wrote:
>
>> Hi guys,
>> this is great! I think this can also enable some drop-down feature
>> between tables in the UI...
>> Do you think these enhancements can also include the graph part?
>>
>> Andrea
>>
>> On Mon, Jun 12, 2017 at 05:47, Jun Kim <i2r....@gmail.com> wrote:
>>
>>> All of the enhancements look great to me!
>>>
>>> And I wish for a feature to upload a small CSV file (maybe about
>>> 20MB?) and play with it directly.
>>> It would be great if I could drag a file to Zeppelin and register it as
>>> a table.
>>>
>>> Thanks :)
>>>
>>> On Mon, Jun 12, 2017 at 11:40 AM, Park Hoon <1am...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Recently, ZEPPELIN-753
>>>> <https://issues.apache.org/jira/browse/ZEPPELIN-753> (Tabledata
>>>> abstraction) and ZEPPELIN-2020
>>>> <https://issues.apache.org/jira/browse/ZEPPELIN-2020> (Remote method
>>>> invocation for resources) were resolved.
>>>> Based on this work, we can improve Zeppelin with the following
>>>> enhancements:
>>>>
>>>> * register the table result as a shared resource
>>>> * list all available (registered) tables
>>>> * preview tables including their meta information (e.g. columns, types, ...)
>>>> * download registered tables as CSV and other formats
>>>> * pivoting/filtering in the backend to transform larger data
>>>> * cross-join tables in different interpreters (e.g. the Spark
>>>> interpreter uses a table result generated by the JDBC interpreter)
>>>>
>>>> You can find the full proposal in Extending Table Data API
>>>> <https://cwiki.apache.org/confluence/display/ZEPPELIN/Proposal%3A+Extending+TableData+API>,
>>>> which was contributed by @1ambda, @khalidhuseynov, @Leemoonsoo.
>>>>
>>>> Any questions, feedback, or discussion are welcome.
>>>>
>>>>
>>>> Thanks.
>>>>
>>> --
>>> Taejun Kim
>>>
>>> Data Mining Lab.
>>> School of Electrical and Computer Engineering
>>> University of Seoul
>>>
>>
>
