Re: [DISCUSSION] Extending TableData API

2017-06-14 Thread Jeff Zhang
>>> But I'm not sure how other interpreters can do the same thing (e.g.
it's trivial, but let's think about the shell interpreter, which keeps its
table data in memory)

The approach I proposed is general across all interpreters. What we need
to do is add one method to RemoteInterpreterProcess that lets other
interpreters fetch resources.
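
A rough sketch of what I mean (the interface and method names below are
assumptions for illustration, not the actual Zeppelin API):

import java.nio.ByteBuffer;

// Hypothetical single entry point added on the RemoteInterpreterProcess
// side: ZeppelinServer calls it on behalf of whichever interpreter asked,
// and the owning interpreter process answers with the serialized resource
// (e.g. a TableData).
interface ResourceFetching {
  // returns null when nothing is registered under that name
  ByteBuffer fetchResource(String resourceName);
}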

>>> Some people might wonder why we do not use external storage to persist
(large) table resources instead of keeping them in the ZeppelinServer's
memory.

It is fine to use memory for now, but we should leave an interface there
for other storage backends. We could start with just a MemoryStorage and
add other implementations in the future.
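
Something minimal like this would be enough to start (a sketch only; the
names are not final):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The storage seam: callers in ZeppelinServer depend only on this
// interface, so a disk- or external-store-backed implementation can be
// swapped in later without touching them.
interface ResourceStorage {
  void put(String name, byte[] data);
  byte[] get(String name);   // null when absent
  void remove(String name);
}

// The only implementation for now: keeps everything on the heap.
class MemoryStorage implements ResourceStorage {
  private final Map<String, byte[]> store = new ConcurrentHashMap<>();
  public void put(String name, byte[] data) { store.put(name, data); }
  public byte[] get(String name) { return store.get(name); }
  public void remove(String name) { store.remove(name); }
}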


On Wed, Jun 14, 2017 at 10:22 PM, Park Hoon <1am...@gmail.com> wrote:

> @Jeff, Thanks for sharing your opinions and important questions.
>
>
> > Q1. What does the resource registration mean? IIUC, currently it means
> it would cache the data in the interpreter process. Then it might become a
> memory issue as more and more resources are registered. Maybe we could
> introduce a resource retention mechanism, or cache the data in other forms
> (just like the Spark table cache policy, where the user can specify how to
> cache the data: in memory, on disk, etc.).
>
> A1. It depends on each interpreter's implementation of TableData.
> For example:
>
> If the JDBC interpreter keeps only the SQL from a paragraph to reproduce
> the table, we don't need to persist the whole table in memory, on the
> file system, or in external storage. That's what Section 3.2 describes.
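>
> For illustration, a JDBC-style TableData could keep just the query,
> roughly like this (a sketch only; the class and method names here are
> assumptions, not the actual interfaces):
>
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.ResultSet;
> import java.sql.SQLException;
> import java.sql.Statement;
>
> // Sketch: instead of materializing rows, keep only what is needed to
> // re-run the query. The memory cost is a few strings, no matter how
> // large the table is.
> class JdbcTableData {
>   private final String jdbcUrl;
>   private final String sql;
>
>   JdbcTableData(String jdbcUrl, String sql) {
>     this.jdbcUrl = jdbcUrl;
>     this.sql = sql;
>   }
>
>   // rows are produced on demand by replaying the query; the caller is
>   // responsible for closing the ResultSet and its connection
>   ResultSet rows() throws SQLException {
>     Connection conn = DriverManager.getConnection(jdbcUrl);
>     Statement stmt = conn.createStatement();
>     return stmt.executeQuery(sql);
>   }
> }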
>
>
> > Q2. The scope of resource sharing. For now, it seems to be globally
> shared, but I think user-level sharing might be more common. Then we need
> to create a namespace for each user, which means the same resource name
> could exist in different user namespaces.
>
> A2. Regarding the namespace concept, the proposal only describes what the
> table resource name should be (Section 5.3), not namespaces.
>
> A namespace could be the name of a note or something custom (e.g. a
> per-user namespace). We can discuss this.
>
> Personally, +1 for having namespaces, because they are helpful for
> searching and sharing. This might be covered by `ResourceRegistry`.
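>
> For example, a namespaced resource name might be built like this (just a
> sketch of the idea, not the proposal's actual naming format):
>
> // Sketch: qualify table names with a namespace (a note id, a user, ...)
> // so that "sales" registered in two places never collides.
> class ResourceName {
>   static String qualified(String namespace, String table) {
>     return namespace + ":" + table;
>   }
> }
>
> // e.g. qualified("note_2ABC", "sales") -> "note_2ABC:sales"
> //      qualified("user_jeff", "sales") -> "user_jeff:sales"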
>
>
> > Q3. The data route might cause a performance issue. From the diagram, if
> the Spark interpreter needs to access a resource from the JDBC interpreter,
> the data first needs to be sent to the Zeppelin server, and then the
> Zeppelin server sends the data to the Spark interpreter. This kind of data
> route introduces extra overhead, and the Zeppelin server will become a
> bottleneck and require a lot of memory when there are many resources to be
> shared across users/interpreters. So I would suggest the following
> approach: the Zeppelin server controls only the metadata and ACL of
> resources, and the Spark interpreter fetches data from the JDBC interpreter
> directly instead of through the Zeppelin server. Here's the sequence:
>    1). SparkInterpreter asks for the metadata and a token for the resource.
>    2). ZeppelinServer checks whether this SparkInterpreter has permission
> to access the resource; if yes, it sends the metadata and token to
> SparkInterpreter. The metadata includes the RPC address of the
> JdbcInterpreter, and the token is for security.
>    3). SparkInterpreter asks JdbcInterpreter for the resource using the
> token and metadata received in step 2.
>    4). JdbcInterpreter verifies the token and sends the data to
> SparkInterpreter.
>
> A3. +1 for the Spark interpreter accessing JDBC directly, since that is
> better for handling large data. But I'm not sure how other interpreters
> can do the same thing (e.g. it's trivial, but let's think about the shell
> interpreter, which keeps its table data in memory)
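>
> To make that sequence concrete, the brokering part could look roughly
> like this (a sketch; none of these types exist in Zeppelin today):
>
> // Steps 1-2: ZeppelinServer only brokers metadata plus a token; the
> // data itself then flows interpreter-to-interpreter.
> class ResourceTicket {
>   final String rpcHost;  // where the owning interpreter listens
>   final int rpcPort;
>   final String token;    // one-time credential, verified in step 4
>
>   ResourceTicket(String rpcHost, int rpcPort, String token) {
>     this.rpcHost = rpcHost;
>     this.rpcPort = rpcPort;
>     this.token = token;
>   }
> }
>
> interface ResourceBroker {
>   // the caller asks for access; the server checks the ACL and, if
>   // permitted, answers with the owner's address and a token
>   ResourceTicket requestAccess(String callerInterpreter, String resourceName);
> }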
>
>
> --
>
> Some people might wonder why we do not use external storage to persist
> (large) table resources instead of keeping them in the ZeppelinServer's
> memory.
>
> The authors originally discussed whether to have external storage or not.
> But having external storage
>
> - requires (a lot of) additional dependencies (Geode? Redis? HDFS? Which
> one should we use? Or support all of them?)
> - even with external storage, we might not be able to persist 400 GB or
> 10 TB.
>
> Thus, the proposal was written to
>
> - utilize the interpreter's own storage (e.g. the Spark cluster for the
> Spark interpreter)
> - keep the minimal information needed to reproduce the table result (e.g.
> keeping only the query), without relying on external storage at first.
>
>
> And now we are discussing it. I hope we can improve the proposal and turn
> it into a real implementation soon. :)
>
>
>
> Thanks.
>
>
>
>
> On Wed, Jun 14, 2017 at 12:20 PM, Jeff Zhang  wrote:
>
>>
>> Hi Park,
>>
>> Thanks for sharing; this is a very interesting and innovative idea. I
>> have several comments and concerns.
>>
>> 1. What does the resource registration mean?
>>    IIUC, currently it means it would cache the data in the Interpreter
>> Process. Then it

Re: [DISCUSSION] Extending TableData API

2017-06-12 Thread Khalid Huseynov
Thanks for the questions, guys!

@Jun Kim actually that feature was originally discussed and put into the
backlog, since the proposal was more about tables processed by interpreters
and sharing them. However, quick on-the-fly visualization of not-so-large
data does make sense, and could possibly be done by importing the data into
some interpreter by default (Spark, Python, etc.). So I believe it can be
done once the initial basics of resource sharing are completed.

@Andrea Santurbano there should be a listing of tables with schema info,
but I'm not sure exactly what you mean by a drop-down feature between
tables in the UI. Could you give a little more detail or an example of
that, as well as the enhancements to the graph part you meant?


On Mon, Jun 12, 2017 at 4:01 PM, Andrea Santurbano  wrote:

> Hi guys,
> this is great! I think this can also enable a drop-down feature between
> tables in the UI...
> Do you think these enhancements can also include the graph part?
>
> Andrea
>
> On Mon, Jun 12, 2017 at 05:47, Jun Kim  wrote:
>
>> All of the enhancements look great to me!
>>
>> And I wish for a feature to upload a small CSV file (maybe about 20 MB?)
>> and play with it directly.
>> It would be great if I could drag a file into Zeppelin and register it
>> as a table.
>>
>> Thanks :)
>>
>> On Mon, Jun 12, 2017 at 11:40 AM, Park Hoon <1am...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Recently, ZEPPELIN-753
>>>  (Tabledata
>>> abstraction) and ZEPPELIN-2020
>>>  (Remote method
>>> invocation for resources) were resolved.
>>> Based on this work, we can improve Zeppelin with the following
>>> enhancements:
>>>
>>> * register the table result as a shared resource
>>> * list all available (registered) tables
>>> * preview tables including their meta information (e.g. columns,
>>> types, ...)
>>> * download registered tables as CSV and other formats
>>> * pivot/filter in the backend to transform larger data
>>> * cross-join tables from different interpreters (e.g. the Spark
>>> interpreter uses a table result generated by the JDBC interpreter)
>>>
>>> You can find the full proposal in Extending Table Data API
>>> 
>>> which was contributed by @1ambda, @khalidhuseynov, @Leemoonsoo.
>>>
>>> Any questions, feedback, or discussion are welcome.
>>>
>>>
>>> Thanks.
>>>
>> --
>> Taejun Kim
>>
>> Data Mining Lab.
>> School of Electrical and Computer Engineering
>> University of Seoul
>>
>


Re: [DISCUSSION] Extending TableData API

2017-06-12 Thread Andrea Santurbano
Hi guys,
this is great! I think this can also enable a drop-down feature between
tables in the UI...
Do you think these enhancements can also include the graph part?

Andrea

On Mon, Jun 12, 2017 at 05:47, Jun Kim  wrote:

> All of the enhancements look great to me!
>
> And I wish for a feature to upload a small CSV file (maybe about 20 MB?)
> and play with it directly.
> It would be great if I could drag a file into Zeppelin and register it
> as a table.
>
> Thanks :)
>
> On Mon, Jun 12, 2017 at 11:40 AM, Park Hoon <1am...@gmail.com> wrote:
>
>> Hi All,
>>
>> Recently, ZEPPELIN-753
>>  (Tabledata
>> abstraction) and ZEPPELIN-2020
>>  (Remote method
>> invocation for resources) were resolved.
>> Based on this work, we can improve Zeppelin with the following
>> enhancements:
>>
>> * register the table result as a shared resource
>> * list all available (registered) tables
>> * preview tables including their meta information (e.g. columns,
>> types, ...)
>> * download registered tables as CSV and other formats
>> * pivot/filter in the backend to transform larger data
>> * cross-join tables from different interpreters (e.g. the Spark
>> interpreter uses a table result generated by the JDBC interpreter)
>>
>> You can find the full proposal in Extending Table Data API
>> 
>> which was contributed by @1ambda, @khalidhuseynov, @Leemoonsoo.
>>
>> Any questions, feedback, or discussion are welcome.
>>
>>
>> Thanks.
>>
> --
> Taejun Kim
>
> Data Mining Lab.
> School of Electrical and Computer Engineering
> University of Seoul
>


Re: [DISCUSSION] Extending TableData API

2017-06-11 Thread Jun Kim
All of the enhancements look great to me!

And I wish for a feature to upload a small CSV file (maybe about 20 MB?)
and play with it directly.
It would be great if I could drag a file into Zeppelin and register it as
a table.

Thanks :)

On Mon, Jun 12, 2017 at 11:40 AM, Park Hoon <1am...@gmail.com> wrote:

> Hi All,
>
> Recently, ZEPPELIN-753
>  (Tabledata
> abstraction) and ZEPPELIN-2020
>  (Remote method
> invocation for resources) were resolved.
> Based on this work, we can improve Zeppelin with the following
> enhancements:
>
> * register the table result as a shared resource
> * list all available (registered) tables
> * preview tables including their meta information (e.g. columns, types,
> ...)
> * download registered tables as CSV and other formats
> * pivot/filter in the backend to transform larger data
> * cross-join tables from different interpreters (e.g. the Spark
> interpreter uses a table result generated by the JDBC interpreter)
>
> You can find the full proposal in Extending Table Data API
> 
> which was contributed by @1ambda, @khalidhuseynov, @Leemoonsoo.
>
> Any questions, feedback, or discussion are welcome.
>
>
> Thanks.
>
-- 
Taejun Kim

Data Mining Lab.
School of Electrical and Computer Engineering
University of Seoul


[DISCUSSION] Extending TableData API

2017-06-11 Thread Park Hoon
Hi All,

Recently, ZEPPELIN-753 
(Tabledata abstraction) and ZEPPELIN-2020
 (Remote method
invocation for resources) were resolved.
Based on this work, we can improve Zeppelin with the following enhancements:

* register the table result as a shared resource
* list all available (registered) tables (see the API sketch after this
list)
* preview tables including their meta information (e.g. columns, types, ...)
* download registered tables as CSV and other formats
* pivot/filter in the backend to transform larger data
* cross-join tables from different interpreters (e.g. the Spark interpreter
uses a table result generated by the JDBC interpreter)
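
To give a feeling for the list/preview/download items, the registry surface
might look roughly like this (hypothetical names for illustration, not part
of the proposal text):

import java.io.InputStream;
import java.util.List;

// Sketch of a registry covering the list / preview / download items
// above; none of these names are final.
interface TableRegistry {
  // every registered table name, for the "list" enhancement
  List<String> list();

  // schema only (columns and types), so previews stay cheap even for
  // large tables
  List<ColumnMeta> preview(String tableName);

  // streamed CSV download, so the server never holds the full table
  InputStream downloadAsCsv(String tableName);
}

// a column name plus its type, e.g. ("age", "long")
class ColumnMeta {
  final String name;
  final String type;

  ColumnMeta(String name, String type) {
    this.name = name;
    this.type = type;
  }
}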

You can find the full proposal in Extending Table Data API

which was contributed by @1ambda, @khalidhuseynov, @Leemoonsoo.

Any questions, feedback, or discussion are welcome.


Thanks.