[DISCUSSION] Extending TableData API

2017-06-11 Thread Park Hoon
Hi All,

Recently, ZEPPELIN-753 (Tabledata abstraction) and ZEPPELIN-2020 (Remote
method invocation for resources) were resolved.
Based on this work, we can improve Zeppelin with the following enhancements:

* register the table result as a shared resource
* list all available (registered) tables
* preview tables, including their meta information (e.g. columns, types, ...)
* download registered tables as CSV and other formats
* pivot/filter in the backend to transform larger data
* cross-join tables from different interpreters (e.g. the Spark interpreter
uses a table result generated by the JDBC interpreter)
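
To give a rough idea of the abstraction behind these items, here is a minimal
sketch. The interface and names below are illustrative only, not the actual
API from ZEPPELIN-753 or the proposal document.

import java.util.Iterator;
import java.util.List;

// Illustrative sketch only -- names and signatures are assumptions,
// not the actual ZEPPELIN-753 / proposal API.

/** Describes a single column of a table result (used for previews and listing). */
class ColumnDef {
  final String name;
  final String type;

  ColumnDef(String name, String type) {
    this.name = name;
    this.type = type;
  }
}

/** A table result that an interpreter could register as a shared resource. */
interface TableData {
  /** Column names and types, so tables can be listed and previewed. */
  List<ColumnDef> columns();

  /** Rows as an iterator, so large tables can be streamed rather than copied. */
  Iterator<Object[]> rows();

  /** Serialize the table for download, e.g. as CSV. */
  String toCsv();
}

With something along these lines, each interpreter can expose its result
without materializing the whole table inside the Zeppelin server.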

You can find the full proposal in Extending Table Data API, which was
contributed by @1ambda, @khalidhuseynov, and @Leemoonsoo.

Any questions, feedback, or discussion are welcome.


Thanks.


Re: [DISCUSSION] Extending TableData API

2017-06-11 Thread Jun Kim
All of the enhancements look great to me!

And I wish for a feature that can upload a small CSV file (maybe about
20MB?) and play with it directly.
It would be great if I could drag a file into Zeppelin and register it as a
table.

Thanks :)

On Mon, Jun 12, 2017 at 11:40 AM, Park Hoon <1am...@gmail.com> wrote:

> Hi All,
>
> Recently, ZEPPELIN-753 (Tabledata abstraction) and ZEPPELIN-2020 (Remote
> method invocation for resources) were resolved.
> Based on this work, we can improve Zeppelin with the following
> enhancements:
>
> * register the table result as a shared resource
> * list all available (registered) tables
> * preview tables, including their meta information (e.g. columns, types, ...)
> * download registered tables as CSV and other formats
> * pivot/filter in the backend to transform larger data
> * cross-join tables from different interpreters (e.g. the Spark interpreter
> uses a table result generated by the JDBC interpreter)
>
> You can find the full proposal in Extending Table Data API, which was
> contributed by @1ambda, @khalidhuseynov, and @Leemoonsoo.
>
> Any questions, feedback, or discussion are welcome.
>
>
> Thanks.
>
-- 
Taejun Kim

Data Mining Lab.
School of Electrical and Computer Engineering
University of Seoul


Re: [DISCUSSION] Extending TableData API

2017-06-12 Thread Andrea Santurbano
Hi guys,
this is great! I think this can also enable some drop-down feature between
tables in the UI...
Do you think these enhancements can also include the graph part?

Andrea

On Mon, Jun 12, 2017 at 05:47, Jun Kim wrote:

> All of the enhancements look great to me!
>
> And I wish for a feature that can upload a small CSV file (maybe about
> 20MB?) and play with it directly.
> It would be great if I could drag a file into Zeppelin and register it as a
> table.
>
> Thanks :)
>
> On Mon, Jun 12, 2017 at 11:40 AM, Park Hoon <1am...@gmail.com> wrote:
>
>> Hi All,
>>
>> Recently, ZEPPELIN-753 (Tabledata abstraction) and ZEPPELIN-2020 (Remote
>> method invocation for resources) were resolved.
>> Based on this work, we can improve Zeppelin with the following
>> enhancements:
>>
>> * register the table result as a shared resource
>> * list all available (registered) tables
>> * preview tables, including their meta information (e.g. columns, types, ...)
>> * download registered tables as CSV and other formats
>> * pivot/filter in the backend to transform larger data
>> * cross-join tables from different interpreters (e.g. the Spark interpreter
>> uses a table result generated by the JDBC interpreter)
>>
>> You can find the full proposal in Extending Table Data API, which was
>> contributed by @1ambda, @khalidhuseynov, and @Leemoonsoo.
>>
>> Any questions, feedback, or discussion are welcome.
>>
>>
>> Thanks.
>>
> --
> Taejun Kim
>
> Data Mining Lab.
> School of Electrical and Computer Engineering
> University of Seoul
>


Re: [DISCUSSION] Extending TableData API

2017-06-12 Thread Khalid Huseynov
Thanks for the questions guys!

@Jun Kim actually that feature was originally discussed and was put into the
backlog, since the proposal was more about tables processed by interpreters
and their sharing. However, having quick visualisation on the fly for
not-so-large data does make sense, and it could possibly be done by importing
the data into some interpreter by default (Spark, Python, etc.). So I believe
it can be done once the initial basics for resource sharing are completed.

@Andrea Santurbano there should be a listing of tables with schema info, but
I'm not sure exactly what you mean by a drop-down feature between tables in
the UI. Could you give a little more detail/an example on that, as well as
the enhancements on the graph part you meant?


On Mon, Jun 12, 2017 at 4:01 PM, Andrea Santurbano 
wrote:

> Hi guys,
> this is great! I think this can also enable some drop-down feature between
> tables in the UI...
> Do you think these enhancements can also include the graph part?
>
> Andrea
>
> On Mon, Jun 12, 2017 at 05:47, Jun Kim wrote:
>
>> All of the enhancements look great to me!
>>
>> And I wish for a feature that can upload a small CSV file (maybe about
>> 20MB?) and play with it directly.
>> It would be great if I could drag a file into Zeppelin and register it as a
>> table.
>>
>> Thanks :)
>>
>> On Mon, Jun 12, 2017 at 11:40 AM, Park Hoon <1am...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Recently, ZEPPELIN-753 (Tabledata abstraction) and ZEPPELIN-2020 (Remote
>>> method invocation for resources) were resolved.
>>> Based on this work, we can improve Zeppelin with the following
>>> enhancements:
>>>
>>> * register the table result as a shared resource
>>> * list all available (registered) tables
>>> * preview tables, including their meta information (e.g. columns, types, ...)
>>> * download registered tables as CSV and other formats
>>> * pivot/filter in the backend to transform larger data
>>> * cross-join tables from different interpreters (e.g. the Spark interpreter
>>> uses a table result generated by the JDBC interpreter)
>>>
>>> You can find the full proposal in Extending Table Data API, which was
>>> contributed by @1ambda, @khalidhuseynov, and @Leemoonsoo.
>>>
>>> Any questions, feedback, or discussion are welcome.
>>>
>>>
>>> Thanks.
>>>
>> --
>> Taejun Kim
>>
>> Data Mining Lab.
>> School of Electrical and Computer Engineering
>> University of Seoul
>>
>


Re: [DISCUSSION] Extending TableData API

2017-06-13 Thread Jeff Zhang
Hi Park,

Thanks for sharing. This is a very interesting and innovative idea. I
have several comments and concerns.

1. What does the resource registration mean?
   IIUC, currently it means it would cache the data in the interpreter process.
Then it might become a memory issue when more and more resources are
registered. Maybe we could introduce a resource retention mechanism or cache
the data in other formats (just like the Spark table cache policy, where the
user can specify how to cache the data, e.g. memory, disk, etc.).

2. The scope of resource sharing
   For now, it seems it is globally shared. But I think user-level sharing
might be more common. Then we need to create a namespace for each user.
That means the same resource name could exist in different user namespaces.

3. The data route might cause a performance issue.
   From the diagram, if the Spark interpreter needs to access a resource from
the JDBC interpreter, the data first needs to be sent to the Zeppelin server,
and then the Zeppelin server sends the data to the Spark interpreter. This
kind of data route introduces extra overhead, and the Zeppelin server will
become a bottleneck and require a large amount of memory when there are many
resources to be shared across users/interpreters. So I would suggest the
following approach: the Zeppelin server just controls the metadata and ACL of
resources, and the Spark interpreter fetches data from the JDBC interpreter
directly instead of through the Zeppelin server. Here's the sequence:
   1). SparkInterpreter asks for the metadata and a token for the resource.
   2). The Zeppelin server checks whether this SparkInterpreter has
permission to access this resource; if yes, it sends the metadata and
token to SparkInterpreter. The metadata includes the RPC address of the
JdbcInterpreter, and the token is for security.
   3). SparkInterpreter asks JdbcInterpreter for the resource using the
token and metadata received in step 2.
   4). JdbcInterpreter verifies the token and sends the data to
SparkInterpreter.
[image: image.png]
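
To make the sequence concrete, here is a rough sketch. None of these classes
or methods exist in Zeppelin today; they only illustrate the flow I am
suggesting.

// Rough sketch of the direct-fetch sequence above; all names are hypothetical.

class ResourceTicket {
  final String rpcAddress;  // where the owning interpreter (e.g. JDBC) listens
  final String token;       // short-lived credential issued by the Zeppelin server

  ResourceTicket(String rpcAddress, String token) {
    this.rpcAddress = rpcAddress;
    this.token = token;
  }
}

interface ZeppelinServerClient {
  /** Steps 1-2: the server checks the ACL, then returns the owner's RPC address and a token. */
  ResourceTicket requestAccess(String resourceName, String requestingInterpreter);
}

interface InterpreterRpcClient {
  /** Steps 3-4: the owning interpreter verifies the token and streams the table back. */
  byte[] fetchResource(String resourceName, String token);
}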


On Tue, Jun 13, 2017 at 11:53 AM, Khalid Huseynov wrote:

> Thanks for the questions guys!
>
> @Jun Kim actually that feature was originally discussed and was put into the
> backlog, since the proposal was more about tables processed by interpreters
> and their sharing. However, having quick visualisation on the fly for
> not-so-large data does make sense, and it could possibly be done by importing
> the data into some interpreter by default (Spark, Python, etc.). So I believe
> it can be done once the initial basics for resource sharing are completed.
>
> @Andrea Santurbano there should be a listing of tables with schema info, but
> I'm not sure exactly what you mean by a drop-down feature between tables in
> the UI. Could you give a little more detail/an example on that, as well as
> the enhancements on the graph part you meant?
>
>
> On Mon, Jun 12, 2017 at 4:01 PM, Andrea Santurbano 
> wrote:
>
>> Hi guys,
>> this is great! I think this can also enable some drop-down feature
>> between tables in the UI...
>> Do you think these enhancements can also include the graph part?
>>
>> Andrea
>>
>> On Mon, Jun 12, 2017 at 05:47, Jun Kim wrote:
>>
>>> All of the enhancements look great to me!
>>>
>>> And I wish for a feature that can upload a small CSV file (maybe about
>>> 20MB?) and play with it directly.
>>> It would be great if I could drag a file into Zeppelin and register it as
>>> a table.
>>>
>>> Thanks :)
>>>
>>> On Mon, Jun 12, 2017 at 11:40 AM, Park Hoon <1am...@gmail.com> wrote:
>>>
 Hi All,

 Recently, ZEPPELIN-753 (Tabledata abstraction) and ZEPPELIN-2020 (Remote
 method invocation for resources) were resolved.
 Based on this work, we can improve Zeppelin with the following
 enhancements:

 * register the table result as a shared resource
 * list all available (registered) tables
 * preview tables, including their meta information (e.g. columns, types, ...)
 * download registered tables as CSV and other formats
 * pivot/filter in the backend to transform larger data
 * cross-join tables from different interpreters (e.g. the Spark interpreter
 uses a table result generated by the JDBC interpreter)

 You can find the full proposal in Extending Table Data API, which was
 contributed by @1ambda, @khalidhuseynov, and @Leemoonsoo.

 Any questions, feedback, or discussion are welcome.


 Thanks.

>>> --
>>> Taejun Kim
>>>
>>> Data Mining Lab.
>>> School of Electrical and Computer Engineering
>>> University of Seoul
>>>
>>
>


Re: [DISCUSSION] Extending TableData API

2017-06-14 Thread Park Hoon
 @Jeff, Thanks for sharing your opinions and important questions.


> Q1. What does the resource registration mean? IIUC, currently it means it
would cache the data in the interpreter process. Then it might become a
memory issue when more and more resources are registered. Maybe we could
introduce a resource retention mechanism or cache the data in other formats
(just like the Spark table cache policy, where the user can specify how to
cache the data, e.g. memory, disk, etc.)

A1. It depends on the implementation of TableData for each interpreter. For
example,

if the JDBC interpreter only keeps the SQL of a paragraph to reproduce the
table, we don’t need to persist the whole table data in memory, a file
system, or an external storage. That’s what section 3.2 describes.

[image: Inline image 2]
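
As a sketch of that idea (the class name and fields below are illustrative
assumptions, not the proposal's actual design), a JDBC-backed TableData could
hold only the connection info and the query, and re-run it on demand:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative sketch: a TableData that stores only what is needed to
// reproduce the table (the JDBC URL and the SQL), not the rows themselves.
class QueryBackedTableData {
  private final String jdbcUrl;
  private final String sql;

  QueryBackedTableData(String jdbcUrl, String sql) {
    this.jdbcUrl = jdbcUrl;
    this.sql = sql;
  }

  /** Reproduce the table by re-running the stored query; nothing is persisted here.
      (A real implementation would manage/close the connection and result set.) */
  ResultSet rows() throws SQLException {
    Connection conn = DriverManager.getConnection(jdbcUrl);
    Statement stmt = conn.createStatement();
    return stmt.executeQuery(sql);
  }
}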




> Q2. The scope of resource sharing. For now, it seems it is globally
shared. But I think user-level sharing might be more common. Then we need
to create a namespace for each user. That means the same resource name
could exist in different user namespaces.

A2. Regarding the namespace concept, the proposal only describes what the
table resource name should be (Section 5.3), not namespaces.

The namespace could be the name of a note or something custom (e.g. creating
per-user namespaces). We can discuss this.

Personally, +1 for having namespaces because they are helpful for searching
and sharing. This might be handled by `ResourceRegistry`.


[image: Inline image 1]
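
For example, `ResourceRegistry` could qualify resource names with a namespace.
The sketch below is purely illustrative; these method names are assumptions,
not an existing Zeppelin API.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: a registry that qualifies resource names with a
// namespace (a note id or a user name).
class NamespacedResourceRegistry {
  // qualified name ("namespace/table") -> registered table resource
  private final Map<String, Object> resources = new ConcurrentHashMap<>();

  /** Register a table under a namespace, e.g. "user:moon" or "note:my_note". */
  void register(String namespace, String tableName, Object tableData) {
    resources.put(namespace + "/" + tableName, tableData);
  }

  /** Qualified names make listing and searching by namespace straightforward. */
  Set<String> list() {
    return resources.keySet();
  }

  Object find(String namespace, String tableName) {
    return resources.get(namespace + "/" + tableName);
  }
}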


> Q3. The data route might cause a performance issue. From the diagram, if
the Spark interpreter needs to access a resource from the JDBC interpreter,
the data first needs to be sent to the Zeppelin server, and then the Zeppelin
server sends the data to the Spark interpreter. This kind of data route
introduces extra overhead, and the Zeppelin server will become a bottleneck
and require a large amount of memory when there are many resources to be
shared across users/interpreters. So I would suggest the following approach:
the Zeppelin server just controls the metadata and ACL of resources, and the
Spark interpreter fetches data from the JDBC interpreter directly instead of
through the Zeppelin server. Here's the sequence:
   1). SparkInterpreter asks for the metadata and a token for the resource.
   2). The Zeppelin server checks whether this SparkInterpreter has
permission to access this resource; if yes, it sends the metadata and
token to SparkInterpreter. The metadata includes the RPC address of the
JdbcInterpreter, and the token is for security.
   3). SparkInterpreter asks JdbcInterpreter for the resource using the
token and metadata received in step 2.
   4). JdbcInterpreter verifies the token and sends the data to
SparkInterpreter.

A3. +1 for the Spark interpreter accessing JDBC directly, since it’s better
for large data handling. But I’m not sure how other interpreters can do the
same thing (e.g. trivial, but let’s think about the shell interpreter, which
keeps its table data in memory).
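
As a sketch of that concern (illustrative names only, not an existing API), an
interpreter like shell holds its table purely in its own JVM memory, so for
direct fetch it would have to serve the bytes itself when another interpreter
asks:

import java.nio.charset.StandardCharsets;
import java.util.List;

// Illustrative sketch only: a table result that lives entirely in the
// interpreter process memory and must be served over RPC on a fetch request.
class InMemoryTableData {
  private final List<String> tsvRows;  // rows already materialized in the interpreter process

  InMemoryTableData(List<String> tsvRows) {
    this.tsvRows = tsvRows;
  }

  /** What the owning interpreter would send back when another interpreter fetches it. */
  byte[] serialize() {
    return String.join("\n", tsvRows).getBytes(StandardCharsets.UTF_8);
  }
}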


--

Some people might wonder why we do not use external storage to persist
(large) table resources instead of keeping them in the memory of ZeppelinServer.

The authors originally discussed whether to have an external storage or not.
But having external storage

- requires additional (lots of) dependencies (Geode? Redis? HDFS? Which
one should we use? Or support all of them?)
- even with external storage, we might not be able to persist 400GB or 10TB.

Thus, the proposal was written to

- utilize the interpreter’s own storage (e.g. the Spark cluster for the Spark
interpreter)
- keep the minimal things needed to reproduce the table result (e.g. keeping
only the query) without relying on external storage, at least at first.


And now that we are discussing it, I hope we can improve the proposal and
turn it into a real implementation soon. :)



Thanks.




On Wed, Jun 14, 2017 at 12:20 PM, Jeff Zhang  wrote:

>
> Hi Park,
>
> Thanks for sharing. This is a very interesting and innovative idea. I
> have several comments and concerns.
>
> 1. What does the resource registration mean?
>    IIUC, currently it means it would cache the data in the interpreter
> process. Then it might become a memory issue when more and more resources
> are registered. Maybe we could introduce a resource retention mechanism or
> cache the data in other formats (just like the Spark table cache policy,
> where the user can specify how to cache the data, e.g. memory, disk, etc.)
>
> 2. The scope of resource sharing
>    For now, it seems it is globally shared. But I think user-level sharing
> might be more common. Then we need to create a namespace for each user.
> That means the same resource name could exist in different user namespaces.
>
> 3. The data route might cause a performance issue.
>    From the diagram, if the Spark interpreter needs to access a resource
> from the JDBC interpreter, the data first needs to be sent to the Zeppelin
> server, and then the Zeppelin server sends the data to the Spark interpreter.
> This kind of data route introduces extra overhead, and the Zeppelin server
> will become a bottleneck

Re: [DISCUSSION] Extending TableData API

2017-06-14 Thread Jeff Zhang
>>> But I’m not sure how other interpreters can do the same thing (e.g.
trivial, but let’s think about the shell interpreter, which keeps its table
data in memory).

The approach I proposed is general to all the interpreters. What we need to
do is add one method to RemoteInterpreterProcess for other interpreters to
fetch resources.
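
As a sketch, that single method could look like the following.
RemoteInterpreterProcess is a real class, but the method name, signature, and
types below are only illustrative assumptions, not an existing Zeppelin API.

import java.nio.ByteBuffer;

// Illustrative sketch only: a possible resource-fetch hook on the interpreter
// process abstraction.
interface ResourceFetcher {
  /**
   * Fetch a registered resource directly from the interpreter process that
   * owns it, so the data itself never has to pass through the Zeppelin server.
   */
  ByteBuffer fetchResource(String resourceName, String token);
}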

>>> Some people might wonder why we do not use external storage to persist
(large) table resources instead of keeping them in the memory of ZeppelinServer.

It is fine to use memory for now. But we should leave an interface there
for other storage backends. For now we could just have MemoryStorage, and we
could have other implementations in the future.
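
For example, a minimal storage abstraction could look like the sketch below.
MemoryStorage is the name I mentioned, but the interface and signatures are
just illustrative assumptions, not an agreed design.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: a pluggable storage interface with an in-memory default.
interface ResourceStorage {
  void put(String name, byte[] data);
  byte[] get(String name);
  void remove(String name);
}

class MemoryStorage implements ResourceStorage {
  private final Map<String, byte[]> store = new ConcurrentHashMap<>();

  @Override public void put(String name, byte[] data) { store.put(name, data); }
  @Override public byte[] get(String name) { return store.get(name); }
  @Override public void remove(String name) { store.remove(name); }
}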


On Wed, Jun 14, 2017 at 10:22 PM, Park Hoon <1am...@gmail.com> wrote:

> @Jeff, Thanks for sharing your opinions and important questions.
>
>
> > Q1. What does the resource registration mean? IIUC, currently it means
> it would cache the data in the interpreter process. Then it might become a
> memory issue when more and more resources are registered. Maybe we could
> introduce a resource retention mechanism or cache the data in other formats
> (just like the Spark table cache policy, where the user can specify how to
> cache the data, e.g. memory, disk, etc.
>
> A1. It depends on the implementation of TableData for each interpreter.
> For example,
>
> if the JDBC interpreter only keeps the SQL of a paragraph to reproduce the
> table, we don’t need to persist the whole table data in memory, a file
> system, or an external storage. That’s what section 3.2 describes.
>
> [image: Inline image 2]
>
>
>
>
> > Q2. The scope of resource sharing. For now, it seems it is globally
> shared. But I think user-level sharing might be more common. Then we need
> to create a namespace for each user. That means the same resource name
> could exist in different user namespaces.
>
> A2. Regarding the namespace concept, the proposal only describes what the
> table resource name should be (Section 5.3), not namespaces.
>
> The namespace could be the name of a note or something custom (e.g. creating
> per-user namespaces). We can discuss this.
>
> Personally, +1 for having namespaces because they are helpful for searching
> and sharing. This might be handled by `ResourceRegistry`.
>
>
> [image: Inline image 1]
>
>
> > Q3. The data route might cause a performance issue. From the diagram, if
> the Spark interpreter needs to access a resource from the JDBC interpreter,
> the data first needs to be sent to the Zeppelin server, and then the Zeppelin
> server sends the data to the Spark interpreter. This kind of data route
> introduces extra overhead, and the Zeppelin server will become a bottleneck
> and require a large amount of memory when there are many resources to be
> shared across users/interpreters. So I would suggest the following approach:
> the Zeppelin server just controls the metadata and ACL of resources, and the
> Spark interpreter fetches data from the JDBC interpreter directly instead of
> through the Zeppelin server. Here's the sequence:
>    1). SparkInterpreter asks for the metadata and a token for the resource.
>    2). The Zeppelin server checks whether this SparkInterpreter has
> permission to access this resource; if yes, it sends the metadata and
> token to SparkInterpreter. The metadata includes the RPC address of the
> JdbcInterpreter, and the token is for security.
>    3). SparkInterpreter asks JdbcInterpreter for the resource using the
> token and metadata received in step 2.
>    4). JdbcInterpreter verifies the token and sends the data to
> SparkInterpreter.
>
> A3. +1 for the Spark interpreter accessing JDBC directly, since it’s better
> for large data handling. But I’m not sure how other interpreters can do the
> same thing (e.g. trivial, but let’s think about the shell interpreter, which
> keeps its table data in memory).
>
>
> --
>
> Some people might wonder why we do not use external storage to persist
> (large) table resources instead of keeping them in the memory of ZeppelinServer.
>
> The authors originally discussed whether to have an external storage or
> not. But having external storage
>
> - requires additional (lots of) dependencies (Geode? Redis? HDFS? Which
> one should we use? Or support all of them?)
> - even with external storage, we might not be able to persist 400GB or 10TB.
>
> Thus, the proposal was written to
>
> - utilize the interpreter’s own storage (e.g. the Spark cluster for the
> Spark interpreter)
> - keep the minimal things needed to reproduce the table result (e.g. keeping
> only the query) without relying on external storage, at least at first.
>
>
> And now that we are discussing it, I hope we can improve the proposal and
> turn it into a real implementation soon. :)
>
>
>
> Thanks.
>
>
>
>
> On Wed, Jun 14, 2017 at 12:20 PM, Jeff Zhang  wrote:
>
>>
>> Hi Park,
>>
>> Thanks for sharing. This is a very interesting and innovative idea. I
>> have several comments and concerns.
>>
>> 1. What does the resource registration mean?
>>    IIUC, currently it means it would cache the data in the interpreter
>> process. Then it might be a memory iss