[jira] [Commented] (SQOOP-365) Proposal for next major revision of Sqoop.

Arvind Prabhakar (Commented) (JIRA) Wed, 19 Oct 2011 12:01:34 -0700

    [ 
https://issues.apache.org/jira/browse/SQOOP-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130878#comment-13130878
 ]


Arvind Prabhakar commented on SQOOP-365:
----------------------------------------

Thanks for the feedback Aaron. Here are my thoughts on the questions you raised:

* *Ease of Deployment*:
** _Will Sqoop have an embedded server to support localhost execution?_ This is 
certainly a possibility. I do feel that it is more of a packaging concern than 
a design concern, though. For example, BigTop should be able to easily package 
the system with, say, Tomcat or other middleware. 
** _...will Sqoop still support the ability to use the existing 'ad hoc' 
connection mechanism?_ Absolutely. The idea of connections as first class 
objects is a prerequisite for tighter security. But nothing stops a user from 
deploying Sqoop in a mode where security is not enabled, or where the operator 
has admin privileges as well.

* *Command Line Backward compatibility*: _Will Sqoop2 be backwards-compatible 
with these arguments?_ My inclination is to not be backward compatible, in 
order to control overall implementation complexity. To a certain degree we 
could preserve the same command line interface if that is a critical 
requirement, so that you would be able to run with most standard Sqoop 1.x 
command line options. But some options are interpreted differently by 
different connectors, so to be fully backward compatible the connectors too 
would have to be backward compatible and respect the same semantics as before. 
There is no harm in doing so, except that the added burden on the new 
implementation will slow its progress towards further improvement. I would 
prefer that Sqoop 2 not be required to be backward compatible with the current 
implementation, as long as there is an easy migration path from the previous 
system to the new one.

* *Metadata store*: _How and where does Sqoop store information about 
Connections, resource limits, etc?_ Even though the writeup does not talk about 
this, I imagine having a pluggable store interface that is backed by an 
embedded Derby database. Making the store pluggable will allow Sqoop to 
integrate with HCatalog once HCatalog is ready for production.
** _How, if at all, do we guard against end-users starting a second Sqoop 
server to get around resource limits?_ We should provide an implementation that 
uses the metadata store to manage resource limits etc. As you point out, this 
is easy to bypass if the user has access to connection information: they can 
set up a new instance pointing to a different metastore that does not enforce 
these restrictions. But that is no different from abusing the resources outside 
of Sqoop by directly running programs/sessions against the database. Such 
use/abuse is beyond the scope of the security implementation in Sqoop IMO.
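
As a rough illustration of the metadata-store points above, here is a minimal Java sketch. Every name in it (`MetadataStore`, `InMemoryMetadataStore`, `ResourceLimiter`, the key layout) is hypothetical and not taken from the proposal. An in-memory map stands in for the embedded Derby database, and the code is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical pluggable store interface; the proposal does not define one.
interface MetadataStore {
    void put(String key, String value);
    Optional<String> get(String key);
}

// In-memory stand-in for an embedded Derby-backed implementation.
class InMemoryMetadataStore implements MetadataStore {
    private final Map<String, String> data = new HashMap<>();

    public void put(String key, String value) {
        data.put(key, value);
    }

    public Optional<String> get(String key) {
        return Optional.ofNullable(data.get(key));
    }
}

// Tracks a per-connection cap on running jobs using only the store.
class ResourceLimiter {
    private final MetadataStore store;

    ResourceLimiter(MetadataStore store) {
        this.store = store;
    }

    void setMaxJobs(String connection, int max) {
        store.put("limit." + connection, Integer.toString(max));
    }

    // Returns true and records the job if the connection is under its limit.
    boolean tryStartJob(String connection) {
        int max = store.get("limit." + connection)
                .map(Integer::parseInt).orElse(Integer.MAX_VALUE);
        int running = store.get("running." + connection)
                .map(Integer::parseInt).orElse(0);
        if (running >= max) {
            return false;
        }
        store.put("running." + connection, Integer.toString(running + 1));
        return true;
    }
}
```

Because the limit lives entirely in the store, a `ResourceLimiter` built on a fresh `InMemoryMetadataStore` sees no limit at all. That is the bypass described above: a second instance pointing at a different metastore.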


bq. I also don't believe that it's productive for the command-line client to 
use the REST API directly. Starting a server (even on localhost) as a pre-req 
for running a command-line tool seems overly complicated to me.

I agree that there are differences in how one uses a tool vs how one uses a 
service. Services have the added burden of being managed and monitored, whereas 
tools are usually controlled entirely by the user. But once the service is 
started/available, using it becomes far easier than using a tool. The 
end-user does not have to worry about classpath details or making sure that 
they have the correct drivers installed. The client provides a thin facade to 
access the service, can be run from anywhere, and on its own does not require 
any management. This generally scales much better than a heavy client that 
requires individual installations to be managed.


bq. I think a better architecture may be to define a number of Operations 
internally. Each Operation can have a programmatic (Java) API that executes it. 
Each Operation can also be bound to a REST API endpoint. But this way a user 
can still simply run the command-line application without configuring an entire 
server. The command-line app would run the Operation directly, as opposed to 
running it in the address space of a separate process somewhere. This would 
reduce the number of layers of complexity when debugging what goes wrong. 
Involving the network (even loopback) where none is needed seems like asking 
for trouble.

I think that under the covers most of the Sqoop 2 logic will indeed be 
implemented as operations that can be invoked without a web-based service, for 
testing purposes. The difference is that the packaged system will not work that 
way; it will be wired to work in a service model. Testability is certainly a 
core requirement for any system, and any implementation that does not lend 
itself to this is deficient. Given that, I don't think debugging would be much 
more difficult than debugging, say, MR applications is right now.
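
To make the "operations that can also be invoked without the service" idea concrete, here is a minimal Java sketch. The `Operation` interface, `ListTablesOperation`, `ServiceBindings` and the endpoint path are all hypothetical names chosen for illustration. `ServiceBindings` only models the routing step; a real server would put HTTP handling in front of it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical unit of work: takes named parameters, returns a result message.
interface Operation {
    String execute(Map<String, String> params);
}

// Example operation; a real one would talk to the database via a connector.
class ListTablesOperation implements Operation {
    public String execute(Map<String, String> params) {
        String connection = params.getOrDefault("connection", "default");
        return "tables for " + connection;
    }
}

// Stand-in for the service layer: maps endpoint paths to operations.
class ServiceBindings {
    private final Map<String, Operation> endpoints = new HashMap<>();

    void bind(String path, Operation operation) {
        endpoints.put(path, operation);
    }

    String handle(String path, Map<String, String> params) {
        Operation operation = endpoints.get(path);
        if (operation == null) {
            throw new IllegalArgumentException("no operation bound to " + path);
        }
        return operation.execute(params);
    }
}
```

A test can call `new ListTablesOperation().execute(params)` directly, while the packaged system would reach the same object through `ServiceBindings.handle(...)`. Both paths run the same logic, so the operation itself stays testable in isolation.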

bq. Finally, on the front of API compatibility: Arvind, in an offline 
discussion, we talked about having a separate API package of interfaces that 
would have "api level" versioning (a la the Servlet API) that is distinct from 
the implementation version. Is that still part of your vision for Sqoop 2? I 
don't see it described in this proposal.

Thanks for pointing that out - yes it is. For those who were not part of our 
offline discussion, the summary is that Sqoop 2 will expose a versioned REST API 
that automatically binds to different clients. So technically you could 
upgrade Sqoop 2 to Sqoop 2.5 etc., which may have a new API, and the old clients 
will continue to work as is. The only caveat is that we may not be able to 
retrofit it to support Sqoop 1.x, based on the discussion above.
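
One way to picture API-level versioning is a dispatcher that keeps one handler per API version, so that adding a newer version leaves the older ones in place. The Java sketch below assumes integer version numbers and string requests purely for illustration. `VersionedApi` and its methods are hypothetical and do not describe the actual Sqoop 2 REST layer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher that keeps one request handler per API version.
class VersionedApi {
    private final Map<Integer, Function<String, String>> handlers = new HashMap<>();

    // Registering a new version does not touch handlers for older versions.
    void register(int apiVersion, Function<String, String> handler) {
        handlers.put(apiVersion, handler);
    }

    // Routes the request to the handler for the version the client speaks.
    String handle(int clientApiVersion, String request) {
        Function<String, String> handler = handlers.get(clientApiVersion);
        if (handler == null) {
            throw new IllegalArgumentException(
                    "unsupported API version " + clientApiVersion);
        }
        return handler.apply(request);
    }
}
```

In this model, a server upgrade amounts to an extra `register(2, ...)` call. A client that still sends version 1 requests keeps reaching the version 1 handler unchanged.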

bq. I looked through the proposed source layout for this. Without a README 
specifying what goes in which directories, it's hard for me to understand what 
you're trying to accomplish. What's the "infra" project for?

The infra project would be the Sqoop infrastructure. We could name it "core" or 
"arch" or other commonly used names. The purpose of this project is to be able 
to define the core system architecture and design which gets used by other 
modules where necessary.

bq. I think based on what I said above about Operations, etc, there should be a 
"libsqoop" project that corresponds to the guts of the project. The "server" 
should just be a REST API implementation (perhaps w/ an embedded Jetty server, 
but also perhaps deployable as a WAR on a fully-administered Tomcat instance) 
that embeds libsqoop to perform the Operations. And the client, similarly, is a 
thin command-line-arg parsing shell that embeds libsqoop to perform Operations 
directly.

I believe the infra module is what we are talking about here. I am hesitant to 
give it a name that suggests it is a library since there will be a bit of logic 
in dealing with extensions, job lifecycle, and other operational details which 
actively define the overall functioning of the system. Effectively though, it 
will still do the same thing as what you have suggested for libsqoop.

bq. Is infra ~= libsqoop in this idea? Or is that about independent testing of 
connectors, etc?

Yes - infra ~ libsqoop.

bq. I think there should also be a plugin-api library (libsqoopapi?) which the 
connector/*/ projects link against, rather than libsqoop itself. This API would 
also be used by third-party SqoopTool implementations.

Good suggestion - we can have a separate module for the Sqoop extension API. It 
probably belongs in the connection/api module.
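
The module split discussed here could look roughly like the Java sketch below: connectors compile against a small API interface, and the core resolves them through a registry. The names `SqoopConnector`, `MySqlConnector` and `ConnectorRegistry`, and the idea of matching on a JDBC URL prefix, are assumptions made for this example rather than anything specified in the proposal.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Would live in the extension API module (hypothetical interface).
interface SqoopConnector {
    String name();
    boolean supports(String jdbcUrl);
}

// Would live in a connector module, compiled against the API module only.
class MySqlConnector implements SqoopConnector {
    public String name() {
        return "mysql";
    }

    public boolean supports(String jdbcUrl) {
        return jdbcUrl.startsWith("jdbc:mysql:");
    }
}

// Would live in the core ("infra") module; it sees connectors only via the API.
class ConnectorRegistry {
    private final Map<String, SqoopConnector> connectors = new LinkedHashMap<>();

    void register(SqoopConnector connector) {
        connectors.put(connector.name(), connector);
    }

    // Returns the first registered connector that accepts the URL.
    SqoopConnector resolve(String jdbcUrl) {
        for (SqoopConnector connector : connectors.values()) {
            if (connector.supports(jdbcUrl)) {
                return connector;
            }
        }
        throw new IllegalArgumentException("no connector for " + jdbcUrl);
    }
}
```

The point of the split is that `MySqlConnector` never references `ConnectorRegistry` or any other core class, so third-party connectors could be built and tested against the API module alone.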
                
> Proposal for next major revision of Sqoop.
> ------------------------------------------
>
>                 Key: SQOOP-365
>                 URL: https://issues.apache.org/jira/browse/SQOOP-365
>             Project: Sqoop
>          Issue Type: Wish
>            Reporter: Arvind Prabhakar
>            Assignee: Arvind Prabhakar
>         Attachments: sqoop2.tar.gz
>
>
> This issue tracks the design and development of the next major revision of 
> Sqoop. The proposal has been articulated on the wiki at the following 
> location:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2
> Please review the proposal and add your comments to this JIRA. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
