[
https://issues.apache.org/jira/browse/SQOOP-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130878#comment-13130878
]
Arvind Prabhakar commented on SQOOP-365:
----------------------------------------
Thanks for the feedback Aaron. Here are my thoughts on the questions you raised:
* *Ease of Deployment*:
** _Will Sqoop have an embedded server to support localhost execution?_ This is
certainly a possibility. I do feel that it is more of a packaging concern than
a design concern, though. For example, BigTop should be able to easily package
the system with, say, Tomcat or other middleware.
** _...will Sqoop still support the ability to use the existing 'ad hoc'
connection mechanism?_ Absolutely. The idea of connections as first class
objects is a prerequisite for tighter security. But nothing stops a user from
deploying Sqoop in a mode where security is not enabled, or where the operator
has admin privileges as well.
* *Command Line Backward compatibility*: _Will Sqoop2 be backwards-compatible
with these arguments?_ My inclination is to not be backward compatible, in
order to keep overall implementation complexity under control. To a certain
degree, we could preserve the same command line interface if that is a critical
requirement, so you would be able to run with most standard Sqoop 1.x command
line options. But some options are interpreted differently by different
connectors, and in order to be fully backward compatible, the connectors too
would have to be backward compatible and respect the same semantics they did
before. There is no harm in doing so, except that the added burden on the new
implementation will slow its progress towards further improvement. I would
prefer that Sqoop 2 not be required to be backward compatible with the current
implementation, as long as there is an easy migration path from the previous
system to the new one.
* *Metadata store*: _How and where does Sqoop store information about
Connections, resource limits, etc?_ Even though the writeup does not talk about
this, I imagine having a pluggable store interface whose default backing is an
embedded Derby database. Keeping the interface pluggable will allow Sqoop to
integrate with HCatalog when it is ready for production.
** _How, if at all, do we guard against end-users starting a second Sqoop
server to get around resource limits?_ We should provide an implementation that
uses the metadata store to manage resource limits etc. This, as you point out,
is easy to bypass if the user has access to connection information: they can
set up a new instance pointing to a different metastore that does not enforce
these restrictions. But that is no different from abusing the resources outside
of Sqoop by directly running programs/sessions against the database. Such
use/abuse is beyond the scope of the security implementation in Sqoop IMO.
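As a rough illustration of the pluggable store idea above, a minimal Java
sketch might look like the following. Every type, method and field name here
(MetadataStore, ConnectionInfo, ResourceLimiter, maxConcurrentJobs) is
hypothetical and not taken from the proposal, and an in-memory map stands in
for the embedded Derby database so that the example runs on its own.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical connection record; the field names are illustrative only. */
final class ConnectionInfo {
    final String name;
    final String jdbcUrl;
    final int maxConcurrentJobs; // one example of a resource limit

    ConnectionInfo(String name, String jdbcUrl, int maxConcurrentJobs) {
        this.name = name;
        this.jdbcUrl = jdbcUrl;
        this.maxConcurrentJobs = maxConcurrentJobs;
    }
}

/** Hypothetical pluggable store; a Derby- or HCatalog-backed class could implement it. */
interface MetadataStore {
    void saveConnection(ConnectionInfo connection);
    Optional<ConnectionInfo> findConnection(String name);
}

/** In-memory stand-in so the sketch runs without a database. */
final class InMemoryMetadataStore implements MetadataStore {
    private final Map<String, ConnectionInfo> connections = new ConcurrentHashMap<>();

    @Override
    public void saveConnection(ConnectionInfo connection) {
        connections.put(connection.name, connection);
    }

    @Override
    public Optional<ConnectionInfo> findConnection(String name) {
        return Optional.ofNullable(connections.get(name));
    }
}

/** Enforces the per-connection job limit recorded in whichever store is plugged in. */
final class ResourceLimiter {
    private final MetadataStore store;
    private final Map<String, Integer> runningJobs = new ConcurrentHashMap<>();

    ResourceLimiter(MetadataStore store) {
        this.store = store;
    }

    /** Returns true and counts the job if the connection exists and is under its limit. */
    synchronized boolean tryStartJob(String connectionName) {
        Optional<ConnectionInfo> connection = store.findConnection(connectionName);
        if (!connection.isPresent()) {
            return false; // unknown connections never get to run
        }
        int running = runningJobs.getOrDefault(connectionName, 0);
        if (running >= connection.get().maxConcurrentJobs) {
            return false; // limit reached
        }
        runningJobs.put(connectionName, running + 1);
        return true;
    }
}
```

Because the limiter only sees the MetadataStore interface, swapping the backing
store would not change the enforcement logic. It also shows why the limit only
holds for servers that share the same store, which is the bypass discussed
above.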
bq. I also don't believe that it's productive for the command-line client to
use the REST API directly. Starting a server (even on localhost) as a pre-req
for running a command-line tool seems overly complicated to me.
I agree that there are differences in how one uses a tool vs how one uses a
service. Services have the added burden of being managed and monitored, whereas
tools are usually controlled entirely by the user. Once the service is
started/available, though, using a service becomes far easier than using a
tool. The end-user does not have to worry about classpath details or making
sure that they have the correct drivers installed. The client provides a thin
facade to access the service, can be run from anywhere, and on its own does not
require any management. This generally scales much better than a heavy client
that requires individual installations to be managed.
bq. I think a better architecture may be to define a number of Operations
internally. Each Operation can have a programmatic (Java) API that executes it.
Each Operation can also be bound to a REST API endpoint. But this way a user
can still simply run the command-line application without configuring an entire
server. The command-line app would run the Operation directly, as opposed to
running it in the address space of a separate process somewhere. This would
reduce the number of layers of complexity when debugging what goes wrong.
Involving the network (even loopback) where none is needed seems like asking
for trouble.
I think that under the covers the logic of most of Sqoop 2 will indeed be
implemented as operations that can be invoked, for testing purposes, without
needing a web based service. The difference is that it won't be that way for
the packaged system, which will be wired to work in a service model. Testability
is certainly a core requirement for any system, and any implementation that
does not lend itself to this is deficient. Given that, I don't think debugging
would be that much more difficult than, say, debugging MR applications is right
now.
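To make the "operations invoked with or without a service" idea concrete, here
is a minimal Java sketch under stated assumptions. The names Operation,
OperationRegistry, ServiceFrontEnd and the "list-tables" operation are all
hypothetical, and the front end is only a stand-in for a REST binding: it maps
a request path onto an operation but does not start any HTTP server.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical unit of work exposed through a plain Java API. */
interface Operation {
    String name();
    String execute(Map<String, String> args);
}

/** Example operation; a real one would talk to the database instead of echoing. */
final class ListTablesOperation implements Operation {
    @Override
    public String name() {
        return "list-tables";
    }

    @Override
    public String execute(Map<String, String> args) {
        return "tables for " + args.getOrDefault("connection", "<none>");
    }
}

/** Registry shared by every front end (tests, command line, service). */
final class OperationRegistry {
    private final Map<String, Operation> operations = new HashMap<>();

    void register(Operation operation) {
        operations.put(operation.name(), operation);
    }

    Operation lookup(String name) {
        Operation operation = operations.get(name);
        if (operation == null) {
            throw new IllegalArgumentException("Unknown operation: " + name);
        }
        return operation;
    }
}

/** Stand-in for a REST binding: resolves the last path segment to an operation. */
final class ServiceFrontEnd {
    private final OperationRegistry registry;

    ServiceFrontEnd(OperationRegistry registry) {
        this.registry = registry;
    }

    String handle(String path, Map<String, String> params) {
        // "/operations/list-tables" -> "list-tables"
        String name = path.substring(path.lastIndexOf('/') + 1);
        return registry.lookup(name).execute(params);
    }
}
```

A test can call registry.lookup("list-tables").execute(...) directly, while the
packaged system would route the same call through the service front end. Both
paths run the same operation code, which is what keeps the logic testable
without a running server.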
bq. Finally, on the front of API compatibility: Arvind, in an offline
discussion, we talked about having a separate API package of interfaces that
would have "api level" versioning (a la the Servlet API) that is distinct from
the implementation version. Is that still part of your vision for Sqoop 2? I
don't see it described in this proposal.
Thanks for pointing that out - yes it is. For those who were not part of our
offline discussion - the summary is that Sqoop 2 will expose a versioned REST
API that would automatically bind to different clients. So technically you
could upgrade Sqoop 2 to Sqoop 2.5 etc., which may have a new API, but the old
clients will continue to work as is. The only caveat is that we may not be able
to retrofit it to support Sqoop 1.x, based on the discussion above.
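One way such API-level versioning could work is sketched below in Java, purely
as an assumption: the VersionedApi class, the "/v1/..." path convention and the
handler signature are all hypothetical and do not come from the proposal. The
point it illustrates is that handlers for older API versions stay registered
after an upgrade, so older clients keep getting the responses they expect.

```java
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical router that keeps one handler per API version. */
final class VersionedApi {
    private final TreeMap<Integer, Function<String, String>> handlers = new TreeMap<>();

    /** Registers the handler that serves every resource under the given API version. */
    void register(int apiVersion, Function<String, String> handler) {
        handlers.put(apiVersion, handler);
    }

    /** Dispatches a path such as "/v1/jobs" to the handler registered for version 1. */
    String handle(String path) {
        String[] parts = path.split("/", 3); // "/v1/jobs" -> ["", "v1", "jobs"]
        if (parts.length < 3 || !parts[1].startsWith("v")) {
            throw new IllegalArgumentException("Expected /v<version>/<resource>: " + path);
        }
        int version = Integer.parseInt(parts[1].substring(1));
        Function<String, String> handler = handlers.get(version);
        if (handler == null) {
            throw new IllegalArgumentException("Unsupported API version: " + version);
        }
        return handler.apply(parts[2]);
    }

    /** Highest API version this server currently offers. */
    int latestVersion() {
        return handlers.lastKey();
    }
}
```

In this sketch an upgrade amounts to registering a version 2 handler alongside
the version 1 handler, so requests to "/v1/..." are answered exactly as before.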
bq. I looked through the proposed source layout for this. Without a README
specifying what goes in which directories, it's hard for me to understand what
you're trying to accomplish. What's the "infra" project for?
The infra project would be the Sqoop infrastructure. We could name it "core" or
"arch" or another commonly used name. The purpose of this project is to define
the core system architecture and design, which gets used by other modules where
necessary.
bq. I think based on what I said above about Operations, etc, there should be a
"libsqoop" project that corresponds to the guts of the project. The "server"
should just be a REST API implementation (perhaps w/ an embedded Jetty server,
but also perhaps deployable as a WAR on a fully-administered Tomcat instance)
that embeds libsqoop to perform the Operations. And the client, similarly, is a
thin command-line-arg parsing shell that embeds libsqoop to perform Operations
directly.
I believe the infra module is what we are talking about here. I am hesitant to
give it a name that suggests it is a library since there will be a bit of logic
in dealing with extensions, job lifecycle, and other operational details which
actively define the overall functioning of the system. Effectively though, it
will still do the same thing as what you have suggested for libsqoop.
bq. Is infra ~= libsqoop in this idea? Or is that about independent testing of
connectors, etc?
Yes - infra ~ libsqoop.
bq. I think there should also be a plugin-api library (libsqoopapi?) which the
connector/*/ projects link against, rather than libsqoop itself. This API would
also be used by third-party SqoopTool implementations.
Good suggestion - we can have a separate module for the Sqoop extension API. It
probably belongs in the connection/api module.
> Proposal for next major revision of Sqoop.
> ------------------------------------------
>
> Key: SQOOP-365
> URL: https://issues.apache.org/jira/browse/SQOOP-365
> Project: Sqoop
> Issue Type: Wish
> Reporter: Arvind Prabhakar
> Assignee: Arvind Prabhakar
> Attachments: sqoop2.tar.gz
>
>
> This issue tracks the design and development of the next major revision of
> Sqoop. The proposal has been articulated on the wiki at the following
> location:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2
> Please review the proposal and add your comments to this JIRA.