[
https://issues.apache.org/jira/browse/SQOOP-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363273#comment-14363273
]
Jarek Jarcec Cecho commented on SQOOP-1803:
-------------------------------------------
Thank you for putting it together, [~vybs].
Indeed, the current {{MutableContext}} serializes all the data as Strings, but
that is just an internal detail modeled on what Hadoop's {{Configuration}}
has been doing. We still expose {{setBoolean}}, {{setInt}}, ... methods and
their {{getType}} counterparts, so a connector developer can store any type in
the {{Context}}. It is, however, his responsibility to remember what type has
been stored there (e.g. we do not persist the information that property "X"
has been saved as a long). The {{MutableContext}} is not persisted in our
repository and is meant more as a transient store specific to a given
submission; I believe that the context is fully lost after the submission ends.
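For illustration, here is a small sketch of what that looks like from the
connector side. The helper class and property key are made up, and I'm
assuming the {{setLong}}/{{getLong}} pair mentioned above:
{code:java}
import org.apache.sqoop.common.ImmutableContext;
import org.apache.sqoop.common.MutableContext;

// Hypothetical helper; the property key is made up for illustration.
public class WatermarkHelper {
  private static final String KEY = "myconnector.incremental.last.value";

  public static void remember(MutableContext context, long watermark) {
    // The value is serialized internally as a String;
    // the typed setter hides that detail.
    context.setLong(KEY, watermark);
  }

  public static long recall(ImmutableContext context) {
    // Nothing in the context records that KEY was stored as a long,
    // so the connector must call the matching typed getter itself.
    return context.getLong(KEY, 0L);
  }
}
{code}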
Hence I think that we should have a contract somewhere in the connector API
through which, given the context object, the connector can update the
appropriate configuration objects. A couple of ideas:
1) We currently call
{{[Initializer.initialize()|https://github.com/apache/sqoop/blob/sqoop2/connector/connector-sdk/src/main/java/org/apache/sqoop/job/etl/Initializer.java#L47]}}
on every job initialization (in both the From and To context). We could allow the
connector to change the given configuration objects (see the sketch below). If
and only if the job is successful, we would persist the updated configuration
objects in the repository via the normal update path (the same one that is used
by the user). As the job submission is asynchronous, we might need to come up
with a mechanism to persist the updated configuration objects with the Hadoop
job itself and get them back later.
*Pros:* Seems relatively simple to implement as we are already preserving a lot
of information with the Hadoop job itself.
*Cons:* We would introduce a kind of "implied" or "secret" API, as the connector
developer has to know that he is allowed to change the configuration objects.
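To make the implied contract of option 1 concrete, here is a rough sketch. The
configuration classes are hypothetical stand-ins for a connector's real config
objects, and the entry point mirrors {{Initializer.initialize()}}:
{code:java}
// Hypothetical config classes, standing in for a connector's real
// configuration objects.
class MyIncrementalConfig { long lastImportedValue; }
class MyFromJobConfiguration {
  MyIncrementalConfig incremental = new MyIncrementalConfig();
}

// Sketch of option 1: the connector mutates the configuration object that it
// was handed during initialization; the server would persist the mutation
// back to the repository only if the job finishes successfully.
class MyInitializer {
  public void initialize(MyFromJobConfiguration jobConfig) {
    // Connector-specific logic; here we just fake a new high watermark.
    jobConfig.incremental.lastImportedValue = System.currentTimeMillis();
  }
}
{code}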
2) Alternatively, we could expose an explicit API
{{updateConfigurationObjects(Context, LinkConfiguration, JobConfiguration)}}
(proper name pending) that the connector developer could explicitly implement
if he cares about updating the configuration objects.
*Pros:* We have an explicit API with nicely defined semantics. We don't need to
persist any additional information in the Hadoop job object.
As this API would make sense only after the job has successfully finished, we
could introduce it in one of two places (a rough sketch of the contract follows
the two options below):
2.1) Introduce it as part of
[Destroyer|https://github.com/apache/sqoop/blob/sqoop2/connector/connector-sdk/src/main/java/org/apache/sqoop/job/etl/Destroyer.java].
*Pros:* Updating the configuration objects is part of the clean-up phase, so it
makes sense to have it as part of the {{Destroyer}}.
*Cons:* Currently the {{Destroyer}} runs outside of the Sqoop 2 server,
somewhere on the cluster. We would either have to move the {{Destroyer}} to be
executed in the server, or simply call this particular method on a different
instance of the {{Destroyer}} - and that might be a bit confusing.
2.2) Introduce a new part of the workflow that will be executed after the
{{Destroyer}}. Something like an {{Updater}}.
*Pros:* We can easily run it on the Sqoop 2 server itself without moving the
{{Destroyer}} or caring about where it runs.
*Cons:* It seems weird to have a part of the workflow that is executed after the
final step, especially when it has the same semantics as the {{Destroyer}} (we
will call it exactly once, on one node).
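Whichever placement wins, a rough sketch of the explicit contract could look
like this; the class and method names are placeholders, not an agreed API:
{code:java}
import org.apache.sqoop.common.ImmutableContext;

// Placeholder sketch of option 2: an explicit, opt-in callback that the
// server would invoke exactly once after a successful job, persisting the
// mutated configuration objects through the normal update path.
public abstract class Updater<LinkConfiguration, JobConfiguration> {

  // Default no-op, so that only connectors that care about updating their
  // configuration objects need to override it.
  public void updateConfigurationObjects(ImmutableContext context,
                                         LinkConfiguration linkConfiguration,
                                         JobConfiguration jobConfiguration) {
    // Intentionally empty.
  }
}
{code}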
I'm sure that there are other ways to expose this contract in the connector
interface, so don't hesitate to jump in with other ideas!
> JobManager and Execution Engine changes: Support for injecting and pulling
> out configs and job output in connectors
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: SQOOP-1803
> URL: https://issues.apache.org/jira/browse/SQOOP-1803
> Project: Sqoop
> Issue Type: Sub-task
> Reporter: Veena Basavaraj
> Assignee: Veena Basavaraj
> Fix For: 1.99.6
>
>
> The details are in the design wiki; as the implementation happens, more
> discussion can happen here.
> https://cwiki.apache.org/confluence/display/SQOOP/Delta+Fetch+And+Merge+Design#DeltaFetchAndMergeDesign-Howtogetoutputfromconnectortosqoop?
> The goal is to dynamically inject an IncrementalConfig instance into the
> FromJobConfiguration. The current MFromConfig and MToConfig can already hold
> a list of configs, and a strong sentiment was expressed to keep it as a list,
> so why not, for the first time, actually make use of it and group the
> incremental-related configs in one config object?
> This task will prepare the FromJobConfiguration from the job config data, and
> the ExtractorContext with the relevant values from the previous job run.
> It will likewise prepare the ToJobConfiguration from the job config data, and
> the LoaderContext with the relevant values from the previous job run, if any.
> We will use the DistributedCache to get state information out of the Extractor
> and Loader, and finally persist it into the Sqoop repository (depending on
> SQOOP-1804) once the output committer's commit is called.