[
https://issues.apache.org/jira/browse/SQOOP-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221479#comment-14221479
]
Veena Basavaraj commented on SQOOP-1771:
----------------------------------------
The Postgres format is very similar to JSON, but it needs more hand-rolling for
both arrays and maps instead of relying on a standard JSON library. Between
hand-rolling and using the Jackson ObjectMapper, the performance difference is
highly unlikely to be significant.
I would still prefer using a standard JSON library for encoding maps and nested
arrays, so that the connectors can use the same standard as well.
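For illustration, a minimal sketch of the standard-library route; the class
below is hypothetical (not part of Sqoop) and assumes the com.fasterxml Jackson
databind artifact:
{code}
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JsonIdfSketch {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static void main(String[] args) throws Exception {
    // A nested array column value: [1, [2, 3], [4, [5]]]
    List<Object> nested =
        Arrays.asList(1, Arrays.asList(2, 3), Arrays.asList(4, Arrays.asList(5)));
    // A map column value
    Map<String, Object> map = new LinkedHashMap<>();
    map.put("a", 1);
    map.put("b", Arrays.asList(2, 3));

    // One call handles nesting, quoting and escaping uniformly
    System.out.println(MAPPER.writeValueAsString(nested)); // [1,[2,3],[4,[5]]]
    System.out.println(MAPPER.writeValueAsString(map));    // {"a":1,"b":[2,3]}
  }
}
{code}
Connectors on both sides could then decode with the same ObjectMapper.readValue()
call, which is the "same standard" point above.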
> Investigate the FORMAT of Array/NestedArray/Set/Map in Postgres and HIVE.
> --------------------------------------------------------------------------
>
> Key: SQOOP-1771
> URL: https://issues.apache.org/jira/browse/SQOOP-1771
> Project: Sqoop
> Issue Type: Sub-task
> Components: sqoop2-framework
> Reporter: Veena Basavaraj
> Fix For: 1.99.5
>
>
> Update this wiki, which is missing details on the complex types:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
> The above document does not explicitly state the design goals behind choosing
> the IDF format for different types, but based on the conversation on one of
> the related tickets (RB: https://reviews.apache.org/r/28139/diff/#), here are
> the considerations.
> The Intermediate Data Format (IDF) is most relevant when we transfer data
> between FROM and TO and the two sides do not agree on the same form for the
> data as it is transferred.
> The IDF API as of today exposes three kinds of setters (with matching
> getters): one for a generic type T, one for text/String, and one for an
> Object array.
> {code}
> /**
> * Set one row of data. If validate is set to true, the data is validated
> * against the schema.
> * @param data - A single row of data to be moved.
> */
> public void setData(T data) {
> this.data = data;
> }
> /**
> * Get one row of data.
> *
> * @return - One row of data, represented in the internal/native format of
> * the intermediate data format implementation.
> */
> public T getData() {
> return data;
> }
> /**
> * Get one row of data as CSV.
> *
> * @return - String representing the data in CSV, according to the "FROM" schema.
> * No schema conversion is done on textData, to keep it as "high performance" option.
> */
> public abstract String getTextData();
> /**
> * Set one row of data as CSV.
> *
> */
> public abstract void setTextData(String text);
> /**
> * Get one row of data as an Object array.
> *
> * @return - String representing the data as an Object array
> * If FROM and TO schema exist, we will use SchemaMatcher to get the data
> * according to "TO" schema
> */
> public abstract Object[] getObjectData();
> /**
> * Set one row of data as an Object array.
> *
> */
> public abstract void setObjectData(Object[] data);
> {code}
> NOTE: the javadocs are not completely accurate; there is really no validation
> happening :). Second, CSV is just one way the IDF can be represented when it
> is text; there can be other implementations of the IDF as well, such as Avro
> or JSON, very similar to the SerDe interface in Hive. The claim "String
> representing the data in CSV, according to the "FROM" schema. No schema
> conversion is done on textData, to keep it as "high performance" option." is
> also not accurate: the CSV format is a standard enforced by Sqoop; the FROM
> schema does not enforce it.
> Anyway, the design considerations seem to be the following:
> 1. setTextData/getTextData are supposed to allow FROM and TO to talk the same
> language and hence should involve very minimal transformation as the data
> flows through Sqoop. This means that both FROM and TO agree to hand over data
> in the CSV IDF that is standardized in the wiki/spec/docs, and to read data
> back in the same format. Transformation may have to happen before
> setTextData() or after getTextData(), but nothing happens in between while
> the data flows through Sqoop. If FROM does a setTextData and TO does a
> getObjectData, then time is spent converting the elements within the CSV
> string into actual Java objects: parsing and unescaping/decoding happen
> inside Sqoop (see the sketch after this list).
> 2. The current proposal seems to recommend the formats that are most
> prominent among the databases explored in the list, but that is not a
> complete set of all the data sources/connectors Sqoop may have in the future.
> Most emphasis is on relational DB stores, since historically Sqoop 1 only
> supported those as the FROM source:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
> But overall the goal seems to lean toward the SQL dump and pg_dump style,
> which uses the CSV format, and the hope is that such transfers will happen
> more often in Sqoop.
> 3. To avoid spending any CPU cycles, no validation is done to make sure that
> the data adheres to the CSV format. It is a trust-based system: the incoming
> data is assumed to follow the CSV rules depicted in the link above:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
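> To make the cost in point 1 concrete, here is an illustrative sketch (not the
> actual Sqoop CSV rules, which also cover escaping, binary data, etc.) of the
> parsing work triggered when FROM hands over text and TO asks for objects:
> {code}
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
>
> // Illustrative only: what getObjectData() must do when the row arrived as
> // CSV text. Every field is scanned, unquoted and converted to a Java object.
> public class CsvParseSketch {
>
>   static Object[] parse(String csvRow) {
>     List<Object> out = new ArrayList<>();
>     StringBuilder field = new StringBuilder();
>     boolean quoted = false;
>     for (char c : csvRow.toCharArray()) {
>       if (c == '\'') {
>         quoted = !quoted;                 // toggle on the quote character
>       } else if (c == ',' && !quoted) {
>         out.add(convert(field.toString()));
>         field.setLength(0);
>       } else {
>         field.append(c);
>       }
>     }
>     out.add(convert(field.toString()));
>     return out.toArray();
>   }
>
>   static Object convert(String raw) {
>     if (raw.equals("NULL")) return null;  // NULL marker
>     try { return Long.valueOf(raw); } catch (NumberFormatException ignored) { }
>     return raw;                           // fall back to the string value
>   }
>
>   public static void main(String[] args) {
>     // prints [1, hello, world, null]
>     System.out.println(Arrays.toString(parse("1,'hello, world',NULL")));
>   }
> }
> {code}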
> Next, with these design goals understood, the format to encode nested arrays
> and maps can be chosen in a few ways.
> Two examples were explored: Hive and Postgres; details are given in the
> comments below. One of the simplest ways is to use the universal Jackson JSON
> API for nested arrays and maps.
> The Postgres format is very similar to that, but just needs more hand-rolling
> instead of relying on a standard JSON library; for both arrays and maps, this
> format could be used as a standard. Between this and actually using the
> Jackson ObjectMapper, the performance difference is highly unlikely to be
> significant.
> I would still prefer using a standard JSON library for encoding maps and
> nested arrays, so that the connectors can use the same standard as well.
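> For contrast, a hand-rolled Postgres-style array encoder starts out small,
> but every quoting, escaping and NULL rule becomes our responsibility; a
> minimal illustrative sketch (not the full Postgres literal grammar):
> {code}
> import java.util.Arrays;
> import java.util.List;
>
> // Illustrative hand-rolled encoder for Postgres-style array literals,
> // e.g. {1,{2,3}}. Quoting, escaping and NULLs are deliberately omitted;
> // maintaining those rules by hand is exactly the cost in question.
> public class PostgresArraySketch {
>
>   static String encode(Object value) {
>     if (value instanceof List) {
>       StringBuilder sb = new StringBuilder("{");
>       List<?> list = (List<?>) value;
>       for (int i = 0; i < list.size(); i++) {
>         if (i > 0) sb.append(',');
>         sb.append(encode(list.get(i)));   // recurse for nested arrays
>       }
>       return sb.append('}').toString();
>     }
>     return String.valueOf(value);
>   }
>
>   public static void main(String[] args) {
>     System.out.println(encode(Arrays.asList(1, Arrays.asList(2, 3)))); // {1,{2,3}}
>     // The Jackson equivalent is a single call:
>     // new ObjectMapper().writeValueAsString(value)  -> [1,[2,3]]
>   }
> }
> {code}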
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)