[ 
https://issues.apache.org/jira/browse/SQOOP-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221479#comment-14221479
 ] 

Veena Basavaraj commented on SQOOP-1771:
----------------------------------------

The Postgres format is very similar to JSON, but it needs more hand-rolling
for both arrays and maps instead of relying on a standard JSON library.
Between hand-rolling and using the Jackson ObjectMapper, the performance
difference is highly unlikely to be significant.
I would still prefer using a standard JSON library for encoding maps and
nested arrays, so that the connectors can use the same standard as well.
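
For illustration, a minimal sketch of the Jackson-based encoding (the class
name and sample values are hypothetical; it assumes the Jackson 1.x
ObjectMapper from org.codehaus.jackson):

{code}
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

import org.codehaus.jackson.map.ObjectMapper;

public class JsonComplexTypeEncoding {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  // A map column value becomes one JSON object string inside the CSV row.
  static String encodeMap(Map<String, Object> value) throws Exception {
    return MAPPER.writeValueAsString(value);
  }

  // A nested array column value becomes one JSON array string.
  static String encodeNestedArray(Object value) throws Exception {
    return MAPPER.writeValueAsString(value);
  }

  public static void main(String[] args) throws Exception {
    Map<String, Object> map = new LinkedHashMap<String, Object>();
    map.put("a", 1);
    map.put("b", Arrays.asList(2, 3));
    System.out.println(encodeMap(map));  // {"a":1,"b":[2,3]}
    System.out.println(encodeNestedArray(
        Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4)))); // [[1,2],[3,4]]
  }
}
{code}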

> Investigation FORMAT of the Array/NestedArray/ Set/ Map in Postgres and HIVE.
> -----------------------------------------------------------------------------
>
>                 Key: SQOOP-1771
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1771
>             Project: Sqoop
>          Issue Type: Sub-task
>          Components: sqoop2-framework
>            Reporter: Veena Basavaraj
>             Fix For: 1.99.5
>
>
> Update this wiki, which is missing details on the complex types:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
> The above document does not explicitly state the design goals behind choosing
> the IDF format for the different types, but based on the conversation in a
> related review (RB: https://reviews.apache.org/r/28139/diff/#), here are the
> considerations.
> The Intermediate Data Format is most relevant when we transfer data between
> FROM and TO and the two sides do not agree on the same representation of the
> data in transit.
> The IDF API as of today exposes three kinds of setters and getters: one for a
> generic type T, one for Text/String, and one for an Object array.
> {code}
>   /**
>    * Set one row of data. If validate is set to true, the data is validated
>    * against the schema.
>    * @param data - A single row of data to be moved.
>    */
>   public void setData(T data) {
>     this.data = data;
>   }
>
>   /**
>    * Get one row of data.
>    *
>    * @return - One row of data, represented in the internal/native format of
>    *         the intermediate data format implementation.
>    */
>   public T getData() {
>     return data;
>   }
>
>   /**
>    * Get one row of data as CSV.
>    *
>    * @return - String representing the data in CSV, according to the "FROM" schema.
>    * No schema conversion is done on textData, to keep it as "high performance" option.
>    */
>   public abstract String getTextData();
>
>   /**
>    * Set one row of data as CSV.
>    */
>   public abstract void setTextData(String text);
>
>   /**
>    * Get one row of data as an Object array.
>    *
>    * @return - String representing the data as an Object array
>    * If FROM and TO schema exist, we will use SchemaMatcher to get the data according to "TO" schema
>    */
>   public abstract Object[] getObjectData();
>
>   /**
>    * Set one row of data as an Object array.
>    */
>   public abstract void setObjectData(Object[] data);
> {code}
> NOTE: the javadocs are not completely accurate; there is really no validation
> happening :). Also, CSV is only one way the IDF can be represented when it is
> text. There can be other IDF implementations as well, such as Avro or JSON,
> very similar to the SerDe interface in Hive. The statement "String
> representing the data in CSV, according to the 'FROM' schema. No schema
> conversion is done on textData, to keep it as 'high performance' option." is
> also not accurate: the CSV format is a standard enforced by sqoop; the FROM
> schema does not enforce it.
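> As a purely hypothetical sketch (no such class exists in sqoop today), a
> JSON-based IDF could plug in behind the same abstract methods, much like
> swapping in a different SerDe in Hive:
> {code}
> import java.io.IOException;
>
> import org.codehaus.jackson.map.ObjectMapper;
>
> // Hypothetical: stores each row natively as one JSON array string.
> public class JSONIntermediateDataFormat extends IntermediateDataFormat<String> {
>
>   private final ObjectMapper mapper = new ObjectMapper();
>
>   @Override
>   public String getTextData() {
>     return data;      // the native representation is already text
>   }
>
>   @Override
>   public void setTextData(String text) {
>     this.data = text; // trusted to be a well-formed JSON array
>   }
>
>   @Override
>   public Object[] getObjectData() {
>     try {
>       return mapper.readValue(data, Object[].class); // parse only on demand
>     } catch (IOException e) {
>       throw new RuntimeException("Malformed JSON row: " + data, e);
>     }
>   }
>
>   @Override
>   public void setObjectData(Object[] objects) {
>     try {
>       this.data = mapper.writeValueAsString(objects);
>     } catch (IOException e) {
>       throw new RuntimeException(e);
>     }
>   }
> }
> {code}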
> Anyway, the design considerations seem to be the following:
> 1. setTextData/getTextData are supposed to let FROM and TO talk the same
> language, and hence should involve very minimal transformation as the data
> flows through sqoop. This means both FROM and TO agree to write and read data
> in the CSV IDF standardized in the wiki/spec/docs. Transformation may have to
> happen before setTextData() or after getTextData(), but nothing happens in
> between while the data flows through sqoop. If the FROM side does a
> setTextData and the TO side does a getObjectData, then time is spent
> converting the elements within the CSV string into actual Java objects,
> meaning parsing and unescaping/decoding happen inside sqoop (see the snippet
> after this list).
> 2. The current proposal recommends the formats that are most prominent in the
> databases explored in the list, but that is not a complete set of all the
> data sources/connectors sqoop may have in the future. Most emphasis is on the
> relational DB stores, since historically sqoop1 only supported those as the
> FROM source:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
> Overall, though, the goal leans toward mysqldump- and pg_dump-style transfers
> that use a CSV format, in the hope that such transfers will become the common
> case in sqoop.
> 3. To avoid spending CPU cycles, no validation is done to make sure the data
> adheres to the CSV format. It is a trust-based system: the incoming data is
> expected to follow the CSV rules described in the link above:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
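> To make the cost in point 1 concrete, here is an illustrative, deliberately
> simplified snippet (the quoting/escaping rules are reduced to a naive form;
> the real CSV IDF rules are stricter):
> {code}
> import java.util.ArrayList;
> import java.util.List;
>
> public class CsvMismatchCost {
>
>   // setTextData/getTextData path: the row passes through untouched.
>   static String passThrough(String csvRow) {
>     return csvRow;
>   }
>
>   // setTextData/getObjectData path: every field must be split, unquoted
>   // and unescaped before the TO side can see Java objects.
>   static List<String> toFields(String csvRow) {
>     List<String> fields = new ArrayList<String>();
>     for (String raw : csvRow.split(",")) {        // naive split, for illustration
>       String f = raw.trim();
>       if (f.startsWith("'") && f.endsWith("'")) { // strip single-quote wrapping
>         f = f.substring(1, f.length() - 1).replace("\\'", "'");
>       }
>       fields.add(f);
>     }
>     return fields;
>   }
>
>   public static void main(String[] args) {
>     String row = "10,'hello\\'s',NULL";
>     System.out.println(passThrough(row)); // zero parsing cost
>     System.out.println(toFields(row));    // [10, hello's, NULL]
>   }
> }
> {code}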
> Next, knowing these design goals, the format to encode nested arrays and maps
> can be chosen in a few ways. Two examples were explored, Hive and Postgres;
> details are given in the comments below. One of the simplest options is to
> use the standard Jackson JSON API for nested arrays and maps.
> The Postgres format is very similar to that, but it needs more hand-rolling
> for both arrays and maps instead of relying on a standard JSON library; it
> could also serve as the standard. Between hand-rolling and using the Jackson
> ObjectMapper, the performance difference is highly unlikely to be significant
> (see the side-by-side sketch below).
> I would still prefer using a standard JSON library for encoding maps and
> nested arrays, so that the connectors can use the same standard as well.
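> To show the two candidate encodings side by side, a small sketch (the class
> name is hypothetical, and quoting/escaping of string elements in the
> Postgres-style literal is omitted):
> {code}
> import java.util.Arrays;
> import java.util.List;
>
> import org.codehaus.jackson.map.ObjectMapper;
>
> public class NestedArrayEncodings {
>
>   // Hand-rolled Postgres-style array literal, e.g. {{1,2},{3,4}}.
>   static String toPostgresLiteral(Object value) {
>     if (value instanceof List) {
>       StringBuilder sb = new StringBuilder("{");
>       List<?> list = (List<?>) value;
>       for (int i = 0; i < list.size(); i++) {
>         if (i > 0) {
>           sb.append(',');
>         }
>         sb.append(toPostgresLiteral(list.get(i)));
>       }
>       return sb.append('}').toString();
>     }
>     return String.valueOf(value); // scalars; real code must quote/escape strings
>   }
>
>   public static void main(String[] args) throws Exception {
>     List<List<Integer>> nested =
>         Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4));
>     System.out.println(toPostgresLiteral(nested));                     // {{1,2},{3,4}}
>     System.out.println(new ObjectMapper().writeValueAsString(nested)); // [[1,2],[3,4]]
>   }
> }
> {code}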



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
