[ https://issues.apache.org/jira/browse/SQOOP-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Veena Basavaraj updated SQOOP-1771:
-----------------------------------
    Description: 
Update this wiki, which is missing details on the complex types:

https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal

The above document does not explicitly state the design goals for choosing the 
IDF format for the different types, but they came up in the conversation on one 
of the related tickets (RB: https://reviews.apache.org/r/28139/diff/#). Here are 
the considerations.

The Intermediate Data Format (IDF) is most relevant when we transfer data 
between the FROM and the TO and the two do not agree on the same form for the 
data as it is transferred.

The IDF API as of today exposes three kinds of setters/getters: one for a 
generic type T, one for text/String, and one for an Object array.
{code}
  /**
   * Set one row of data. If validate is set to true, the data is validated
   * against the schema.
   *
   * @param data - A single row of data to be moved.
   */
  public void setData(T data) {
    this.data = data;
  }

  /**
   * Get one row of data.
   *
   * @return - One row of data, represented in the internal/native format of
   *         the intermediate data format implementation.
   */
  public T getData() {
    return data;
  }

  /**
   * Get one row of data as CSV.
   *
   * @return - String representing the data in CSV, according to the "FROM" schema.
   * No schema conversion is done on textData, to keep it as "high performance" option.
   */
  public abstract String getTextData();

  /**
   * Set one row of data as CSV.
   */
  public abstract void setTextData(String text);

  /**
   * Get one row of data as an Object array.
   *
   * @return - String representing the data as an Object array
   * If FROM and TO schema exist, we will use SchemaMatcher to get the data according to "TO" schema
   */
  public abstract Object[] getObjectData();

  /**
   * Set one row of data as an Object array.
   */
  public abstract void setObjectData(Object[] data);
{code}

NOTE: the javadocs are not completely accurate; there is really no validation 
happening :). Second, CSV is only one way the IDF can be represented when it is 
text; there can be other IDF implementations as well, such as Avro or JSON, 
very similar to the SerDe interface in Hive. The statement "String representing 
the data in CSV, according to the "FROM" schema. No schema conversion is done 
on textData, to keep it as "high performance" option." is also not accurate: 
the CSV format is a standard enforced by Sqoop, not something the FROM schema 
enforces.
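For illustration only, here is what a single row in the CSV IDF could look like; the column values are made up, and the authoritative quoting/escaping rules are the ones in the wiki linked above:

{code}
// Hypothetical example row: an integer, a decimal, a single-quoted string and a null.
// The exact quoting/escaping rules are defined by the Sqoop CSV IDF spec, not invented here.
String csvRow = "10,34.5,'hello, world',NULL";
{code}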

Anyway, the design considerations seem to be the following:

1. setText/getText are supposed to allow the FROM and TO to talk the same 
language and hence should involve very minimal transformation as the data flows 
through Sqoop. This means that both FROM and TO agree to provide data in the CSV 
IDF that is standardized in the wiki/spec/docs and to read data back in the same 
format. Transformation may have to happen before the setText() or after the 
getText(), but nothing happens in between while the data flows through Sqoop. If 
the FROM does a setText and the TO does a getObject, then time is spent 
converting the elements within the CSV string to actual Java objects, which 
means parsing and unescaping/decoding happens inside Sqoop (see the sketch after 
this list).



2. The current proposal seems to recommend the formats that are most prominent 
among the databases explored in the list, but that is not really a complete set 
of all the data sources/connectors Sqoop may have in the future. Most emphasis 
is on relational DB stores, since historically Sqoop 1 only supported those as 
the FROM source:
https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal

But overall the goal seems to lean toward SQL dump and pg_dump style transfers 
that use the CSV format, and the hope is that such transfers will become more 
common in Sqoop.

3. To avoid spending extra CPU cycles, no validation is done to make sure that 
the data adheres to the CSV format. It is a trust-based system: the incoming 
data is expected to follow the CSV rules described in the link above 
https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation#Sqoop2Intermediaterepresentation-Intermediateformatrepresentationproposal
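To make the cost in point 1 concrete, here is a minimal, self-contained sketch of the kind of work Sqoop has to do when the FROM side calls setTextData() and the TO side calls getObjectData(). The parsing and conversion code below is invented for illustration and is deliberately naive; it is not the actual Sqoop implementation.

{code}
import java.util.ArrayList;
import java.util.List;

public class CsvToObjectSketch {

  // Naive field splitter: honours single-quoted strings but ignores escape
  // sequences; a real CSV IDF parser also has to handle escaping per the wiki rules.
  static List<String> parseCsvFields(String csvRow) {
    List<String> fields = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (char c : csvRow.toCharArray()) {
      if (c == '\'') {
        inQuotes = !inQuotes;
        current.append(c);
      } else if (c == ',' && !inQuotes) {
        fields.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    fields.add(current.toString());
    return fields;
  }

  // Convert a single CSV field to a Java object; this toy type switch stands in
  // for the per-column conversion that would really be driven by the schema.
  static Object toJavaObject(String field) {
    if (field.equals("NULL")) {
      return null;
    }
    if (field.startsWith("'") && field.endsWith("'")) {
      return field.substring(1, field.length() - 1); // strip quotes, unescaping skipped
    }
    if (field.contains(".")) {
      return Double.parseDouble(field);
    }
    return Long.parseLong(field);
  }

  public static void main(String[] args) {
    String csvRow = "10,34.5,'hello, world',NULL"; // what the FROM side set via setTextData()
    List<String> fields = parseCsvFields(csvRow);
    Object[] row = new Object[fields.size()];
    for (int i = 0; i < fields.size(); i++) {
      row[i] = toJavaObject(fields.get(i));        // what getObjectData() must produce
    }
    System.out.println(java.util.Arrays.toString(row)); // [10, 34.5, hello, world, null]
  }
}
{code}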

Next, with these design goals known, the format to encode nested arrays and 
maps can be chosen in a few ways.

Two examples were explored: Hive and Postgres. Details are given in the 
comments below. One of the simplest ways is to use the universal JSON Jackson 
API for nested arrays and maps, as in the sketch below.
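A minimal sketch, assuming Jackson's ObjectMapper (com.fasterxml.jackson.databind) is on the classpath; the column shapes below are hypothetical:

{code}
// Hedged sketch: encode a nested array column and a map column as JSON text
// inside the CSV IDF string, using Jackson's ObjectMapper.
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JsonComplexTypeSketch {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();

    // A nested array column, e.g. ARRAY<ARRAY<INT>>
    List<List<Integer>> nestedArray = Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4));
    String arrayText = mapper.writeValueAsString(nestedArray);   // [[1,2],[3,4]]

    // A map column, e.g. MAP<STRING, INT>
    Map<String, Integer> map = new LinkedHashMap<>();
    map.put("a", 1);
    map.put("b", 2);
    String mapText = mapper.writeValueAsString(map);             // {"a":1,"b":2}

    // Decoding on the other side is symmetric.
    List<?> decodedArray = mapper.readValue(arrayText, List.class);
    Map<?, ?> decodedMap = mapper.readValue(mapText, Map.class);
    System.out.println(decodedArray + " " + decodedMap);
  }
}
{code}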

The Postgres format is very similar to that, but it needs more hand-rolling 
instead of relying on a standard JSON library. For both arrays and maps, that 
format could be used as a standard. Between it and actually using the Jackson 
ObjectMapper, the performance difference is highly unlikely to matter.
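For comparison, a hedged sketch of what hand-rolling a Postgres-style text encoding might look like ({...} array literals and hstore-like key=>value pairs for maps); escaping is deliberately omitted, and none of this is a proposed spec:

{code}
// Illustration of hand-rolling a Postgres-style text encoding for arrays and maps
// instead of delegating to a JSON library. A real implementation would also have
// to define and enforce escaping rules.
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class PostgresStyleSketch {

  // {1,2} or {{1,2},{3,4}} for nested arrays, similar to Postgres array literals.
  static String encodeArray(List<?> values) {
    StringJoiner joiner = new StringJoiner(",", "{", "}");
    for (Object value : values) {
      joiner.add(value instanceof List ? encodeArray((List<?>) value) : String.valueOf(value));
    }
    return joiner.toString();
  }

  // "a"=>"1","b"=>"2", similar to Postgres hstore output.
  static String encodeMap(Map<?, ?> map) {
    StringJoiner joiner = new StringJoiner(",");
    for (Map.Entry<?, ?> entry : map.entrySet()) {
      joiner.add("\"" + entry.getKey() + "\"=>\"" + entry.getValue() + "\"");
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    System.out.println(encodeArray(Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4))));
    Map<String, Integer> map = new LinkedHashMap<>();
    map.put("a", 1);
    map.put("b", 2);
    System.out.println(encodeMap(map));
  }
}
{code}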

I would still prefer using a standard JSON library for encoding maps and nested 
arrays, so that the connectors can use the same standard as well.







> Investigation FORMAT of the Array/NestedArray/ Set/ Map in Postgres and HIVE.
> -----------------------------------------------------------------------------
>
>                 Key: SQOOP-1771
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1771
>             Project: Sqoop
>          Issue Type: Sub-task
>          Components: sqoop2-framework
>            Reporter: Veena Basavaraj
>             Fix For: 1.99.5
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
