Discussion regarding design of data load after kettle removal.

Ravindra Pesala Sat, 08 Oct 2016 03:01:12 -0700

Hi All,


Removing Kettle from CarbonData is necessary because the legacy Kettle
framework has become an overhead for CarbonData. This discussion is about
the design of the carbon data load without Kettle.

The main interface for data loading here is DataLoadProcessorStep.

/**
 * This is the base interface for data loading. It can do transformation
 * jobs as per the implementation.
 */
public interface DataLoadProcessorStep {

  /**
   * The output meta for this step. The data returned from this step is as
   * per this meta.
   * @return
   */
  DataField[] getOutput();

  /**
   * Initialization process for this step.
   * @param configuration
   * @param child
   * @throws CarbonDataLoadingException
   */
  void intialize(CarbonDataLoadConfiguration configuration,
      DataLoadProcessorStep child) throws CarbonDataLoadingException;

  /**
   * Transform the data as per the implementation.
   * @return Iterator of data
   * @throws CarbonDataLoadingException
   */
  Iterator<Object[]> execute() throws CarbonDataLoadingException;

  /**
   * Any closing of resources after step execution can be done here.
   */
  void finish();
}

The implementation classes for DataLoadProcessorStep are
InputProcessorStep, EncoderProcessorStep, SortProcessorStep and
DataWriterProcessorStep.

The following picture depicts the loading process with implementation
classes.

[image: Inline images 2]

*InputProcessorStep* : It does two jobs: 1. it reads data from the
RecordReader of the InputFormat; 2. it parses each field of a column as per
its data type.
*EncoderProcessorStep* : It encodes each field with a dictionary if
required, and combines all no-dictionary columns into a single byte array.
*SortProcessorStep* : It sorts the data on the dimension columns and writes
it to intermediate files.
*DataWriterProcessorStep* : It merge-sorts the data from the intermediate
temp files, generates the mdk key, and writes the data in carbondata format
to the store.
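
To make the chaining concrete, here is a minimal sketch of how steps could
pull rows from their child step. It is not the code from the PR: the Step
interface is a hypothetical, trimmed-down stand-in for
DataLoadProcessorStep (no DataField, no configuration, no checked
exception, and the method is spelled initialize), the three step classes
are toy versions, and sorting is done in memory instead of through
intermediate files.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for DataLoadProcessorStep, reduced to the pull-based core.
interface Step {
  void initialize(Step child);
  Iterator<Object[]> execute();
  void finish();
}

// Toy InputProcessorStep: reads raw "name,amount" lines and parses each field.
class InputStep implements Step {
  private final List<String> lines;
  InputStep(List<String> lines) { this.lines = lines; }
  public void initialize(Step child) { /* leaf step, it has no child */ }
  public Iterator<Object[]> execute() {
    Iterator<String> it = lines.iterator();
    return new Iterator<Object[]>() {
      public boolean hasNext() { return it.hasNext(); }
      public Object[] next() {
        String[] parts = it.next().split(",");
        return new Object[] {parts[0], Integer.parseInt(parts[1])};
      }
    };
  }
  public void finish() { }
}

// Toy EncoderProcessorStep: replaces column 0 with an int dictionary key.
class EncodeStep implements Step {
  private Step child;
  private final Map<Object, Integer> dictionary = new HashMap<>();
  public void initialize(Step child) { this.child = child; }
  public Iterator<Object[]> execute() {
    Iterator<Object[]> in = child.execute();
    return new Iterator<Object[]>() {
      public boolean hasNext() { return in.hasNext(); }
      public Object[] next() {
        Object[] row = in.next();
        Integer key = dictionary.get(row[0]);
        if (key == null) {
          key = dictionary.size() + 1; // keys start at 1 (assumption)
          dictionary.put(row[0], key);
        }
        row[0] = key;
        return row;
      }
    };
  }
  public void finish() { child.finish(); }
}

// Toy SortProcessorStep: sorts on the encoded dimension, in memory only.
class SortStep implements Step {
  private Step child;
  public void initialize(Step child) { this.child = child; }
  public Iterator<Object[]> execute() {
    List<Object[]> rows = new ArrayList<>();
    child.execute().forEachRemaining(rows::add);
    rows.sort(Comparator.comparingInt(r -> (Integer) r[0]));
    return rows.iterator();
  }
  public void finish() { child.finish(); }
}

class PipelineDemo {
  // Wires input -> encode -> sort and drains the last step.
  static List<Object[]> run(List<String> lines) {
    Step input = new InputStep(lines);
    Step encode = new EncodeStep();
    Step sort = new SortStep();
    input.initialize(null);
    encode.initialize(input);
    sort.initialize(encode);
    List<Object[]> out = new ArrayList<>();
    sort.execute().forEachRemaining(out::add);
    sort.finish();
    return out;
  }
}
```

Calling PipelineDemo.run with a few "name,amount" lines returns the rows
ordered by dictionary key, not by the original string value, because the
sort step only sees encoded data. A writer step would be chained after the
sort step in the same way.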



The following interface is for dictionary generation.

/**
 * Generates dictionary for the column. The implementation classes can be
 * pre-defined or local or global dictionary generations.
 */
public interface ColumnDictionaryGenerator {

  /**
   * Generates dictionary value for the column data
   * @param data
   * @return dictionary value
   */
  int generateDictionaryValue(Object data);

  /**
   * Returns the actual value associated with dictionary value.
   * @param dictionary
   * @return actual value.
   */
  Object getValueFromDictionary(int dictionary);

  /**
   * Returns the maximum value among the dictionary values. It is used for
   * generating mdk key.
   * @return max dictionary value.
   */
  int getMaxDictionaryValue();

}

This ColumnDictionaryGenerator interface can have 3 implementations:
1. PreGeneratedColumnDictionaryGenerator
2. GlobalColumnDictionaryGenerator
3. LocalColumnDictionaryGenerator

[image: Inline images 3]

*PreGeneratedColumnDictionaryGenerator* : It gets the dictionary values
from an already generated and loaded dictionary.
*GlobalColumnDictionaryGenerator* : It generates the global dictionary
online by using a KV store or a distributed map.
*LocalColumnDictionaryGenerator* : It generates a local dictionary that is
valid only for that executor.
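
As an illustration of the local case, here is a minimal in-memory sketch.
The interface is repeated only so the example compiles on its own. The
class name LocalColumnDictionaryGeneratorSketch, the choice to start keys
at 1, and the use of synchronized methods are assumptions for the example,
not what the PR does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Copy of the interface proposed in this mail, without the Javadoc.
interface ColumnDictionaryGenerator {
  int generateDictionaryValue(Object data);
  Object getValueFromDictionary(int dictionary);
  int getMaxDictionaryValue();
}

// Hypothetical local generator: the mapping lives only in this JVM (executor).
class LocalColumnDictionaryGeneratorSketch implements ColumnDictionaryGenerator {
  private final Map<Object, Integer> valueToKey = new HashMap<>();
  private final List<Object> keyToValue = new ArrayList<>();

  // Returns the existing key for a known value, otherwise assigns the next key.
  public synchronized int generateDictionaryValue(Object data) {
    Integer key = valueToKey.get(data);
    if (key == null) {
      keyToValue.add(data);
      key = keyToValue.size(); // keys start at 1 (assumption)
      valueToKey.put(data, key);
    }
    return key;
  }

  // Reverse lookup from dictionary key to the original value.
  public synchronized Object getValueFromDictionary(int dictionary) {
    if (dictionary < 1 || dictionary > keyToValue.size()) {
      throw new IllegalArgumentException("Unknown dictionary value: " + dictionary);
    }
    return keyToValue.get(dictionary - 1);
  }

  // Keys are assigned densely, so the largest key equals the number of entries.
  public synchronized int getMaxDictionaryValue() {
    return keyToValue.size();
  }
}
```

A pre-generated implementation could keep the same two lookups but load
them up front and reject unknown values, while a global implementation
would replace the HashMap with the shared KV store or distributed map.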


For more information on the loading, please check the PR
https://github.com/apache/incubator-carbondata/pull/215

Please let me know if any changes are required in these interfaces.

-- 
Thanks & Regards,
Ravi
