Hi All,
Removing Kettle from CarbonData is necessary because the legacy Kettle framework has become an overhead for CarbonData. This discussion is about the design of Carbon data loading without Kettle.

The main interface for data loading here is DataLoadProcessorStep.

    /**
     * This is the base interface for data loading. It can do transformation
     * jobs as per the implementation.
     */
    public interface DataLoadProcessorStep {

      /**
       * The output meta for this step. The data returned from this step is
       * as per this meta.
       * @return
       */
      DataField[] getOutput();

      /**
       * Initialization process for this step.
       * @param configuration
       * @param child
       * @throws CarbonDataLoadingException
       */
      void intialize(CarbonDataLoadConfiguration configuration, DataLoadProcessorStep child)
          throws CarbonDataLoadingException;

      /**
       * Transform the data as per the implementation.
       * @return Iterator of data
       * @throws CarbonDataLoadingException
       */
      Iterator<Object[]> execute() throws CarbonDataLoadingException;

      /**
       * Any closing of resources after step execution can be done here.
       */
      void finish();
    }

The implementation classes for DataLoadProcessorStep are InputProcessorStep, EncoderProcessorStep, SortProcessorStep and DataWriterProcessorStep.

The following picture depicts the loading process with the implementation classes.

[image: Inline images 2]

InputProcessorStep: It does two jobs.
  1. It reads data from the RecordReader of the InputFormat.
  2. It parses each field of a column as per its data type.

EncoderProcessorStep: It encodes each field with a dictionary where required, and combines all no-dictionary columns into a single byte array.

SortProcessorStep: It sorts the data on the dimension columns and writes it to intermediate files.

DataWriterProcessorStep: It merge-sorts the data from the intermediate temp files, generates the MDK key, and writes the data to the store in carbondata format.

The following interface is for dictionary generation.

    /**
     * Generates dictionary for the column. The implementation classes can be
     * pre-defined, local or global dictionary generations.
     */
    public interface ColumnDictionaryGenerator {

      /**
       * Generates dictionary value for the column data.
       * @param data
       * @return dictionary value
       */
      int generateDictionaryValue(Object data);

      /**
       * Returns the actual value associated with the dictionary value.
       * @param dictionary
       * @return actual value.
       */
      Object getValueFromDictionary(int dictionary);

      /**
       * Returns the maximum value among the dictionary values. It is used
       * for generating the MDK key.
       * @return max dictionary value.
       */
      int getMaxDictionaryValue();
    }

This ColumnDictionaryGenerator interface can have 3 implementations:
  1. PreGeneratedColumnDictionaryGenerator
  2. GlobalColumnDictionaryGenerator
  3. LocalColumnDictionaryGenerator

[image: Inline images 3]

PreGeneratedColumnDictionaryGenerator: It gets the dictionary values from an already generated and loaded dictionary.

GlobalColumnDictionaryGenerator: It generates the global dictionary online by using a KV store or a distributed map.

LocalColumnDictionaryGenerator: It generates a local dictionary that is valid only for that executor.

For more information on the loading, please check the PR:
https://github.com/apache/incubator-carbondata/pull/215

Please let me know if any changes are required in these interfaces.

--
Thanks & Regards,
Ravi
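
To make the ColumnDictionaryGenerator contract in this mail a bit more concrete, here is a minimal sketch of what a local, in-memory generator could look like. It is only an illustration and not the code from the PR: the class name LocalDictionarySketch is hypothetical, and so are the choices to hand out keys sequentially starting at 1 and to return null for an unknown key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Interface as proposed in this mail (Javadoc omitted for brevity).
interface ColumnDictionaryGenerator {
  int generateDictionaryValue(Object data);
  Object getValueFromDictionary(int dictionary);
  int getMaxDictionaryValue();
}

// Hypothetical local generator: keeps the dictionary in memory for one
// executor. Assumption: keys are assigned sequentially, starting at 1.
class LocalDictionarySketch implements ColumnDictionaryGenerator {
  private final Map<Object, Integer> valueToKey = new HashMap<>();
  private final List<Object> keyToValue = new ArrayList<>();

  @Override
  public synchronized int generateDictionaryValue(Object data) {
    Integer existing = valueToKey.get(data);
    if (existing != null) {
      return existing; // value seen before, reuse its key
    }
    keyToValue.add(data);
    int key = keyToValue.size(); // next sequential key
    valueToKey.put(data, key);
    return key;
  }

  @Override
  public synchronized Object getValueFromDictionary(int dictionary) {
    if (dictionary < 1 || dictionary > keyToValue.size()) {
      return null; // unknown key (assumed behaviour)
    }
    return keyToValue.get(dictionary - 1);
  }

  @Override
  public synchronized int getMaxDictionaryValue() {
    // With sequential keys, the max key equals the number of distinct values.
    return keyToValue.size();
  }

  public static void main(String[] args) {
    LocalDictionarySketch dict = new LocalDictionarySketch();
    for (String city : new String[] {"delhi", "pune", "delhi"}) {
      System.out.println(city + " -> " + dict.generateDictionaryValue(city));
    }
    System.out.println("max = " + dict.getMaxDictionaryValue());
  }
}
```

A pre-generated or global implementation would keep the same three methods and only change where the mapping is stored, which is the reason for hiding the dictionary behind this interface.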