Hi Ravi,

We can move the mdkey generation step before sorting. This will compress the dictionary data and reduce the IO.

-Regards
Kumar Vishal

On Sat, Oct 8, 2016 at 3:30 PM, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
> Hi All,
>
> Removing Kettle from CarbonData is necessary, as this legacy Kettle
> framework has become an overhead for CarbonData. This discussion is about
> the design of carbon data loading without Kettle.
>
> The main interface for data loading here is DataLoadProcessorStep.
>
> /**
>  * This is the base interface for data loading. It can do transformation
>  * jobs as per the implementation.
>  */
> public interface DataLoadProcessorStep {
>
>   /**
>    * The output meta for this step. The data returned from this step
>    * conforms to this meta.
>    * @return
>    */
>   DataField[] getOutput();
>
>   /**
>    * Initialization process for this step.
>    * @param configuration
>    * @param child
>    * @throws CarbonDataLoadingException
>    */
>   void intialize(CarbonDataLoadConfiguration configuration,
>       DataLoadProcessorStep child) throws CarbonDataLoadingException;
>
>   /**
>    * Transform the data as per the implementation.
>    * @return Iterator of data
>    * @throws CarbonDataLoadingException
>    */
>   Iterator<Object[]> execute() throws CarbonDataLoadingException;
>
>   /**
>    * Any closing of resources after step execution can be done here.
>    */
>   void finish();
> }
>
> The implementation classes for DataLoadProcessorStep are
> InputProcessorStep, EncoderProcessorStep, SortProcessorStep and
> DataWriterProcessorStep.
>
> The following picture depicts the loading process with the implementation
> classes.
>
> [image: Inline images 2]
>
> *InputProcessorStep*: It does two jobs: 1. It reads data from the
> RecordReader of the InputFormat. 2. It parses each field of a column as
> per its data type.
> *EncoderProcessorStep*: It encodes each field with a dictionary if
> required, and combines all no-dictionary columns into a single byte array.
> *SortProcessorStep*: It sorts the data on the dimension columns and
> writes it to intermediate files.
> *DataWriterProcessorStep*: It merge-sorts the data from the intermediate
> temp files, generates the mdkey, and writes the data in carbondata format
> to the store.
>
> The following is the interface for dictionary generation.
>
> /**
>  * Generates the dictionary for a column. The implementation classes can
>  * be pre-defined, local or global dictionary generators.
>  */
> public interface ColumnDictionaryGenerator {
>
>   /**
>    * Generates the dictionary value for the column data.
>    * @param data
>    * @return dictionary value
>    */
>   int generateDictionaryValue(Object data);
>
>   /**
>    * Returns the actual value associated with the dictionary value.
>    * @param dictionary
>    * @return actual value.
>    */
>   Object getValueFromDictionary(int dictionary);
>
>   /**
>    * Returns the maximum value among the dictionary values. It is used
>    * for generating the mdkey.
>    * @return max dictionary value.
>    */
>   int getMaxDictionaryValue();
>
> }
>
> This ColumnDictionaryGenerator interface can have 3 implementations:
> 1. PreGeneratedColumnDictionaryGenerator
> 2. GlobalColumnDictionaryGenerator
> 3. LocalColumnDictionaryGenerator
>
> [image: Inline images 3]
>
> *PreGeneratedColumnDictionaryGenerator*: It gets the dictionary values
> from an already generated and loaded dictionary.
> *GlobalColumnDictionaryGenerator*: It generates the global dictionary
> online by using a KV store or a distributed map.
> *LocalColumnDictionaryGenerator*: It generates a local dictionary only
> for that executor.
>
> For more information on the loading, please check the PR
> https://github.com/apache/incubator-carbondata/pull/215
>
> Please let me know if any changes are required in these interfaces.
>
> --
> Thanks & Regards,
> Ravi
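
To make the LocalColumnDictionaryGenerator idea above a bit more concrete, here is a minimal sketch. It is not the code from the PR. It assumes dictionary values start at 1 and are handed out in arrival order, it keeps everything in an in-memory map on one executor, and it is not thread-safe. The bitsFor helper is hypothetical; it only illustrates how getMaxDictionaryValue could decide how many bits a column takes in the mdkey.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// The interface as proposed in the mail above (Javadoc omitted).
interface ColumnDictionaryGenerator {
  int generateDictionaryValue(Object data);
  Object getValueFromDictionary(int dictionary);
  int getMaxDictionaryValue();
}

// Hypothetical local (per-executor) implementation, for illustration only.
class LocalColumnDictionaryGenerator implements ColumnDictionaryGenerator {
  // Forward map: actual value -> dictionary value.
  private final Map<Object, Integer> valueToKey = new HashMap<>();
  // Reverse lookup: index (dictionary value - 1) -> actual value.
  private final List<Object> keyToValue = new ArrayList<>();

  @Override
  public int generateDictionaryValue(Object data) {
    Integer existing = valueToKey.get(data);
    if (existing != null) {
      return existing; // value seen before, reuse its dictionary value
    }
    keyToValue.add(data);
    int key = keyToValue.size(); // assumption: dictionary values start at 1
    valueToKey.put(data, key);
    return key;
  }

  @Override
  public Object getValueFromDictionary(int dictionary) {
    if (dictionary < 1 || dictionary > keyToValue.size()) {
      return null; // unknown dictionary value
    }
    return keyToValue.get(dictionary - 1);
  }

  @Override
  public int getMaxDictionaryValue() {
    // Values are assigned densely from 1, so the max equals the count.
    return keyToValue.size();
  }

  // Hypothetical helper: bits needed to hold values 0..maxDictionaryValue.
  static int bitsFor(int maxDictionaryValue) {
    return Math.max(1, 32 - Integer.numberOfLeadingZeros(maxDictionaryValue));
  }
}
```

With this sketch, a column with three distinct values gets dictionary values 1 to 3 and needs only 2 bits in the key, which is the kind of saving Kumar's suggestion relies on if the mdkey is generated before the sort step writes its intermediate files.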