Hi Ravi,
We can move the mdkey generation step before sorting. This will compress the
dictionary data and reduce the IO.
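
To illustrate why the rows become smaller, here is a minimal sketch of bit-packing dictionary values into one key. It is not CarbonData's actual key generator; the class MdKeySketch and its methods are hypothetical names, and it assumes dictionary values are non-negative ints bounded by a known maximum per column.

```java
/** Hypothetical sketch: packs per-column dictionary values into one compact key. */
final class MdKeySketch {
  private final int[] bitWidths;      // bits reserved for each column
  private final int keySizeInBytes;   // size of every generated key

  MdKeySketch(int[] maxDictionaryValues) {
    bitWidths = new int[maxDictionaryValues.length];
    int totalBits = 0;
    for (int i = 0; i < maxDictionaryValues.length; i++) {
      // Bits needed to represent 0..max, with a minimum of one bit.
      bitWidths[i] = Math.max(1, 32 - Integer.numberOfLeadingZeros(maxDictionaryValues[i]));
      totalBits += bitWidths[i];
    }
    keySizeInBytes = (totalBits + 7) / 8;
  }

  /** Packs the values most-significant-bit first, first column first. */
  byte[] generateKey(int[] dictionaryValues) {
    byte[] key = new byte[keySizeInBytes];
    int bitPos = 0; // next free bit, counted from the top bit of key[0]
    for (int col = 0; col < dictionaryValues.length; col++) {
      for (int b = bitWidths[col] - 1; b >= 0; b--) {
        if (((dictionaryValues[col] >>> b) & 1) == 1) {
          key[bitPos >>> 3] |= (byte) (0x80 >>> (bitPos & 7));
        }
        bitPos++;
      }
    }
    return key;
  }

  /** Unsigned byte-wise comparison of two keys of equal length. */
  static int compareKeys(byte[] a, byte[] b) {
    for (int i = 0; i < a.length; i++) {
      int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return 0;
  }

  public static void main(String[] args) {
    MdKeySketch generator = new MdKeySketch(new int[] {3, 100, 1000});
    byte[] key = generator.generateKey(new int[] {1, 5, 7});
    System.out.println("key bytes: " + key.length + ", raw int bytes: " + (3 * 4));
  }
}
```

With maximum values 3, 100 and 1000 the three columns need 2 + 7 + 10 = 19 bits, so each key is 3 bytes instead of 12 bytes of raw ints. Because the bits are packed most-significant first, comparing keys byte by byte gives the same order as comparing the columns one after another, so the sort step could work on the packed keys directly.
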
-Regards
Kumar Vishal

On Sat, Oct 8, 2016 at 3:30 PM, Ravindra Pesala <ravi.pes...@gmail.com>
wrote:

> Hi All,
>
>
> Removing Kettle from CarbonData is necessary because this legacy Kettle
> framework has become an overhead for CarbonData. This discussion is about the
> design of the carbon data load without Kettle.
>
> The main interface for data loading here is DataLoadProcessorStep.
>
> /**
>  * This is the base interface for data loading. It can do transformation
>  * jobs as per the implementation.
>  */
> public interface DataLoadProcessorStep {
>
>   /**
>    * The output meta for this step. The data returned from this step is as
>    * per this meta.
>    * @return
>    */
>   DataField[] getOutput();
>
>   /**
>    * Initialization process for this step.
>    * @param configuration
>    * @param child
>    * @throws CarbonDataLoadingException
>    */
>   void intialize(CarbonDataLoadConfiguration configuration,
>       DataLoadProcessorStep child) throws CarbonDataLoadingException;
>
>   /**
>    * Transform the data as per the implementation.
>    * @return Iterator of data
>    * @throws CarbonDataLoadingException
>    */
>   Iterator<Object[]> execute() throws CarbonDataLoadingException;
>
>   /**
>    * Any closing of resources after step execution can be done here.
>    */
>   void finish();
> }
>
> The implementation classes for DataLoadProcessorStep are
> InputProcessorStep, EncoderProcessorStep, SortProcessorStep and
> DataWriterProcessorStep.
>
> The following picture depicts the loading process with implementation
> classes.
>
> [image: Inline images 2]
>
> *InputProcessorStep*: It does two jobs: 1. It reads data from the
> RecordReader of the InputFormat. 2. It parses each field of a column as per
> its data type.
> *EncoderProcessorStep*: It encodes each field with a dictionary if
> required, and combines all no-dictionary columns into a single byte array.
> *SortProcessorStep*: It sorts the data on the dimension columns and writes
> it to intermediate files.
> *DataWriterProcessorStep*: It merge-sorts the data from the intermediate
> temp files, generates the mdkey, and writes the data in carbondata format
> to the store.
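
The chaining of steps described above can be sketched as follows. This is only an illustration under simplifying assumptions: StepSketch is a cut-down stand-in for DataLoadProcessorStep that models execute() alone, all class names are hypothetical, rows have one string dimension and one int measure, the sort is done in memory instead of through intermediate files, and the writer step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for DataLoadProcessorStep: only execute() is modeled. */
interface StepSketch {
  Iterator<Object[]> execute();
}

/** Parses raw string rows; a real input step would read from a RecordReader. */
final class InputStepSketch implements StepSketch {
  private final List<String[]> rawRows;

  InputStepSketch(List<String[]> rawRows) {
    this.rawRows = rawRows;
  }

  public Iterator<Object[]> execute() {
    List<Object[]> rows = new ArrayList<>();
    for (String[] raw : rawRows) {
      // Column 0 is a string dimension, column 1 an int measure (assumption).
      rows.add(new Object[] {raw[0], Integer.valueOf(raw[1])});
    }
    return rows.iterator();
  }
}

/** Replaces the dimension value with a dictionary int, row by row. */
final class EncoderStepSketch implements StepSketch {
  private final StepSketch child;
  private final Map<Object, Integer> dictionary = new HashMap<>();

  EncoderStepSketch(StepSketch child) {
    this.child = child;
  }

  public Iterator<Object[]> execute() {
    final Iterator<Object[]> input = child.execute();
    return new Iterator<Object[]>() {
      public boolean hasNext() {
        return input.hasNext();
      }

      public Object[] next() {
        Object[] row = input.next();
        Integer key = dictionary.get(row[0]);
        if (key == null) {
          key = dictionary.size() + 1; // keys assigned in arrival order, from 1
          dictionary.put(row[0], key);
        }
        row[0] = key;
        return row;
      }
    };
  }
}

/** Sorts all rows on the encoded dimension column (in memory for brevity). */
final class SortStepSketch implements StepSketch {
  private final StepSketch child;

  SortStepSketch(StepSketch child) {
    this.child = child;
  }

  public Iterator<Object[]> execute() {
    List<Object[]> rows = new ArrayList<>();
    child.execute().forEachRemaining(rows::add);
    rows.sort((a, b) -> Integer.compare((Integer) a[0], (Integer) b[0]));
    return rows.iterator();
  }
}

final class LoadPipelineSketch {
  /** Wires input -> encoder -> sort and drains the last step. */
  static List<Object[]> run(List<String[]> rawRows) {
    StepSketch pipeline =
        new SortStepSketch(new EncoderStepSketch(new InputStepSketch(rawRows)));
    List<Object[]> out = new ArrayList<>();
    pipeline.execute().forEachRemaining(out::add);
    return out;
  }

  public static void main(String[] args) {
    List<String[]> raw = Arrays.asList(
        new String[] {"b", "10"}, new String[] {"a", "20"}, new String[] {"b", "30"});
    for (Object[] row : run(raw)) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```

Each step pulls rows from its child, so only the sort step has to see the whole input. Note that the sort here is on the dictionary value, which in this sketch follows arrival order, not the lexical order of the strings.
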
>
>
>
> The following interface is for dictionary generation.
>
> /**
>  * Generates dictionary for the column. The implementation classes can be
>  * pre-defined or local or global dictionary generations.
>  */
> public interface ColumnDictionaryGenerator {
>
>   /**
>    * Generates dictionary value for the column data.
>    * @param data
>    * @return dictionary value
>    */
>   int generateDictionaryValue(Object data);
>
>   /**
>    * Returns the actual value associated with dictionary value.
>    * @param dictionary
>    * @return actual value.
>    */
>   Object getValueFromDictionary(int dictionary);
>
>   /**
>    * Returns the maximum value among the dictionary values. It is used
>    * for generating the mdkey.
>    * @return max dictionary value.
>    */
>   int getMaxDictionaryValue();
>
> }
>
> This ColumnDictionaryGenerator interface can have 3 implementations, 1.
> PreGeneratedColumnDictionaryGenerator 2. GlobalColumnDictionaryGenerator
> 3. LocalColumnDictionaryGenerator
>
> [image: Inline images 3]
>
> *PreGeneratedColumnDictionaryGenerator*: It gets the dictionary values
> from an already generated and loaded dictionary.
> *GlobalColumnDictionaryGenerator*: It generates the global dictionary online
> by using a KV store or a distributed map.
> *LocalColumnDictionaryGenerator*: It generates a local dictionary only for
> that executor.
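
As a rough sketch of the local variant, the interface could be implemented with two in-memory maps. The interface below is copied from the proposal. LocalDictionarySketch is a hypothetical name, and starting the keys at 1 and returning null for unknown keys are assumptions of this example, not part of the design.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Interface as proposed in this thread (Javadoc omitted). */
interface ColumnDictionaryGenerator {
  int generateDictionaryValue(Object data);

  Object getValueFromDictionary(int dictionary);

  int getMaxDictionaryValue();
}

/** Hypothetical local generator: the dictionary lives only in this JVM. */
final class LocalDictionarySketch implements ColumnDictionaryGenerator {
  private final Map<Object, Integer> valueToKey = new HashMap<>();
  private final List<Object> keyToValue = new ArrayList<>();

  /** Returns the existing key for a value, or assigns the next free one. */
  public synchronized int generateDictionaryValue(Object data) {
    Integer key = valueToKey.get(data);
    if (key == null) {
      keyToValue.add(data);
      key = keyToValue.size(); // keys start at 1 (assumption of this sketch)
      valueToKey.put(data, key);
    }
    return key;
  }

  /** Reverse lookup; returns null for a key that was never assigned. */
  public synchronized Object getValueFromDictionary(int dictionary) {
    if (dictionary < 1 || dictionary > keyToValue.size()) {
      return null;
    }
    return keyToValue.get(dictionary - 1);
  }

  /** Keys are dense, so the largest key equals the number of distinct values. */
  public synchronized int getMaxDictionaryValue() {
    return keyToValue.size();
  }
}

final class LocalDictionaryDemo {
  public static void main(String[] args) {
    ColumnDictionaryGenerator generator = new LocalDictionarySketch();
    for (String city : new String[] {"pune", "delhi", "pune"}) {
      System.out.println(city + " -> " + generator.generateDictionaryValue(city));
    }
    System.out.println("max = " + generator.getMaxDictionaryValue());
  }
}
```

The methods are synchronized so that several loading threads in one executor could share a generator. A global implementation would need the same contract backed by a shared store instead of local maps.
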
>
>
> For more information on the loading please check the PR
> https://github.com/apache/incubator-carbondata/pull/215
>
> Please let me know if any changes are required in these interfaces.
>
> --
> Thanks & Regards,
> Ravi
>
