Hi Jacky,

1. Yes. It is better to keep all sorting logic in one step so that other types of sort can be implemented easily. I will update the design.

2. EncoderProcessorStep does dictionary encoding and converts no-dictionary and complex types to their byte[] representation. The encoding interface here is flexible, so the user can supply a different encoding representation, but only at row level. RLE, DELTA and heavy compression are done in DataWriterProcessorStep only, because these encodings/compressions happen at blocklet level, not row level.

3. Yes, each step requires a schema definition. It is passed as DataField[] through configuration to the initial step, InputProcessorStep. The remaining steps can call child.getOutput() to get the schema. Each DataField represents one column.

Regards,
Ravi

On 12 October 2016 at 09:38, Jacky Li <jacky.li...@qq.com> wrote:

> Hi Ravindra,
>
> Regarding the design
> (https://drive.google.com/file/d/0B4TWTVbFSTnqTF85anlDOUQ5S1BqYzFpLWcwZnBLSVVqSWpj/view),
> I have the following questions:
>
> 1. In SortProcessorStep, I think it is better to include MergeSort in this
> step also, so it includes all logic for sorting. In this case, a developer
> can implement an external sort (spill to files only if necessary), so the
> loading process is an on-line sort if memory is sufficient. I think it
> will improve loading performance a lot.
>
> 2. In EncoderProcessorStep, apart from the dictionary encoding, what other
> processing will it do? How about delta, RLE, etc.?
>
> 3. In InputProcessorStep, it needs some schema definition to parse the
> input and convert it to rows, right? For example, how to read from a JSON
> or AVRO file?
>
> Regards,
> Jacky

--
Thanks & Regards,
Ravi
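
To make points 2 and 3 more concrete, here is a minimal Java sketch of how the schema could flow from InputProcessorStep to a later step through child.getOutput(), and how row-level encoding might look. Only the step names, DataField[] and child.getOutput() come from this thread. Everything else is an assumption made for illustration and is not the actual CarbonData code: the ProcessorStep interface, the execute() method, the dictionary flag on DataField, the comma-separated input, and the simple in-memory dictionary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PipelineSketch {

    /** One column of the schema (hypothetical, heavily simplified). */
    static final class DataField {
        final String name;
        final boolean dictionary; // assumed flag: true = dictionary-encode this column

        DataField(String name, boolean dictionary) {
            this.name = name;
            this.dictionary = dictionary;
        }
    }

    /** Assumed common contract: a step exposes its output schema and produces rows. */
    interface ProcessorStep {
        DataField[] getOutput();
        List<Object[]> execute();
    }

    /** Initial step: gets the schema from configuration and parses raw lines into rows. */
    static final class InputProcessorStep implements ProcessorStep {
        private final DataField[] schema;
        private final List<String> lines;

        InputProcessorStep(DataField[] schema, List<String> lines) {
            this.schema = schema;
            this.lines = lines;
        }

        public DataField[] getOutput() {
            return schema;
        }

        public List<Object[]> execute() {
            List<Object[]> rows = new ArrayList<>();
            for (String line : lines) {
                rows.add(line.split(",", -1)); // one String per column
            }
            return rows;
        }
    }

    /** Row-level encoding only: dictionary keys or byte[]; no RLE/DELTA/compression here. */
    static final class EncoderProcessorStep implements ProcessorStep {
        private final ProcessorStep child;
        private final List<Map<String, Integer>> dictionaries = new ArrayList<>();

        EncoderProcessorStep(ProcessorStep child) {
            this.child = child;
            for (int i = 0; i < child.getOutput().length; i++) {
                dictionaries.add(new HashMap<>());
            }
        }

        public DataField[] getOutput() {
            return child.getOutput(); // schema is obtained from the child step
        }

        public List<Object[]> execute() {
            DataField[] schema = child.getOutput();
            List<Object[]> encoded = new ArrayList<>();
            for (Object[] row : child.execute()) {
                Object[] out = new Object[schema.length];
                for (int i = 0; i < schema.length; i++) {
                    String value = (String) row[i];
                    if (schema[i].dictionary) {
                        // Assign surrogate keys 1, 2, 3, ... in order of first appearance.
                        Map<String, Integer> dict = dictionaries.get(i);
                        Integer key = dict.get(value);
                        if (key == null) {
                            key = dict.size() + 1;
                            dict.put(value, key);
                        }
                        out[i] = key;
                    } else {
                        out[i] = value.getBytes(StandardCharsets.UTF_8);
                    }
                }
                encoded.add(out);
            }
            return encoded;
        }
    }

    public static void main(String[] args) {
        DataField[] schema = {
            new DataField("country", true),
            new DataField("name", false)
        };
        ProcessorStep input =
            new InputProcessorStep(schema, Arrays.asList("IN,ravi", "CN,jacky", "IN,li"));
        ProcessorStep encoder = new EncoderProcessorStep(input);
        for (Object[] row : encoder.execute()) {
            System.out.println(row[0] + " | " + ((byte[]) row[1]).length + " bytes");
        }
    }
}
```

In this sketch the encoder never sees the configuration directly; it only asks its child for the schema, which is the chaining described in point 3. Blocklet-level encodings such as RLE and DELTA are deliberately left out, since they would belong to the writer step.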