If we specify the order of the fields and the length of each field, then the start and end positions can be computed. Why do we need the end user to specify the start position for each field?

~ Yogi
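To illustrate, a minimal sketch (in Java) of deriving start and end positions from field order and widths alone; the field names and widths here are hypothetical:

import java.util.LinkedHashMap;
import java.util.Map;

public class OffsetDemo
{
  public static void main(String[] args)
  {
    // Ordered field -> width in characters (hypothetical example fields).
    Map<String, Integer> widths = new LinkedHashMap<>();
    widths.put("adId", 8);
    widths.put("adName", 20);
    widths.put("bidPrice", 6);

    int start = 0;
    for (Map.Entry<String, Integer> e : widths.entrySet()) {
      int end = start + e.getValue();  // exclusive end position
      System.out.println(e.getKey() + ": start=" + start + ", end=" + end);
      start = end;                     // next field begins where this one ends
    }
  }
}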
On 8 September 2016 at 12:48, Chinmay Kolhatkar <chin...@datatorrent.com>
wrote:

> A few points/questions:
>
> 1. Agree with Yogi. Approach 2 does not look clean.
> 2. Do we need "recordwidthlength"?
> 3. "recordseparator" should be "\n" and not "/n".
> 4. In general, providing the schema as JSON is tedious from the user's
> perspective. I suggest we find a simpler format for specifying the schema,
> e.g. <name>,<type>,<startPointer>,<fieldLength>
> 5. I suggest we first provide a basic parser to Malhar which does only
> parsing and type checking. Constraints, IMO, are not part of the parsing
> module, or if needed can be added as a phase 2 improvement of this parser.
> 6. I would suggest using an existing library for parsing. There is no
> point in reinventing the wheel, and trying to make something robust can be
> time consuming.
>
> -Chinmay.
>
>
> On Wed, Sep 7, 2016 at 4:33 PM, Yogi Devendra <
> devendra.vyavah...@gmail.com>
> wrote:
>
> > Approach 2 does not look like a clean solution.
> >
> > -1 for Approach 2.
> >
> > ~ Yogi
> >
> > On 7 September 2016 at 15:25, Hitesh Kapoor <hit...@datatorrent.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > An operator for parsing fixed-width records has to be implemented.
> > > This operator shall be used to parse fixed-width byte arrays/tuples
> > > based on a JSON schema and emit the parsed byte array on one port,
> > > the converted POJO object on another port, and the failed byte
> > > arrays/tuples on an error port.
> > >
> > > The user will provide a JSON schema definition based on the format
> > > below:
> > >
> > > {
> > >   "recordwidthlength": "Integer",
> > >   "recordseparator": "/n", // blank if there is no record separator;
> > >                            // default is a newline character
> > >   "fields": [
> > >     {
> > >       "name": "<Name of the Field>",
> > >       "type": "<Data Type of Field>",
> > >       "startCharNum": "<Integer - Starting Character Position>",
> > >       "endCharNum": "<Integer - End Character Position>",
> > >       "constraints": {
> > >       }
> > >     },
> > >     {
> > >       "name": "adName",
> > >       "type": "String",
> > >       "startCharNum": "Integer",
> > >       "endCharNum": "Integer",
> > >       "constraints": {
> > >         "required": "true",
> > >         "pattern": "[a-z].*[a-z]$"
> > >       }
> > >     }
> > >   ]
> > > }
> > >
> > > Below are the options to implement this operator:
> > >
> > > 1) Write a new custom library for parsing fixed-width records, as
> > > existing libraries (e.g. flatworm, jffp, etc.) do not have a
> > > mechanism for constraint checking.
> > > The challenge in this approach is writing a robust library from
> > > scratch that handles all our requirements.
> > >
> > > 2) Extend our already written CsvParser to handle fixed-width
> > > records. In this approach we would have to add a delimiter character
> > > after every field in the incoming record.
> > > The challenges here are selecting a delimiter character and escaping
> > > it wherever it appears in the stream.
> > > This approach increases the memory overhead (extra characters are
> > > inserted as delimiters) but is comparatively easier to maintain and
> > > operate.
> > >
> > > Please let me know your thoughts and votes on the above approaches.
> > >
> > > Regards,
> > > Hitesh
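For reference, a minimal sketch of the core of Approach 1: slicing a fixed-width record by start/end character positions. The FieldSpec class, field names, and positions below are hypothetical illustrations; type conversion and constraint checking are omitted.

import java.util.Arrays;
import java.util.List;

public class FixedWidthSliceDemo
{
  // Hypothetical field spec: a name plus [start, end) character positions.
  static class FieldSpec
  {
    final String name;
    final int start;
    final int end;

    FieldSpec(String name, int start, int end)
    {
      this.name = name;
      this.start = start;
      this.end = end;
    }
  }

  public static void main(String[] args)
  {
    List<FieldSpec> fields = Arrays.asList(
        new FieldSpec("adId", 0, 8),
        new FieldSpec("adName", 8, 28));

    String record = "00001234SummerSaleCampaign  ";
    for (FieldSpec f : fields) {
      if (f.end > record.length()) {
        // In the operator, such a record would go to the error port.
        System.out.println("record too short for field " + f.name);
        break;
      }
      String raw = record.substring(f.start, f.end).trim();
      System.out.println(f.name + " = " + raw);
    }
  }
}

The start/end positions here could equally be derived from field order and widths, per the earlier sketch.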