A few points/questions:
1. Agree with Yogi. Approach 2 does not look clean.
2. Do we need "recordwidthlength"?
3. The value of "recordseparator" should be "\n", not "/n".
4. In general, providing the schema as JSON is tedious from the user's
perspective. I suggest we find a simpler format for specifying the schema,
e.g.:
<name>,<type>,<startPointer>,<fieldLength>
5. I suggest we first contribute a basic parser to Malhar that does only
parsing and type checking. Constraints, IMO, are not part of the parsing
module, or if needed they can be added as a phase 2 improvement of this parser.
6. I would suggest using an existing library for parsing. There is no
point in reinventing the wheel, and making something robust from scratch
can be time-consuming.

-Chinmay.


On Wed, Sep 7, 2016 at 4:33 PM, Yogi Devendra <devendra.vyavah...@gmail.com>
wrote:

> Approach 2 does not look like a clean solution.
>
> -1 for Approach 2.
>
> ~ Yogi
>
> On 7 September 2016 at 15:25, Hitesh Kapoor <hit...@datatorrent.com>
> wrote:
>
> > Hi All,
> >
> > An operator for parsing fixed-width records has to be implemented.
> > This operator will parse fixed-width byte arrays/tuples based on a JSON
> > schema and emit the parsed byte array on one port, the converted POJO
> > object on another port, and the failed byte arrays/tuples on an error port.
> >
> >
> > The user will provide a JSON schema definition in the format shown below.
> >
> > {
> >   "recordwidthlength": "Integer",
> >   "recordseparator": "/n",  // blank if there is no record separator; default is a newline character
> >   "fields": [
> >     {
> >       "name": "<Name of the Field>",
> >       "type": "<Data Type of Field>",
> >       "startCharNum": "<Integer - Starting Character Position>",
> >       "endCharNum": "<Integer - End Character Position>",
> >       "constraints": {
> >       }
> >     },
> >     {
> >       "name": "adName",
> >       "type": "String",
> >       "startCharNum": "Integer",
> >       "endCharNum": "Integer",
> >       "constraints": {
> >         "required": "true",
> >         "pattern": "[a-z].*[a-z]$"
> >       }
> >     }
> >   ]
> > }
> >
> >
> > Below are the options to implement this operator.
> >
> > 1) Write a new custom library for parsing fixed-width records, since the
> > existing libraries (e.g. Flatworm, jFFP, etc.) do not have a mechanism for
> > constraint checking.
> > The challenge in this approach is writing a robust library from scratch
> > that handles all our requirements.
> >
> > 2) Extend our existing CsvParser to handle fixed-width records. In this
> > approach we would insert a delimiter character after every field of the
> > incoming tuple.
> > The challenges here are selecting a delimiter character and escaping that
> > character wherever it appears in the stream.
> > This approach adds memory overhead (extra characters are inserted as
> > delimiters) but would be comparatively easier to maintain and operate.
> >
> > Please let me know your thoughts and votes on the above approaches.
> >
> > Regards,
> > Hitesh
> >
>
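For reference, the per-field extraction step described in the proposal quoted above could be sketched as follows in Java. This assumes `startCharNum`/`endCharNum` are 1-based and inclusive (the proposal does not specify), and the class and method names (`FixedWidthSketch`, `extract`, `convert`) are hypothetical:

```java
// Hedged sketch of slicing a fixed-width record using the
// "startCharNum"/"endCharNum" positions from the proposed JSON schema.
// Assumption: positions are 1-based and inclusive at both ends.
public class FixedWidthSketch {

  // Extract one raw field value from the record.
  static String extract(String record, int startCharNum, int endCharNum) {
    return record.substring(startCharNum - 1, endCharNum);
  }

  // Minimal type check/conversion for the schema's "type" attribute;
  // a NumberFormatException on a bad value is what the operator could
  // map to its error port.
  static Object convert(String raw, String type) {
    String trimmed = raw.trim();
    switch (type) {
      case "Integer":
        return Integer.valueOf(trimmed);
      case "Double":
        return Double.valueOf(trimmed);
      default:
        return trimmed; // treat anything else as a String
    }
  }
}
```

A constraint-checking phase (e.g. the "required" and "pattern" attributes) could then run over the converted values, which is why it separates cleanly from parsing as Chinmay suggests.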
