If we specify the order of the fields and the length of each field, then the start and end positions can be computed. Why do we need the end user to specify the start position for each field?

~ Yogi
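To illustrate, a minimal sketch (in Java) of deriving start and end positions from field order and widths alone; the field names and widths here are hypothetical:

import java.util.LinkedHashMap;
import java.util.Map;

public class OffsetDemo
{
  public static void main(String[] args)
  {
    // Ordered field -> width in characters (hypothetical example fields).
    Map<String, Integer> widths = new LinkedHashMap<>();
    widths.put("adId", 8);
    widths.put("adName", 20);
    widths.put("bidPrice", 6);

    int start = 0;
    for (Map.Entry<String, Integer> e : widths.entrySet()) {
      int end = start + e.getValue();  // exclusive end position
      System.out.println(e.getKey() + ": start=" + start + ", end=" + end);
      start = end;                     // next field begins where this one ends
    }
  }
}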
On 8 September 2016 at 12:48, Chinmay Kolhatkar <chin...@datatorrent.com>
wrote:

> A few points/questions:
>
> 1. Agree with Yogi. Approach 2 does not look clean.
> 2. Do we need "recordwidthlength"?
> 3. "recordseparator" should be "\n" and not "/n".
> 4. In general, providing the schema as JSON is tedious from the user's
> perspective. I suggest we find a simpler format for specifying the schema,
> e.g. <name>,<type>,<startPointer>,<fieldLength>
> 5. I suggest we first provide a basic parser to Malhar which does only
> parsing and type checking. Constraints, IMO, are not part of the parsing
> module, or if needed can be added as a phase 2 improvement of this parser.
> 6. I would suggest using an existing library for parsing. There is no
> point in reinventing the wheel, and trying to make something robust can be
> time consuming.
>
> -Chinmay.
>
>
> On Wed, Sep 7, 2016 at 4:33 PM, Yogi Devendra <
> devendra.vyavah...@gmail.com>
> wrote:
>
> > Approach 2 does not look like a clean solution.
> >
> > -1 for Approach 2.
> >
> > ~ Yogi
> >
> > On 7 September 2016 at 15:25, Hitesh Kapoor <hit...@datatorrent.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > An operator for parsing fixed-width records has to be implemented.
> > > This operator shall be used to parse fixed-width byte arrays/tuples
> > > based on a JSON schema and emit the parsed byte array on one port,
> > > the converted POJO object on another port, and the failed byte
> > > arrays/tuples on an error port.
> > >
> > > The user will provide a JSON schema definition based on the format
> > > below:
> > >
> > > {
> > >   "recordwidthlength": "Integer",
> > >   "recordseparator": "/n", // blank if there is no record separator;
> > >                            // default is a newline character
> > >   "fields": [
> > >     {
> > >       "name": "<Name of the Field>",
> > >       "type": "<Data Type of Field>",
> > >       "startCharNum": "<Integer - Starting Character Position>",
> > >       "endCharNum": "<Integer - End Character Position>",
> > >       "constraints": {
> > >       }
> > >     },
> > >     {
> > >       "name": "adName",
> > >       "type": "String",
> > >       "startCharNum": "Integer",
> > >       "endCharNum": "Integer",
> > >       "constraints": {
> > >         "required": "true",
> > >         "pattern": "[a-z].*[a-z]$"
> > >       }
> > >     }
> > >   ]
> > > }
> > >
> > > Below are the options to implement this operator:
> > >
> > > 1) Write a new custom library for parsing fixed-width records, as
> > > existing libraries (e.g. flatworm, jffp, etc.) do not have a
> > > mechanism for constraint checking.
> > > The challenge in this approach is writing a robust library from
> > > scratch that handles all our requirements.
> > >
> > > 2) Extend our already written CsvParser to handle fixed-width
> > > records. In this approach we would have to add a delimiter character
> > > after every field in the incoming record.
> > > The challenges here are selecting a delimiter character and escaping
> > > it wherever it appears in the stream.
> > > This approach increases the memory overhead (extra characters are
> > > inserted as delimiters) but is comparatively easier to maintain and
> > > operate.
> > >
> > > Please let me know your thoughts and votes on the above approaches.
> > >
> > > Regards,
> > > Hitesh
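For reference, a minimal sketch of the core of Approach 1: slicing a fixed-width record by start/end character positions. The FieldSpec class, field names, and positions below are hypothetical illustrations; type conversion and constraint checking are omitted.

import java.util.Arrays;
import java.util.List;

public class FixedWidthSliceDemo
{
  // Hypothetical field spec: a name plus [start, end) character positions.
  static class FieldSpec
  {
    final String name;
    final int start;
    final int end;

    FieldSpec(String name, int start, int end)
    {
      this.name = name;
      this.start = start;
      this.end = end;
    }
  }

  public static void main(String[] args)
  {
    List<FieldSpec> fields = Arrays.asList(
        new FieldSpec("adId", 0, 8),
        new FieldSpec("adName", 8, 28));

    String record = "00001234SummerSaleCampaign  ";
    for (FieldSpec f : fields) {
      if (f.end > record.length()) {
        // In the operator, such a record would go to the error port.
        System.out.println("record too short for field " + f.name);
        break;
      }
      String raw = record.substring(f.start, f.end).trim();
      System.out.println(f.name + " = " + raw);
    }
  }
}

The start/end positions here could equally be derived from field order and widths, per the earlier sketch.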