Hi All,

Thank you for your feedback.
Based on the votes/comments, I will not be going ahead with Approach 2, as
it is not clean.

For Approach 1, I have looked at the possibility of using existing parsing
libraries like flatworm, flatpack and univocity.
The following are the problems with using the existing libraries:
1) These libraries take the input schema in a specific format and are
complicated to use.
For example, the most popular library (as per Stack Overflow), flatworm,
requires the input schema in XML format (refer
http://flatworm.sourceforge.net/), so we would lose our consistency with
existing parsers like CsvParser, where we take the input in JSON format.
Besides the loss of consistency, it will be more difficult for the user to
provide the input in flatworm-specific XML.
If we decide to convert our JSON to flatworm-specific XML, it will involve
a lot more work than writing our own library.
2) They do only limited type checking. For example, for a Date field
expected to adhere to dd/mm/yyyy, the input 12/13/2000 may still parse
successfully even though the month value 13 is invalid (see the
date-parsing snippet after this list).
3) Boolean and Date data types are difficult to handle.
4) Future scalability may take a hit. For example, if we want to add more
constraints to our parser, such as a minimum value for an integer or a
pattern for a string, it won't be possible with the existing libraries.
5) Retrieving the values to create a POJO is not user (coder) friendly.
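
To make point 2 concrete, below is a small standalone snippet (the class
name and patterns are only illustrative) showing how a lenient dd/MM/yyyy
parse silently accepts 12/13/2000, while a strict java.time parse rejects it:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.ResolverStyle;

public class LenientDateDemo
{
  public static void main(String[] args) throws ParseException
  {
    // SimpleDateFormat is lenient by default: month 13 is rolled over
    // into the next year instead of being rejected.
    SimpleDateFormat lenient = new SimpleDateFormat("dd/MM/yyyy");
    System.out.println(lenient.parse("12/13/2000"));  // prints a date in Jan 2001

    // A strict java.time parse rejects the same input.
    DateTimeFormatter strict = DateTimeFormatter.ofPattern("dd/MM/uuuu")
        .withResolverStyle(ResolverStyle.STRICT);
    LocalDate.parse("12/13/2000", strict);  // throws DateTimeParseException
  }
}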

In my opinion, we should write our own library to do the parsing and
validation, since using an existing library would actually involve more work.
The work involved in coding the library is straightforward; a minimal
sketch of the core idea is included at the end of this mail.
It will be easier for us to scale, and it also makes it easy for the end
user to provide the input schema.
The reason we are not going ahead with Approach 2 is that it is not clean;
the twisting and turning involved in forcing existing libraries to fit
looks messier to me.
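
As a rough illustration (the class and method names below are hypothetical,
not the final operator API), the core of such a library is just slicing each
record by the startCharNum/endCharNum positions from the JSON schema and
applying the constraints:

import java.util.regex.Pattern;

// Hypothetical sketch of one schema field; not the final operator API.
public class FixedWidthField
{
  private final String name;
  private final int startCharNum;  // 1-based, inclusive
  private final int endCharNum;    // 1-based, inclusive
  private final Pattern pattern;   // optional "pattern" constraint, may be null

  public FixedWidthField(String name, int startCharNum, int endCharNum, String regex)
  {
    this.name = name;
    this.startCharNum = startCharNum;
    this.endCharNum = endCharNum;
    this.pattern = (regex == null) ? null : Pattern.compile(regex);
  }

  // Slice this field out of a fixed width record and validate it against
  // the pattern constraint, if one was configured.
  public String extractAndValidate(String record)
  {
    String value = record.substring(startCharNum - 1, endCharNum).trim();
    if (pattern != null && !pattern.matcher(value).matches()) {
      throw new IllegalArgumentException("Field '" + name + "' failed pattern check: " + value);
    }
    return value;
  }
}

Type conversion (Integer, Date, Boolean) and the remaining constraints would
sit on top of the same extraction step, which is why I feel the scope is
manageable.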

Regards,
Hitesh



On Thu, Sep 8, 2016 at 1:37 PM, Yogi Devendra <devendra.vyavah...@gmail.com>
wrote:

> If we specify the order of the fields and the length of each field, then the
> start and end can be computed.
> Why do we need the end user to specify the start position for each field?
>
> ~ Yogi
>
> On 8 September 2016 at 12:48, Chinmay Kolhatkar <chin...@datatorrent.com>
> wrote:
>
> > Few points/questions:
> > 1. Agree with Yogi. Approach 2 does not look clean.
> > 2. Do we need "recordwidthlength"?
> > 3. "recordseparator" should be "\n" and not "/n".
> > 4. In general, providing the schema as JSON is tedious from a user
> > perspective. I suggest we find a simpler format for specifying the schema,
> > e.g. <name>,<type>,<startPointer>,<fieldLength>
> > 5. I suggest we first provide a basic parser to Malhar which does only
> > parsing and type checking. Constraints, IMO, are not part of the parsing
> > module, or if needed can be added as a phase 2 improvement of this parser.
> > 6. I would suggest using some existing library for parsing. There is no
> > point in reinventing the wheel, and trying to make something robust can be
> > time consuming.
> >
> > -Chinmay.
> >
> >
> > On Wed, Sep 7, 2016 at 4:33 PM, Yogi Devendra <
> > devendra.vyavah...@gmail.com>
> > wrote:
> >
> > > Approach 2 does not look like a clean solution.
> > >
> > > -1 for Approach 2.
> > >
> > > ~ Yogi
> > >
> > > On 7 September 2016 at 15:25, Hitesh Kapoor <hit...@datatorrent.com>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > An operator for parsing fixed width records has to be implemented.
> > > > This operator shall be used to parse fixed width byte arrays/tuples based
> > > > on a JSON schema and emit the parsed byte array on one port, the converted
> > > > POJO object on another port, and the failed byte arrays/tuples on an error
> > > > port.
> > > >
> > > >
> > > > The user will provide a JSON schema definition based on the schema
> > > > definition mentioned below.
> > > >
> > > > {
> > > >   "recordwidthlength": "Integer",
> > > >   "recordseparator": "/n",  // this would be blank if there is no record separator, default - a newline character
> > > >   "fields": [
> > > >     {
> > > >       "name": "<Name of the Field>",
> > > >       "type": "<Data Type of Field>",
> > > >       "startCharNum": "<Integer - Starting Character Position>",
> > > >       "endCharNum": "<Integer - End Character Position>",
> > > >       "constraints": {
> > > >       }
> > > >     },
> > > >     {
> > > >       "name": "adName",
> > > >       "type": "String",
> > > >       "startCharNum": "Integer",
> > > >       "endCharNum": "Integer",
> > > >       "constraints": {
> > > >         "required": "true",
> > > >         "pattern": "[a-z].*[a-z]$"
> > > >       }
> > > >     }
> > > >   ]
> > > > }
> > > >
> > > >
> > > > Below are the options to implement this operator.
> > > >
> > > > 1) Write a new custom library for parsing fixed width records, as
> > > > existing libraries for the same (e.g. flatworm, jffp, etc.) do not have
> > > > a mechanism for constraint checking.
> > > > The challenge in this approach will be to write a robust library from
> > > > scratch to handle all our requirements.
> > > >
> > > > 2) Extend our already written CsvParser to handle fixed width records.
> > > > In this approach we will have to add a delimiter character after every
> > > > field in the incoming tuple.
> > > > The challenge in this approach would be to select a delimiter character,
> > > > and if that character appears in the stream we will have to escape it.
> > > > This approach will increase the memory overhead (as extra characters are
> > > > inserted as delimiters) but will be comparatively easier to maintain and
> > > > operate.
> > > >
> > > > Please let me know your thoughts and votes on the above approaches.
> > > >
> > > > Regards,
> > > > Hitesh
> > > >
> > >
> >
>
