If we specify the order of the fields and the length of each field, then
the start and end positions can be computed.
Why do we need the end user to specify the start position for each field?
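The computation suggested above can be sketched as follows; the helper class
and the example field names/lengths are illustrative, not part of the proposed
schema:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: derive each field's start/end offsets from an ordered list of
// field lengths, so the user never has to supply start positions.
class FieldOffsets {
    // Returns {fieldName -> int[]{start, end}} with 0-based,
    // end-exclusive offsets.
    public static Map<String, int[]> compute(LinkedHashMap<String, Integer> lengths) {
        Map<String, int[]> offsets = new LinkedHashMap<>();
        int start = 0;
        for (Map.Entry<String, Integer> e : lengths.entrySet()) {
            int end = start + e.getValue();
            offsets.put(e.getKey(), new int[]{start, end});
            start = end; // the next field begins where this one ends
        }
        return offsets;
    }
}
```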

~ Yogi

On 8 September 2016 at 12:48, Chinmay Kolhatkar <chin...@datatorrent.com>
wrote:

> Few points/questions:
> 1. Agree with Yogi. Approach 2 does not look clean.
> 2. Do we need "recordwidthlength"?
> 3. "recordseparator" should be "\n" and not "/n".
> 4. In general, providing the schema as JSON is tedious from the user's
> perspective. I suggest we find a simpler format for specifying the schema,
> e.g. one line per field:
> <name>,<type>,<startPointer>,<fieldLength>
> 5. I suggest we first contribute a basic parser to Malhar which does only
> parsing and type checking. Constraints, IMO, are not part of the parsing
> module, or if needed they can be added as a phase-2 improvement of this
> parser.
> 6. I would suggest using an existing library for parsing. There is no
> point in reinventing the wheel, and trying to make something robust can be
> time-consuming.
>
> -Chinmay.
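The compact per-field format from point 4 could be parsed in a few lines; the
FieldSpec class below is hypothetical, purely for illustration:

```java
// Sketch: parse one schema line of the form
// <name>,<type>,<startPointer>,<fieldLength>.
// FieldSpec is a hypothetical holder class, not an existing Malhar type.
class SchemaLineParser {
    static class FieldSpec {
        final String name;
        final String type;
        final int start;   // start position of the field
        final int length;  // number of characters in the field
        FieldSpec(String name, String type, int start, int length) {
            this.name = name;
            this.type = type;
            this.start = start;
            this.length = length;
        }
    }

    public static FieldSpec parse(String line) {
        String[] parts = line.split(",");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 values, got: " + line);
        }
        return new FieldSpec(parts[0].trim(), parts[1].trim(),
                Integer.parseInt(parts[2].trim()),
                Integer.parseInt(parts[3].trim()));
    }
}
```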
>
>
> On Wed, Sep 7, 2016 at 4:33 PM, Yogi Devendra <
> devendra.vyavah...@gmail.com>
> wrote:
>
> > Approach 2 does not look like a clean solution.
> >
> > -1 for Approach 2.
> >
> > ~ Yogi
> >
> > On 7 September 2016 at 15:25, Hitesh Kapoor <hit...@datatorrent.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > An operator for parsing fixed-width records has to be implemented.
> > > This operator shall parse fixed-width byte arrays/tuples based on a
> > > JSON schema and emit the parsed byte array on one port, the converted
> > > POJO on another port, and the failed byte arrays/tuples on an error
> > > port.
> > >
> > >
> > > The user will provide a JSON schema based on the definition below.
> > >
> > > {
> > >   "recordwidthlength": "Integer",
> > >   "recordseparator": "/n", // blank if there is no record separator;
> > >                            // default is a newline character
> > >   "fields": [
> > >     {
> > >       "name": "<Name of the Field>",
> > >       "type": "<Data Type of Field>",
> > >       "startCharNum": "<Integer - Starting Character Position>",
> > >       "endCharNum": "<Integer - End Character Position>",
> > >       "constraints": {
> > >       }
> > >     },
> > >     {
> > >       "name": "adName",
> > >       "type": "String",
> > >       "startCharNum": "Integer",
> > >       "endCharNum": "Integer",
> > >       "constraints": {
> > >         "required": "true",
> > >         "pattern": "[a-z].*[a-z]$"
> > >       }
> > >     }
> > >   ]
> > > }
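As a rough sketch of the field-extraction step implied by the schema above, a
record can be sliced using startCharNum/endCharNum. The schema does not pin
down whether these positions are 1-based and inclusive, so that is an
assumption here:

```java
// Sketch: extract one field from a fixed-width record using the schema's
// startCharNum/endCharNum. Assumes 1-based, inclusive positions, which the
// schema above does not actually specify.
class FixedWidthSlicer {
    public static String slice(String record, int startCharNum, int endCharNum) {
        // Convert to 0-based, end-exclusive indices for String.substring.
        return record.substring(startCharNum - 1, endCharNum);
    }
}
```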
> > >
> > >
> > > Below are the options to implement this operator.
> > >
> > > 1) Write a new custom library for parsing fixed-width records, as the
> > > existing libraries for this (e.g. flatworm, jFFP, etc.) do not have a
> > > mechanism for constraint checking.
> > > The challenge in this approach will be writing a robust library from
> > > scratch that handles all our requirements.
> > >
> > > 2) Extend our already written CsvParser to handle fixed-width records.
> > > In this approach we will have to add a delimiter character after every
> > > field in the incoming tuple.
> > > The challenges in this approach would be selecting a delimiter
> > > character, and then escaping that character wherever it appears in the
> > > stream.
> > > This approach will increase the memory overhead (as extra characters
> > > are inserted as delimiters) but will be comparatively easier to
> > > maintain and operate.
> > >
> > > Please let me know your thoughts and votes on the above approaches.
> > >
> > > Regards,
> > > Hitesh
> > >
> >
>
