Re: Review Request: Latest take on CRUNCH-97, text parsing lib for Crunch

Josh Wills Sun, 02 Dec 2012 00:01:26 -0800

Hey Matthias,

Your use case makes a lot of sense to me. I'm still recovering from a
business trip last week, but I'll take a look at the latest patch tomorrow
morning.


Josh



On Fri, Nov 30, 2012 at 12:23 PM, Matthias Friedrich <[email protected]> wrote:

> Hi Josh,
>
> sorry for taking so long. I've checked the code and I'm confident that
> it works as intended. There are a few things about handling and
> reporting missing data that I'd like to look into, but what we have
> now is great stuff already.
>
> Based on my own use cases I'd like to discuss a slightly different
> solution. I'm not sure if my requirements are esoteric, but perhaps
> that's what others need, too.
>
> The data I work with usually comes in CSV format with many rows and
> several hundred attributes. I have a schema for the file so I can
> easily map column names to column numbers. When I process the file I
> am often interested in only a small subset of columns. I would want
> a parsing library to take a row of the file and give me a Tuple with
> only the columns I'm interested in. Since we don't have a NamedTuple
> abstraction, I don't want the Tuple to contain too much data that
> I don't need.
>
> This is possible with the current implementation, but the Scanner
> stuff looks a bit unwieldy when it comes to skipping data. Here's how
> I would like to specify the extraction process:
>
>   Parse.parse(data, tokenizer,
>         xtuple(xstring(0), xint(7), xboolean(3), xdouble(9)))
>
> The int argument specifies the column number to extract the data from.
> This approach would work best if we just take the input record and
> turn it into a sequence of tokens. We could offer alternative
> strategies for tokenizing, like regex for log parsing (we pull out the
> groups specified in the pattern) or simple splitting at a static
> or regex delimiter. The extractors get the sequence of tokens passed
> in and take whatever they need.
>
> I'm a bit busy right now but I'd help out with some code if you want.
> It would probably take a bit until I can make some time though.
>
> What do you think?
>
> Regards,
>   Matthias
>
> On Sunday, 2012-11-25, Josh Wills wrote:
> >
> > -----------------------------------------------------------
> > This is an automatically generated e-mail. To reply, visit:
> > https://reviews.apache.org/r/8151/
> > -----------------------------------------------------------
> >
> > (Updated Nov. 25, 2012, 7:21 p.m.)
> >
> >
> > Review request for crunch.
> >
> >
> > Changes
> > -------
> >
> > Incorporated feedback from Matthias and Gabriel; added a bunch of
> > javadoc.
> >
> >
> > Description
> > -------
> >
> > Latest and greatest rev of the extraction library for text parsing. I
> > ended up refactoring the approach so that we could support nested parsing
> > (e.g., using different Scanner instances for different parts of a line)
> > and collections of items on a single line.
> >
> >
> > This addresses bug CRUNCH-97.
> >     https://issues.apache.org/jira/browse/CRUNCH-97
> >
> >
> > Diffs (updated)
> > -----
> >
> >   crunch/src/main/java/org/apache/crunch/lib/PTables.java e788656
> >   crunch/src/main/java/org/apache/crunch/lib/text/AbstractCompositeExtractor.java PRE-CREATION
> >   crunch/src/main/java/org/apache/crunch/lib/text/AbstractSimpleExtractor.java PRE-CREATION
> >   crunch/src/main/java/org/apache/crunch/lib/text/Extractor.java PRE-CREATION
> >   crunch/src/main/java/org/apache/crunch/lib/text/ExtractorStats.java PRE-CREATION
> >   crunch/src/main/java/org/apache/crunch/lib/text/Extractors.java PRE-CREATION
> >   crunch/src/main/java/org/apache/crunch/lib/text/Parse.java PRE-CREATION
> >   crunch/src/main/java/org/apache/crunch/lib/text/ScannerFactory.java PRE-CREATION
> >   crunch/src/test/java/org/apache/crunch/lib/text/ParseTest.java PRE-CREATION
> >
> > Diff: https://reviews.apache.org/r/8151/diff/
> >
> >
> > Testing
> > -------
> >
> > Unit tests so far, still gathering feedback on the approach.
> >
> >
> > Thanks,
> >
> > Josh Wills
> >
>
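
For illustration, the column-indexed extraction Matthias describes above
(tokenize the record first, then let each extractor pick its token by
position) could look roughly like the sketch below. This is not the code
in the patch under review: the Tokenizer and Extractor interfaces, the
splitOn/groupsOf helpers, and the use of a plain List<Object> in place of
a Crunch Tuple are all assumptions made to keep the example
self-contained. Only the names parse, xtuple, xstring, xint, xboolean and
xdouble come from the message. It also uses Java 8 lambdas for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenExtractSketch {

  /** Hypothetical: turns one input record into a sequence of tokens. */
  interface Tokenizer {
    List<String> tokenize(String record);
  }

  /** Hypothetical: pulls one value out of the token sequence. */
  interface Extractor<T> {
    T extract(List<String> tokens);
  }

  /** Tokenizer strategy 1: split at a static or regex delimiter. */
  static Tokenizer splitOn(String delimiterRegex) {
    Pattern p = Pattern.compile(delimiterRegex);
    // Limit -1 keeps trailing empty fields so column numbers stay stable.
    return record -> Arrays.asList(p.split(record, -1));
  }

  /** Tokenizer strategy 2: the capture groups of a regex (log parsing). */
  static Tokenizer groupsOf(String regex) {
    Pattern p = Pattern.compile(regex);
    return record -> {
      List<String> out = new ArrayList<>();
      Matcher m = p.matcher(record);
      if (m.matches()) {
        for (int i = 1; i <= m.groupCount(); i++) {
          out.add(m.group(i));
        }
      }
      return out; // empty if the record does not match the pattern
    };
  }

  /** Reads the token at the given column and converts it. */
  static <T> Extractor<T> column(int col, Function<String, T> convert) {
    return tokens -> convert.apply(tokens.get(col));
  }

  static Extractor<String> xstring(int col) {
    return column(col, s -> s);
  }

  static Extractor<Integer> xint(int col) {
    return column(col, s -> Integer.valueOf(s.trim()));
  }

  static Extractor<Boolean> xboolean(int col) {
    return column(col, s -> Boolean.valueOf(s.trim()));
  }

  static Extractor<Double> xdouble(int col) {
    return column(col, s -> Double.valueOf(s.trim()));
  }

  /** Combines extractors; a List<Object> stands in for a Crunch Tuple. */
  static Extractor<List<Object>> xtuple(Extractor<?>... parts) {
    return tokens -> {
      List<Object> out = new ArrayList<>();
      for (Extractor<?> part : parts) {
        out.add(part.extract(tokens));
      }
      return out;
    };
  }

  static <T> T parse(String record, Tokenizer tokenizer, Extractor<T> extractor) {
    return extractor.extract(tokenizer.tokenize(record));
  }

  public static void main(String[] args) {
    String row = "alice,1,2,true,4,5,6,42,8,3.5";
    // Only columns 0, 7, 3 and 9 end up in the result.
    System.out.println(parse(row, splitOn(","),
        xtuple(xstring(0), xint(7), xboolean(3), xdouble(9))));
  }
}
```

Because every extractor sees the whole token list, skipping unwanted
columns costs nothing beyond the tokenizing step, and swapping
splitOn(",") for groupsOf(...) leaves the extractor definitions unchanged.
A real version would still need the missing-data handling and reporting
mentioned earlier in the thread; here a bad column index or an unparsable
number simply throws.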
