Hey Matthias,

Your use case makes a lot of sense to me. I'm still recovering from a business trip last week, but I'll take a look at the latest patch tomorrow morning.

Josh

On Fri, Nov 30, 2012 at 12:23 PM, Matthias Friedrich <[email protected]> wrote:

> Hi Josh,
>
> sorry for taking so long. I've checked the code and I'm confident that
> it works as intended. There are a few things about handling and
> reporting missing data that I'd like to look into, but what we have
> now is great stuff already.
>
> Based on my own use cases I'd like to discuss a slightly different
> solution. I'm not sure if my requirements are esoteric, but perhaps
> that's what others need, too.
>
> The data I work with usually comes in CSV format with many rows and
> several hundreds of attributes. I have a schema for the file so I can
> easily map column names to column numbers. When I process the file I
> am often interested in only a small subset of columns. I would want
> a parsing library to take a row of the file and give me a Tuple with
> only the columns I'm interested in. Since we don't have a NamedTuple
> abstraction, I don't want the Tuple to contain too much data that
> I don't need.
>
> This is possible with the current implementation, but the Scanner
> stuff looks a bit unwieldy when it comes to skipping data. Here's how
> I would like to specify the extraction process:
>
>   Parse.parse(data, tokenizer,
>     xtuple(xstring(0), xint(7), xboolean(3), xdouble(9)))
>
> The int argument specifies the column number to extract the data from.
> This approach would work best if we just take the input record and
> turn it into a sequence of tokens. We could offer alternative
> strategies for tokenizing, like regex for log parsing (we pull out the
> groups specified in the pattern) or simple splitting at a static
> or regex delimiter. The extractors get the sequence of tokens passed
> in and take whatever they need.
>
> I'm a bit busy right now but I'd help out with some code if you want.
> It would probably take a bit until I can make some time though.
>
> What do you think?
>
> Regards,
> Matthias
>
> On Sunday, 2012-11-25, Josh Wills wrote:
> >
> > -----------------------------------------------------------
> > This is an automatically generated e-mail. To reply, visit:
> > https://reviews.apache.org/r/8151/
> > -----------------------------------------------------------
> >
> > (Updated Nov. 25, 2012, 7:21 p.m.)
> >
> >
> > Review request for crunch.
> >
> >
> > Changes
> > -------
> >
> > Incorporated feedback from Matthias and Gabriel; added a bunch of javadoc.
> >
> >
> > Description
> > -------
> >
> > Latest and greatest rev of the extraction library for text parsing. I ended up refactoring the approach so that we could support nested parsing (e.g., using different Scanner instances for different parts of a line) and collections of items on a single line.
> >
> >
> > This addresses bug CRUNCH-97.
> >     https://issues.apache.org/jira/browse/CRUNCH-97
> >
> >
> > Diffs (updated)
> > -----
> >
> >   crunch/src/main/java/org/apache/crunch/lib/PTables.java e788656
> >   crunch/src/main/java/org/apache/crunch/lib/text/AbstractCompositeExtractor.java PRE-CREATION
> >   crunch/src/main/java/org/apache/crunch/lib/text/AbstractSimpleExtractor.java PRE-CREATION
> >   crunch/src/main/java/org/apache/crunch/lib/text/Extractor.java PRE-CREATION
> >   crunch/src/main/java/org/apache/crunch/lib/text/ExtractorStats.java PRE-CREATION
> >   crunch/src/main/java/org/apache/crunch/lib/text/Extractors.java PRE-CREATION
> >   crunch/src/main/java/org/apache/crunch/lib/text/Parse.java PRE-CREATION
> >   crunch/src/main/java/org/apache/crunch/lib/text/ScannerFactory.java PRE-CREATION
> >   crunch/src/test/java/org/apache/crunch/lib/text/ParseTest.java PRE-CREATION
> >
> > Diff: https://reviews.apache.org/r/8151/diff/
> >
> >
> > Testing
> > -------
> >
> > Unit tests so far, still gathering feedback on the approach.
> >
> >
> > Thanks,
> >
> > Josh Wills
> >
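
For reference, the column-index extraction Matthias describes in the quoted mail could be sketched roughly like this. It is only an illustration under stated assumptions, not the code in the CRUNCH-97 patch: the names xstring, xint, xboolean, xdouble and xtuple come from his example, but their bodies here are guesses, and ColumnExtractSketch, Tokenizer, ColumnExtractor, column, splitOn and the single-record parse are hypothetical. A plain List<Object> stands in for a Crunch Tuple, and Java 8 lambdas are used for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical sketch: tokenize a record once, then let each extractor
// pick the column it needs by index. Not the Crunch API.
public class ColumnExtractSketch {

  /** Turns one input record into a sequence of tokens (assumed interface). */
  public interface Tokenizer {
    List<String> tokenize(String record);
  }

  /** Pulls one typed value out of the token sequence (assumed interface). */
  public interface ColumnExtractor<T> {
    T extract(List<String> tokens);
  }

  /** Tokenizer that splits at a regex delimiter and keeps trailing empty fields. */
  public static Tokenizer splitOn(String regex) {
    Pattern pattern = Pattern.compile(regex);
    return record -> Arrays.asList(pattern.split(record, -1));
  }

  /** Generic building block: read column {@code index} and convert it. */
  public static <T> ColumnExtractor<T> column(int index, Function<String, T> convert) {
    return tokens -> convert.apply(tokens.get(index));
  }

  public static ColumnExtractor<String> xstring(int index) {
    return column(index, s -> s);
  }

  public static ColumnExtractor<Integer> xint(int index) {
    return column(index, s -> Integer.valueOf(s.trim()));
  }

  public static ColumnExtractor<Boolean> xboolean(int index) {
    return column(index, s -> Boolean.valueOf(s.trim()));
  }

  public static ColumnExtractor<Double> xdouble(int index) {
    return column(index, s -> Double.valueOf(s.trim()));
  }

  /** Combines extractors; the result holds only the requested columns, in order. */
  public static ColumnExtractor<List<Object>> xtuple(ColumnExtractor<?>... parts) {
    return tokens -> {
      List<Object> out = new ArrayList<>(parts.length);
      for (ColumnExtractor<?> part : parts) {
        out.add(part.extract(tokens));
      }
      return out;
    };
  }

  /** Parses a single record; the real library would map this over a collection. */
  public static <T> T parse(String record, Tokenizer tokenizer, ColumnExtractor<T> extractor) {
    return extractor.extract(tokenizer.tokenize(record));
  }
}
```

With a comma tokenizer, parse("a,1,2,true,4,5,6,42,8,3.5", splitOn(","), xtuple(xstring(0), xint(7), xboolean(3), xdouble(9))) yields a four-element list holding "a", 42, true and 3.5. Missing columns and unparsable values simply throw in this sketch; how to handle and report missing data is left open, as in the mail.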
