OK, here's the workaround I have implemented to get past the above issue...

Some MyConstants.java file:

    /** Matches any of the common line separators; CRLF must be listed first so it is not consumed as a lone CR. */
    public static final String SYSTEM_AGNOSTIC_NEWLINE_REGEX = "\r\n|\r|\n";
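
Side note: here is a quick standalone check with plain JDK string splitting (nothing Camel-specific, and the class name is made up just for illustration) showing why the CRLF alternative has to come first in the regex; otherwise a CRLF pair is consumed as two separate separators and leaves an empty token behind:

    import java.util.Arrays;

    public class NewlineRegexCheck
    {
        public static void main(String[] args)
        {
            String crlfInput = "line1\r\nline2";

            // CR listed first: the CRLF pair is split as two separators, leaving an empty token in the middle.
            System.out.println(Arrays.toString(crlfInput.split("\r|\r\n|\n")));   // [line1, , line2]

            // CRLF listed first: the pair is consumed as a single separator.
            System.out.println(Arrays.toString(crlfInput.split("\r\n|\r|\n")));   // [line1, line2]
        }
    }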

Splitter route configuration in a RouteBuilder implementation:

        TokenizerExpression tokenizerExpression = new TokenizerExpression();

        tokenizerExpression.setToken(MyConstants.SYSTEM_AGNOSTIC_NEWLINE_REGEX); // tokenize by line separators
        tokenizerExpression.setGroup(readerConfig.getLinesPerChunk());           // group so many lines into one exchange
        tokenizerExpression.setRegex(true);                                      // indicate that it is a regular expression, not a simple string match

        from(FILE_SPLITTER_ENDPOINT).routeId("fileSplitterRoute").
            split(tokenizerExpression).
                streaming().                                             // enable streaming vs. reading all into memory
                parallelProcessing(readerConfig.isParallelProcessing()). // on/off concurrent processing of multiple chunks
                stopOnException().                                       // stop processing the file if a system exception occurs (handled by onException clause)
                bean(new TokenizerCharRemover()).                        // cleans junk chars inserted by Camel's tokenizer due to bug(?)
                unmarshal().csv().                                       // unmarshal each chunk to Java (list of String lists) using Camel's CSV component
                bean(csvHandler).                                        // hand each unmarshalled list of lines/fields to a bean that parses and validates line content
                bean(importProcessor).                                   // process codes for import (depending on operational mode and errors in exchange)
                to(AGGREGATE_ERRORS_ENDPOINT).                           // delegate to a nested route to update the error report
            end();
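
The onException clause mentioned in the stopOnException() comment is not shown above. For completeness, a minimal sketch of what such a clause could look like in the same RouteBuilder is below; ERROR_REPORT_ENDPOINT is just a placeholder name, not my actual configuration:

        // Minimal sketch of an onException clause; ERROR_REPORT_ENDPOINT is a placeholder endpoint constant.
        onException(Exception.class).
            handled(true).                // mark the exception as handled so it is not propagated back to the consumer
            to(ERROR_REPORT_ENDPOINT);    // hand the failed exchange to the route that updates the error report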
 

TokenizerCharRemover.java:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.camel.Exchange;
import org.apache.camel.Handler;

public class TokenizerCharRemover
{
    /**
     * Pre-compiled pattern that matches the character sequence of the regular expression itself, which Camel's
     * splitter tokenizer inserts between the file lines in the exchange body. The input string that specifies
     * the pattern is treated as a sequence of literal characters thanks to the {@link Pattern#LITERAL} flag.
     */
    private static final Pattern REPLACE_JUNK_PATTERN =
        Pattern.compile(MyConstants.SYSTEM_AGNOSTIC_NEWLINE_REGEX, Pattern.LITERAL);


    /**
     * Replaces every instance of the {@link MyConstants#SYSTEM_AGNOSTIC_NEWLINE_REGEX} character sequence in the
     * exchange body with a simple '\n' line separator.
     */
    @SuppressWarnings("MethodMayBeStatic")
    @Handler
    public void cleanupLineSeparators(Exchange exchange)
    {
        String newBody = REPLACE_JUNK_PATTERN.matcher(exchange.getIn().getBody(String.class))
            .replaceAll(Matcher.quoteReplacement("\n"));
        exchange.getIn().setBody(newBody);
    }

}
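
And a quick way to see the bean in action outside the route: a throwaway main (the demo class name is made up, and it assumes a plain DefaultCamelContext/DefaultExchange) where the sample body mimics a grouped chunk in which the tokenizer inserted the literal regex string between the lines:

import org.apache.camel.Exchange;
import org.apache.camel.impl.DefaultCamelContext;
import org.apache.camel.impl.DefaultExchange;

public class TokenizerCharRemoverDemo
{
    public static void main(String[] args)
    {
        // Simulate a grouped chunk where the tokenizer inserted the literal regex string between lines.
        Exchange exchange = new DefaultExchange(new DefaultCamelContext());
        exchange.getIn().setBody("code1,desc1" + MyConstants.SYSTEM_AGNOSTIC_NEWLINE_REGEX + "code2,desc2");

        new TokenizerCharRemover().cleanupLineSeparators(exchange);

        // Prints the two CSV lines separated by a plain '\n' again.
        System.out.println(exchange.getIn().getBody(String.class));
    }
}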

If there is a better solution, or if I have missed some obvious, simpler way to use the tokenizer that does not
replace the matched line separators with the regex character sequence itself, please let me know! I'd very much
appreciate it.


