I am creating a custom pattern tokenizer to change the type of the
generated tokens. My incrementToken() method looks like the code below:

public boolean incrementToken() {
    if (index >= str.length()) return false;
    clearAttributes();
    if (group >= 0) {

        // match a specific group
        while (matcher.find()) {
            index = matcher.start(group);
            final int endIndex = matcher.end(group);
            if (index == endIndex) continue;
            termAtt.setEmpty().append(str, index, endIndex);
            offsetAtt.setOffset(correctOffset(index), correctOffset(endIndex));
            //Changing Token Type based on the pattern matcher
            Pattern pattern = Pattern.compile("\\p{Alnum}+");
            Matcher matcher = pattern.matcher(input.toString());
            boolean matchFound = matcher.find();
            if (matchFound) {
                typeAttribute.setType("some_random_type".toLowerCase());
            }
            return true;
        }
    }
}

I'm trying to change the type of a generated token, through
typeAttribute, whenever that token matches a particular regex. Here I am
using the pattern "\p{Alnum}+", so the type of every alphanumeric token
should be changed.

Currently, I am getting the token as:

"tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
"type" : "word", "position" : 0 }, ]

I want the above token to be like:

"tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
"type" : "some_random_type", "position" : 0 }, ]

Since the token matches the pattern "\p{Alnum}+", its type should be
changed to the type passed to typeAttribute.setType().

However, the code I have written gives every token the type
"some_random_type". Even a token that does not match the pattern
"\p{Alnum}+" gets the type "some_random_type".

How can I make sure that only the tokens which match the pattern
"\p{Alnum}+" get the type "some_random_type"?
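
For illustration, here is a minimal sketch of a per-token check, written
with java.util.regex only so it runs without Lucene. It rests on two
assumptions that are not confirmed above: that the check is meant to run
against the text of the current token (in the tokenizer that would be
termAtt, which is a CharSequence) rather than against input.toString(),
and that the whole token must match, so matches() is used instead of
find(). If input is the tokenizer's Reader, input.toString() is likely the
default object string, which always contains alphanumeric characters. The
class name TokenTypeSketch and the helper typeFor are hypothetical.

```java
import java.util.regex.Pattern;

public class TokenTypeSketch {
    // Compile once instead of on every incrementToken() call.
    private static final Pattern ALNUM = Pattern.compile("\\p{Alnum}+");

    // Hypothetical helper: decide the type from the token text itself.
    // "word" is the default type seen in the output above.
    static String typeFor(CharSequence term) {
        // matches() requires the entire token to be alphanumeric;
        // find() would also accept a token that merely contains such a run.
        return ALNUM.matcher(term).matches() ? "some_random_type" : "word";
    }

    public static void main(String[] args) {
        System.out.println(typeFor("testing")); // some_random_type
        System.out.println(typeFor("--"));      // word
        System.out.println(typeFor("foo-bar")); // word
    }
}
```

Inside incrementToken() the equivalent call would presumably be
typeAttribute.setType(typeFor(termAtt)) after termAtt has been filled,
though that is untested against the tokenizer above.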
