Re: Lucene 9.2 release

2022-05-03 Thread Ignacio Vera
+1 

Thanks Alan!

> On 3 May 2022, at 13:01, Alan Woodward wrote:
> 
> Hi all,
> 
> It’s been six weeks or so since we released 9.1, and we have a bunch of nice 
> new features and enhancements piling up in the 9.x branch.  I’d like to 
> volunteer to be a release manager for a 9.2 release.  I propose to cut a 
> branch this time next week, 10th May.
> 
> - Alan
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 




Re: Changing type of the tokens generated by pattern tokenizer

2022-05-03 Thread Robert Muir
As an alternative to writing a custom tokenizer, you can use the built-in
PatternTypingFilter, which does exactly this (it sets the type based on
whether the token matches some regex).

https://lucene.apache.org/core/9_1_0/analysis/common/org/apache/lucene/analysis/pattern/PatternTypingFilter.html
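
The idea behind this suggestion, leaving the tokenizer alone and assigning types in a separate pass over the tokens it has already produced, can be sketched in plain Java. This is not the PatternTypingFilter API: the class RegexTyping, the method typeOf, and the rule map are hypothetical names used only for illustration, and the choice of a full-token matches() check is an assumption of this sketch, not a statement about how the Lucene filter matches.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class RegexTyping {
    /**
     * Returns the type of the first rule whose pattern matches the whole token,
     * or defaultType if no rule matches. Rules are tried in insertion order.
     */
    static String typeOf(CharSequence token, Map<Pattern, String> rules, String defaultType) {
        for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
            if (rule.getKey().matcher(token).matches()) {
                return rule.getValue();
            }
        }
        return defaultType;
    }

    public static void main(String[] args) {
        // Hypothetical rule set: alphanumeric tokens get "some_random_type".
        Map<Pattern, String> rules = new LinkedHashMap<>();
        rules.put(Pattern.compile("\\p{Alnum}+"), "some_random_type");

        for (String token : Arrays.asList("testing", "++", "abc123")) {
            System.out.println(token + " -> " + typeOf(token, rules, "word"));
        }
        // testing -> some_random_type
        // ++ -> word
        // abc123 -> some_random_type
    }
}
```

Keeping the typing step separate means the tokenizer's pattern and the typing patterns can change independently.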

On Tue, May 3, 2022 at 3:47 AM dishant sharma wrote:
>
> I am creating a custom Pattern Tokenizer to change the type of the generated 
> tokens. My incrementToken() function looks like the code below:
>
> public boolean incrementToken() {
> if (index >= str.length()) return false;
> clearAttributes();
> if (group >= 0) {
>
> // match a specific group
> while (matcher.find()) {
> index = matcher.start(group);
> final int endIndex = matcher.end(group);
> if (index == endIndex) continue;
> termAtt.setEmpty().append(str, index, endIndex);
> offsetAtt.setOffset(correctOffset(index), 
> correctOffset(endIndex));
> //Changing Token Type based on the pattern matcher
> Pattern pattern = Pattern.compile("\\p{Alnum}+");
> Matcher matcher = pattern.matcher(input.toString());
> boolean matchFound = matcher.find();
> if (matchFound) {
> typeAttribute.setType("some_random_type".toLowerCase());
> }
> return true;
> }
> }
> }
>
> I'm trying to change the type of the generated tokens based on the condition 
> that whenever the token encounters a particular regex, using the 
> typeAttribute, the type of the token should be changed. Here, I am using the 
> pattern "\p{Alnum}+", so whenever there is an alphanumeric token, its type 
> should be changed.
>
> Currently, I am getting the token as:
>
> "tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7, 
> "type" : "word", "position" : 0 }, ]
>
> I want the above token to be like:
>
> "tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7, 
> "type" : "some_random_type", "position" : 0 }, ]
>
> Since the token matches with the pattern "\p{Alnum}+", the type of the token 
> should be changed to the type specified inside the "typeAttribute.setType."
>
> But, the code that I have done is spitting out all the tokens of the type 
> "some_random_type." If any token is not being matched with the pattern 
> "\p{Alnum}+", it is also getting the type "some_random_type".
>
> How can I make only the tokens that match the pattern "\p{Alnum}+" get the 
> type "some_random_type"?




REMINDER - Travel Assistance available for ApacheCon NA New Orleans 2022

2022-05-03 Thread Gavin McDonald
Hi All Contributors and Committers,

This is a first reminder email that travel
assistance applications for ApacheCon NA 2022 are now open!

We will be supporting ApacheCon North America in New Orleans, Louisiana,
on October 3rd through 6th, 2022.

TAC exists to help those who would like to attend ApacheCon events but
are unable to do so for financial reasons. This year, we are supporting
both committers and non-committers involved with projects at the
Apache Software Foundation, or open source projects in general.

For more info on this year's applications and qualifying criteria, please
visit the TAC website at http://www.apache.org/travel/
Applications are open and will close on the 1st of July 2022.

Important: Applicants have until the closing date above to submit their
applications (which should contain as much supporting material as required
to efficiently and accurately process their request). This will enable TAC
to announce successful awards shortly afterwards.

As usual, TAC expects to deal with a range of applications from a diverse
range of backgrounds. We therefore encourage (as always) anyone thinking
about sending in an application to do so ASAP.

Why should you attend as a TAC recipient? We encourage you to read stories
from past recipients at https://apache.org/travel/stories/ . Also note that
previous TAC recipients have gone on to become Committers, PMC Members, ASF
Members, Directors of the ASF Board and Infrastructure Staff members.
Others have gone from Committer to full-time Open Source Developers!

How far can you go? Let TAC help get you there.



Lucene 9.2 release

2022-05-03 Thread Alan Woodward
Hi all,

It’s been six weeks or so since we released 9.1, and we have a bunch of nice 
new features and enhancements piling up in the 9.x branch.  I’d like to 
volunteer to be a release manager for a 9.2 release.  I propose to cut a branch 
this time next week, 10th May.

- Alan



Re: Changing type of the tokens generated by pattern tokenizer

2022-05-03 Thread Tomoko Uchida
Hi,
You pass input.toString() to the matcher. This is the entire source
character stream to be tokenized, which I think leads to the result you
saw.
If you'd like to match the pattern against a specific token (a substring of
the input), I think you need to give that substring of the input string
to the matcher, as termAtt.append() does in your code.
Also, I'd suggest including "^" and "$" in your regex to avoid
unintentional matches.
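
A minimal sketch of the difference, in plain Java with no Lucene classes: an unanchored find() over a larger string succeeds as soon as any alphanumeric run exists anywhere in it, while matches() on just the token's slice must cover the whole token. The class TokenTypeCheck and its methods are hypothetical names for illustration; the sample string and the offsets are assumptions, and "word" is used as the default type because that is what the posted output shows.

```java
import java.util.regex.Pattern;

final class TokenTypeCheck {
    // Compiled once instead of on every incrementToken() call.
    private static final Pattern ALNUM = Pattern.compile("\\p{Alnum}+");

    /** Unanchored search over the whole text: true if ANY alphanumeric run exists. */
    static boolean foundAnywhere(CharSequence wholeInput) {
        return ALNUM.matcher(wholeInput).find();
    }

    /** Type for the token at [start, end) of str; matches() must cover the whole slice. */
    static String typeFor(CharSequence str, int start, int end) {
        boolean whole = ALNUM.matcher(str.subSequence(start, end)).matches();
        return whole ? "some_random_type" : "word";
    }

    public static void main(String[] args) {
        String str = "testing ++ done";
        System.out.println(foundAnywhere(str));   // true, regardless of the current token
        System.out.println(typeFor(str, 0, 7));   // some_random_type ("testing")
        System.out.println(typeFor(str, 8, 10));  // word ("++")
    }
}
```

Because matches() requires the entire slice to match, it has the same effect as wrapping the pattern in "^" and "$" and calling find() on the slice.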

Tomoko


On Tue, May 3, 2022 at 16:47, dishant sharma wrote:

> I am creating a custom Pattern Tokenizer to change the type of the
> generated tokens. My incrementToken() function looks like the code below:
>
> public boolean incrementToken() {
> if (index >= str.length()) return false;
> clearAttributes();
> if (group >= 0) {
>
> // match a specific group
> while (matcher.find()) {
> index = matcher.start(group);
> final int endIndex = matcher.end(group);
> if (index == endIndex) continue;
> termAtt.setEmpty().append(str, index, endIndex);
> offsetAtt.setOffset(correctOffset(index), 
> correctOffset(endIndex));
> //Changing Token Type based on the pattern matcher
> Pattern pattern = Pattern.compile("\\p{Alnum}+");
> Matcher matcher = pattern.matcher(input.toString());
> boolean matchFound = matcher.find();
> if (matchFound) {
> typeAttribute.setType("some_random_type".toLowerCase());
> }
> return true;
> }
> }
> }
>
> I'm trying to change the type of the generated tokens based on the
> condition that whenever the token encounters a particular regex, using the
> typeAttribute, the type of the token should be changed. Here, I am using
> the pattern "\p{Alnum}+", so whenever there is an alphanumeric token, its
> type should be changed.
>
> Currently, I am getting the token as:
>
> "tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
> "type" : "word", "position" : 0 }, ]
>
> I want the above token to be like:
>
> "tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
> "type" : "some_random_type", "position" : 0 }, ]
>
> Since the token matches with the pattern "\p{Alnum}+", the type of the
> token should be changed to the type specified inside the
> "typeAttribute.setType."
>
> But, the code that I have done is spitting out all the tokens of the type
> "some_random_type." If any token is not being matched with the pattern
> "\p{Alnum}+", it is also getting the type "some_random_type".
>
> How can I make only the tokens that match the pattern "\p{Alnum}+" get
> the type "some_random_type"?
>


Changing type of the tokens generated by pattern tokenizer

2022-05-03 Thread dishant sharma
I am creating a custom Pattern Tokenizer to change the type of the
generated tokens. My incrementToken() function looks like the code below:

public boolean incrementToken() {
  if (index >= str.length()) return false;
  clearAttributes();
  if (group >= 0) {

    // match a specific group
    while (matcher.find()) {
      index = matcher.start(group);
      final int endIndex = matcher.end(group);
      if (index == endIndex) continue;
      termAtt.setEmpty().append(str, index, endIndex);
      offsetAtt.setOffset(correctOffset(index), correctOffset(endIndex));
      //Changing Token Type based on the pattern matcher
      Pattern pattern = Pattern.compile("\\p{Alnum}+");
      Matcher matcher = pattern.matcher(input.toString());
      boolean matchFound = matcher.find();
      if (matchFound) {
        typeAttribute.setType("some_random_type".toLowerCase());
      }
      return true;
    }
  }
}

I'm trying to change the type of a generated token, using the
typeAttribute, whenever the token matches a particular regex. Here, I am
using the pattern "\p{Alnum}+", so whenever a token is alphanumeric, its
type should be changed.

Currently, I am getting the token as:

"tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
"type" : "word", "position" : 0 }, ]

I want the above token to be like:

"tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
"type" : "some_random_type", "position" : 0 }, ]

Since the token matches the pattern "\p{Alnum}+", its type should be
changed to the one passed to typeAttribute.setType().

But the code I have written gives every token the type
"some_random_type". Even a token that does not match the pattern
"\p{Alnum}+" gets the type "some_random_type".

How can I make only the tokens that match the pattern "\p{Alnum}+" get
the type "some_random_type"?