[jira] [Resolved] (LUCENE-8971) Enable constructing JapaneseTokenizer from custom dictionary

2019-09-11 Thread Mike Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov resolved LUCENE-8971.
--
  Assignee: Mike Sokolov
Resolution: Fixed

> Enable constructing JapaneseTokenizer from custom dictionary 
> -
>
> Key: LUCENE-8971
> URL: https://issues.apache.org/jira/browse/LUCENE-8971
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
> Fix For: 8.3
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is basically just finishing up what was started in LUCENE-8863. It adds 
> a public constructor to {{JapaneseTokenizer}} that lets you bring your own 
> dictionary, plus exposing the necessary constructors for 
> {{UnknownDictionary}}, {{TokenInfoDictionary}}, and {{ConnectionCosts}}.
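The constructor shape described above can be sketched without Lucene on the classpath. Everything below is a hypothetical, self-contained stand-in: the real classes are Lucene's JapaneseTokenizer, TokenInfoDictionary, UnknownDictionary and ConnectionCosts, and their actual signatures may differ; the stand-in types and the describe() helper exist only to illustrate the bring-your-own-dictionary idea.

```java
// Hedged sketch of a "bring your own dictionary" constructor. The types here
// are illustrative stand-ins, NOT Lucene's API.
public class BringYourOwnDictionary {

    // Stand-ins for the dictionary resources a caller can now construct itself.
    static class TokenInfoDict { final String name; TokenInfoDict(String name) { this.name = name; } }
    static class UnknownDict { final String name; UnknownDict(String name) { this.name = name; } }
    static class ConnCosts { final String name; ConnCosts(String name) { this.name = name; } }

    static class Tokenizer {
        final TokenInfoDict system;
        final UnknownDict unknown;
        final ConnCosts costs;

        // The newly public constructor shape: the caller supplies every
        // resource instead of the tokenizer loading a bundled system dictionary.
        Tokenizer(TokenInfoDict system, UnknownDict unknown, ConnCosts costs) {
            this.system = system;
            this.unknown = unknown;
            this.costs = costs;
        }
    }

    static String describe(Tokenizer t) {
        return t.system.name + "+" + t.unknown.name + "+" + t.costs.name;
    }

    public static void main(String[] args) {
        Tokenizer t = new Tokenizer(new TokenInfoDict("custom-system"),
                new UnknownDict("custom-unk"), new ConnCosts("custom-costs"));
        System.out.println(describe(t)); // prints custom-system+custom-unk+custom-costs
    }
}
```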



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8971) Enable constructing JapaneseTokenizer from custom dictionary

2019-09-11 Thread Mike Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8971:
-
Fix Version/s: 8.3

> Enable constructing JapaneseTokenizer from custom dictionary 
> -
>
> Key: LUCENE-8971
> URL: https://issues.apache.org/jira/browse/LUCENE-8971
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
> Fix For: 8.3
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is basically just finishing up what was started in LUCENE-8863. It adds 
> a public constructor to {{JapaneseTokenizer}} that lets you bring your own 
> dictionary, plus exposing the necessary constructors for 
> {{UnknownDictionary}}, {{TokenInfoDictionary}}, and {{ConnectionCosts}}.






[jira] [Updated] (LUCENE-8971) Enable constructing JapaneseTokenizer from custom dictionary

2019-09-06 Thread Mike Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8971:
-
Description: This is basically just finishing up what was started in 
LUCENE-8863. It adds a public constructor to {{JapaneseTokenizer}} that lets 
you bring your own dictionary, plus exposing the necessary constructors for 
{{UnknownDictionary}}, {{TokenInfoDictionary}}, and {{ConnectionCosts}}.  (was: This 
is basically just finishing up what was started in LUCENE-8863. It adds a 
public constructor to {JapaneseTokenizer} that lets you bring-your-own 
dictionary, plus exposing the necessary constructors for {UnknownDictionary}, 
{TokenInfoDictionary}, and {ConnectionCosts}.)

> Enable constructing JapaneseTokenizer from custom dictionary 
> -
>
> Key: LUCENE-8971
> URL: https://issues.apache.org/jira/browse/LUCENE-8971
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>
> This is basically just finishing up what was started in LUCENE-8863. It adds 
> a public constructor to {{JapaneseTokenizer}} that lets you bring your own 
> dictionary, plus exposing the necessary constructors for 
> {{UnknownDictionary}}, {{TokenInfoDictionary}}, and {{ConnectionCosts}}.






[jira] [Created] (LUCENE-8971) Enable constructing JapaneseTokenizer from custom dictionary

2019-09-06 Thread Mike Sokolov (Jira)
Mike Sokolov created LUCENE-8971:


 Summary: Enable constructing JapaneseTokenizer from custom 
dictionary 
 Key: LUCENE-8971
 URL: https://issues.apache.org/jira/browse/LUCENE-8971
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Mike Sokolov


This is basically just finishing up what was started in LUCENE-8863. It adds a 
public constructor to {JapaneseTokenizer} that lets you bring-your-own 
dictionary, plus exposing the necessary constructors for {UnknownDictionary}, 
{TokenInfoDictionary}, and {ConnectionCosts}.






[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-06 Thread Mike Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924206#comment-16924206
 ] 

Mike Sokolov commented on LUCENE-8966:
--

> For complex number grouping and normalization, Namgyu Kim added a 
> KoreanNumberFilter in https://issues.apache.org/jira/browse/LUCENE-8812

Ah thanks, I'll have a look

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> a character of another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]
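The splitting rule in the description can be sketched as follows. This is an illustrative, self-contained approximation (the class and method names are made up; the real logic lives in Lucene's KoreanTokenizer unknown-word handling): a run of unknown-word characters is cut whenever a digit is adjacent to a non-digit, so "44사이즈" becomes two tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed rule: split an unknown-word run at
// every digit/non-digit boundary. Names are illustrative, not Lucene's API.
public class DigitSplitDemo {
    public static List<String> split(String unknownRun) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < unknownRun.length(); ) {
            int cp = unknownRun.codePointAt(i);
            if (current.length() > 0) {
                int prev = current.codePointBefore(current.length());
                // Split whenever the "digit-ness" changes between adjacent chars.
                if (Character.isDigit(prev) != Character.isDigit(cp)) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("44사이즈")); // prints [44, 사이즈]
    }
}
```

With this rule, "44사이즈" is no longer kept as a single unknown token, so the dictionary entry "사이즈" can match again.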






[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-06 Thread Mike Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924203#comment-16924203
 ] 

Mike Sokolov commented on LUCENE-8920:
--

If I understand you correctly, T1 is the threshold we introduced earlier this 
year (or its inverse, DIRECT_ARC_LOAD_FACTOR in fst.Builder). It's currently set 
to 4 (i.e. 1/4 as T1 in your formulation). There was pre-existing logic to 
decide between the (var-encoded) list and the (fixed-size, packed) array 
encoding; my change was piggy-backed on that. It's a threshold on N that 
depends on the depth in the FST. See FST.shouldExpand.

If you want to write up the open addressing idea in more detail, it's fine to 
add comments here unless you think they are too long / inconvenient to write in 
this form, then maybe attach a doc? I think that goes directly to the point of 
reducing space consumption, so this issue seems like a fine place for it.
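The load-factor trade-off discussed above can be made concrete with a small sketch. The byte sizes and the decision helper below are illustrative assumptions, not Lucene's actual code: direct addressing reserves one fixed-size slot for every label in [minLabel, maxLabel], while the packed-list encoding stores only the arcs that exist, so a threshold on occupancy decides which one wastes less space.

```java
// Illustrative model of the direct-addressing vs packed-list space trade-off.
// BYTES_PER_SLOT is a made-up constant; real arc metadata sizes vary.
public class ArcEncodingCost {
    static final int BYTES_PER_SLOT = 12; // assumed fixed arc-slot size

    // Direct addressing: one slot per label in the whole label range.
    public static long directAddressingBytes(int minLabel, int maxLabel) {
        return (long) (maxLabel - minLabel + 1) * BYTES_PER_SLOT;
    }

    // Packed list: one slot per arc that actually exists.
    public static long packedListBytes(int numArcs) {
        return (long) numArcs * BYTES_PER_SLOT;
    }

    // Mirrors a load-factor test: accept direct addressing only when at
    // least 1/factor of the label range is occupied (factor = 4 would
    // correspond to the T1 = 1/4 threshold discussed above).
    public static boolean useDirectAddressing(int numArcs, int minLabel, int maxLabel, int factor) {
        return (long) numArcs * factor >= (maxLabel - minLabel + 1);
    }
}
```

For example, 64 arcs spread over labels 0..255 exactly meet a factor-4 threshold, while 10 arcs over the same range do not.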

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, ie. lookup label -> id and then id->arc 
> offset instead of doing label->arc directly could save a lot of space in 
> some cases? Also it seems that we are repeating the label in the arc 
> metadata when array-with-gaps is used, even though it shouldn't be necessary 
> since the label is implicit from the address?
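The label -> id -> arc-offset idea quoted above can be sketched as a two-level table. This is an assumption-laden illustration, not Lucene code: a sparse 256-entry byte table maps labels to dense ids (so each gap costs one byte), and a dense array of per-arc offsets is indexed by that id; it assumes byte-sized labels and at most 255 arcs per node.

```java
// Illustrative two-level lookup: label -> dense id -> arc offset.
// Gaps in the label space cost one byte in labelToId instead of a full
// arc-metadata slot. Not Lucene's implementation.
public class TwoLevelLookup {
    private final byte[] labelToId = new byte[256]; // 0 means "absent"
    private final long[] idToOffset;

    // labels must be distinct values in 0..255; at most 255 arcs supported.
    public TwoLevelLookup(int[] labels, long[] offsets) {
        idToOffset = offsets.clone();
        for (int id = 0; id < labels.length; id++) {
            labelToId[labels[id]] = (byte) (id + 1); // store id+1 so 0 = absent
        }
    }

    /** Returns the arc offset for this label, or -1 if no such arc exists. */
    public long offsetFor(int label) {
        int id = labelToId[label] & 0xFF;
        return id == 0 ? -1 : idToOffset[id - 1];
    }
}
```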






[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-05 Thread Mike Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923530#comment-16923530
 ] 

Mike Sokolov commented on LUCENE-8920:
--

I like this! I would be happy to review if you want to post a patch. I may try 
it eventually too if you don't get to it. It should be a bit easier to try 
now that we have done some refactoring here. One potential complication is that 
we are running out of bits to signal different encodings, but I think there 
should be one or two left?

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, ie. lookup label -> id and then id->arc 
> offset instead of doing label->arc directly could save a lot of space in 
> some cases? Also it seems that we are repeating the label in the arc 
> metadata when array-with-gaps is used, even though it shouldn't be necessary 
> since the label is implicit from the address?






[jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits

2019-09-05 Thread Mike Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923378#comment-16923378
 ] 

Mike Sokolov commented on LUCENE-8966:
--

Would you consider grouping numbers and (at least some) punctuation together so 
that we can preserve decimals and fractions?

> KoreanTokenizer should split unknown words on digits
> 
>
> Key: LUCENE-8966
> URL: https://issues.apache.org/jira/browse/LUCENE-8966
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
>Priority: Minor
> Attachments: LUCENE-8966.patch, LUCENE-8966.patch
>
>
> Since https://issues.apache.org/jira/browse/LUCENE-8548 the Korean tokenizer 
> groups characters of unknown words if they belong to the same script or an 
> inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the 
> rest in Latin) but this rule doesn't work well on digits since they are 
> considered common with other scripts. For instance the input "44사이즈" is kept 
> as is even though "사이즈" is part of the dictionary. We should restore the 
> original behavior and split any unknown words if a digit is followed by 
> a character of another type.
> This issue was first discovered in 
> [https://github.com/elastic/elasticsearch/issues/46365]






[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-31 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897113#comment-16897113
 ] 

Mike Sokolov commented on LUCENE-8920:
--

[~noble.paul] thanks for fixing - I thought I had been watching the mailing 
list and would have seen any build fails from this, but somehow I did not!

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, ie. lookup label -> id and then id->arc 
> offset instead of doing label->arc directly could save a lot of space in 
> some cases? Also it seems that we are repeating the label in the arc 
> metadata when array-with-gaps is used, even though it shouldn't be necessary 
> since the label is implicit from the address?






[jira] [Updated] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-19 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8920:
-
Status: Patch Available  (was: Open)

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, ie. lookup label -> id and then id->arc 
> offset instead of doing label->arc directly could save a lot of space in 
> some cases? Also it seems that we are repeating the label in the arc 
> metadata when array-with-gaps is used, even though it shouldn't be necessary 
> since the label is implicit from the address?






[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-19 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1675#comment-1675
 ] 

Mike Sokolov commented on LUCENE-8920:
--

bq. I'm making it a blocker for 8.3 since we haven't reverted from branch_8x.

Fair enough. I have the commits prepared for the refactoring steps, 
but I foolishly did them on a home PC that I cannot access right now; I will 
post them tonight.



> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, ie. lookup label -> id and then id->arc 
> offset instead of doing label->arc directly could save a lot of space in 
> some cases? Also it seems that we are repeating the label in the arc 
> metadata when array-with-gaps is used, even though it shouldn't be necessary 
> since the label is implicit from the address?






[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-18 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888416#comment-16888416
 ] 

Mike Sokolov commented on LUCENE-8920:
--

Before digging in in earnest on FST size reduction, I'd like to tighten up the 
FST.Arc contract. Right now it has all public members and no methods to speak 
of, so the abstraction boundary is not well defined, and in fact we see 
consumers modifying Arc members in a few places outside of the FST class 
itself. This makes it more difficult to reason about the code and make provably 
valid changes. My plan is to do some nonfunctional commits:

1. Add accessors (mostly getters, a few setters will be needed temporarily) to 
Arc, and make all of its members private. It seems as if we often write 
accessors with the same name as the member (rather than the bean standard), so 
I'll go with that.
2. Eliminate the setters; this will require some light refactoring in FSTEnum, 
and a few changes to the memory codec, which keeps a list of Arcs locally and 
updates them for its own purposes.
3. Some refactoring and general cleanup (tightening up access, whitespace 
fixes, etc.)

Because that first step is going to touch a lot of files, I'll keep it strictly 
about introducing the accessors, so there won't be anything beyond changing 
things like `arc.flags` to `arc.flags()`, in a lot of places.

Once these changes are in, the fun can begin again :) I'll add Adrien's 
worst-case test and work on getting the size down for that, pursuing the ideas 
in the description.
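The accessor pattern in step 1 and 2 above can be sketched in miniature. This is a self-contained illustration, not FST.Arc itself: the fields shown are a small made-up subset of the real Arc's state, and the accessors follow the convention noted above of naming the method after the field (flags() rather than getFlags()).

```java
// Minimal sketch of the planned Arc refactoring: private members, accessors
// named after the field, and temporary setters slated for removal.
public class ArcSketch {
    private byte flags;
    private int label;

    // Accessors (step 1): same name as the member, Lucene-style.
    public byte flags() { return flags; }
    public int label() { return label; }

    // Temporary setters (step 2 eliminates these once callers such as
    // FSTEnum and the memory codec stop mutating arcs directly).
    public ArcSketch setFlags(byte flags) { this.flags = flags; return this; }
    public ArcSketch setLabel(int label) { this.label = label; return this; }
}
```

With this in place, a call site that previously read `arc.flags` directly would read `arc.flags()` instead, with no behavior change.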


> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, ie. lookup label -> id and then id->arc 
> offset instead of doing label->arc directly could save a lot of space in 
> some cases? Also it seems that we are repeating the label in the arc 
> metadata when array-with-gaps is used, even though it shouldn't be necessary 
> since the label is implicit from the address?






[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-17 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887089#comment-16887089
 ] 

Mike Sokolov commented on LUCENE-8920:
--

Note: I pushed the old-format Kuromoji dictionary and it seems to have fixed 
the build.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, ie. lookup label -> id and then id->arc 
> offset instead of doing label->arc directly could save a lot of space in 
> some cases? Also it seems that we are repeating the label in the arc 
> metadata when array-with-gaps is used, even though it shouldn't be necessary 
> since the label is implicit from the address?






[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-16 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886603#comment-16886603
 ] 

Mike Sokolov edited comment on LUCENE-8920 at 7/17/19 1:27 AM:
---

Yes, that makes sense. Because we reverted the "current version" in FST.java, 
we can no longer read FSTs created with the newer version, so we need to revert 
the dictionary file.  I'll do that and run a full suite of tests just to make 
sure something else isn't still broken. Thanks for pointing this out, 
[~hossman] and finding the fix [~tomoko], and sorry for not being more careful 
with the "fix" the first time!


was (Author: sokolov):
Yes, that makes sense. Because we reverted the "current version" in FST.java, 
we can no longer read FSTs created with the newer version, so we need to revert 
the dictionary file.  I'll do that and run a full suite of tests just to make 
sure something else isn't still broken

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, ie. lookup label -> id and then id->arc 
> offset instead of doing label->arc directly could save a lot of space in 
> some cases? Also it seems that we are repeating the label in the arc 
> metadata when array-with-gaps is used, even though it shouldn't be necessary 
> since the label is implicit from the address?






[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-16 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886603#comment-16886603
 ] 

Mike Sokolov commented on LUCENE-8920:
--

Yes, that makes sense. Because we reverted the "current version" in FST.java, 
we can no longer read FSTs created with the newer version, so we need to revert 
the dictionary file.  I'll do that and run a full suite of tests just to make 
sure something else isn't still broken

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, ie. lookup label -> id and then id->arc 
> offset instead of doing label->arc directly could save a lot of space in 
> some cases? Also it seems that we are repeating the label in the arc 
> metadata when array-with-gaps is used, even though it shouldn't be necessary 
> since the label is implicit from the address?






[jira] [Updated] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-15 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8920:
-
Description: 
Some data can lead to worst-case ~4x RAM usage due to this optimization. 
Several ideas were suggested to combat this on the mailing list:

bq. I think we can improve the situation here by tracking, per-FST instance, the 
size increase we're seeing while building (or perhaps do a preliminary pass 
before building) in order to decide whether to apply the encoding. 

bq. we could also make the encoding a bit more efficient. For instance I 
noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
range), which makes gaps very costly. Associating each label with a dense id and 
having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
instead of doing label->arc directly could save a lot of space in some cases? 
Also it seems that we are repeating the label in the arc metadata when 
array-with-gaps is used, even though it shouldn't be necessary since the label 
is implicit from the address?

  was:
Some data can lead to worst-case ~4x RAM usage due to this optimization. 
Several ideas were suggested to combat this on the mailing list:

bq. I think we can improve thesituation here by tracking, per-FST instance, the 
size increase we're seeing while building (or perhaps do a preliminary pass 
before building) in order to decide whether to apply the encoding. 

bq. we could also make the encoding a
bit more efficient. For instance I noticed that arc metadata is pretty
large in some cases (in the 10-20 bytes) which make gaps very costly.
Associating each label with a dense id and having an intermediate
lookup, ie. lookup label -> id and then id->arc offset instead of
doing label->arc directly could save a lot of space in some cases?
Also it seems that we are repeating the label in the arc metadata when
array-with-gaps is used, even though it shouldn't be necessary since
the label is implicit from the address?


> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, ie. lookup label -> id and then id->arc 
> offset instead of doing label->arc directly could save a lot of space in 
> some cases? Also it seems that we are repeating the label in the arc 
> metadata when array-with-gaps is used, even though it shouldn't be necessary 
> since the label is implicit from the address?






[jira] [Created] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-07-15 Thread Mike Sokolov (JIRA)
Mike Sokolov created LUCENE-8920:


 Summary: Reduce size of FSTs due to use of direct-addressing 
encoding 
 Key: LUCENE-8920
 URL: https://issues.apache.org/jira/browse/LUCENE-8920
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Mike Sokolov


Some data can lead to worst-case ~4x RAM usage due to this optimization. 
Several ideas were suggested to combat this on the mailing list:

bq. I think we can improve thesituation here by tracking, per-FST instance, the 
size increase we're seeing while building (or perhaps do a preliminary pass 
before building) in order to decide whether to apply the encoding. 

bq. we could also make the encoding a
bit more efficient. For instance I noticed that arc metadata is pretty
large in some cases (in the 10-20 byte range), which makes gaps very costly.
Associating each label with a dense id and having an intermediate
lookup, ie. lookup label -> id and then id->arc offset instead of
doing label->arc directly could save a lot of space in some cases?
Also it seems that we are repeating the label in the arc metadata when
array-with-gaps is used, even though it shouldn't be necessary since
the label is implicit from the address?
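The label -> dense id -> arc offset indirection proposed above could look roughly like this (an illustrative sketch, not Lucene code): instead of one full-sized arc slot per possible label, store a small lookup table mapping each present label to a dense id, and index the arc array by that id, so a gap costs only a small id-table entry rather than 10-20 bytes of arc metadata.

```python
class DenseArcIndex:
    """Sketch of label -> dense id -> arc metadata, avoiding per-gap arc slots."""

    def __init__(self, arcs):
        # arcs: list of (label, arc_metadata); labels need not be contiguous.
        self.label_to_id = {}
        self.arc_table = []
        for label, meta in sorted(arcs):
            self.label_to_id[label] = len(self.arc_table)  # dense id
            self.arc_table.append(meta)

    def lookup(self, label):
        # Two-step lookup: label -> id, then id -> arc metadata.
        arc_id = self.label_to_id.get(label)
        return None if arc_id is None else self.arc_table[arc_id]

idx = DenseArcIndex([(97, "arc-a"), (200, "arc-z")])
print(idx.lookup(97))    # arc-a
print(idx.lookup(150))   # None: no arc slot wasted on the gap
```

In a real implementation the label-to-id table would itself be a compact packed structure rather than a hash map; the point is only that gaps cost one small id entry instead of full arc metadata.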






[jira] [Commented] (SOLR-13629) Remove whitespace only lines & trailing whitespace from analytics package

2019-07-13 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884528#comment-16884528
 ] 

Mike Sokolov commented on SOLR-13629:
-

We don't want to remove all the blank (or whitespace-only) lines; many of them 
are there to help with readability. Glancing at the patch it looks like you did 
not remove blank lines, just trailing white space, so that's good, but if 
that's the intent, you should update the issue description.
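The distinction drawn here — strip trailing whitespace, keep blank lines for readability — fits in a couple of lines (an illustrative sketch, not the attached patch):

```python
def strip_trailing_whitespace(text):
    # Remove spaces/tabs at the end of each line. A whitespace-only line
    # becomes an empty line rather than being deleted, so the blank lines
    # that aid readability survive.
    return "\n".join(line.rstrip() for line in text.split("\n"))

src = "int x;   \n\t\nint y;\n"
print(repr(strip_trailing_whitespace(src)))  # 'int x;\n\nint y;\n'
```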

> Remove whitespace only lines & trailing whitespace from analytics package
> -
>
> Key: SOLR-13629
> URL: https://issues.apache.org/jira/browse/SOLR-13629
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.1.1
>Reporter: Neal Sidhwaney
>Priority: Trivial
> Attachments: SOLR-13629.patch
>
>
> I'm making some changes to analytics and noticed that the guidelines ask to 
> create separate patches for formatting/whitespace changes.  This issue is 
> meant for the patch to remove whitespace only lines as well as trailing 
> whitespace from lines. 






[jira] [Commented] (SOLR-6672) function results' names should not include trailing whitespace

2019-07-08 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880853#comment-16880853
 ] 

Mike Sokolov commented on SOLR-6672:


Thanks! I had forgotten about this. Did you at least test interactively?

> function results' names should not include trailing whitespace
> --
>
> Key: SOLR-6672
> URL: https://issues.apache.org/jira/browse/SOLR-6672
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Reporter: Mike Sokolov
>Priority: Minor
> Attachments: SOLR-6672.patch
>
>
> If you include a function as a result field in a list of multiple fields 
> separated by white space, the corresponding key in the result markup includes 
> trailing whitespace; Example:
> {code}
> fl="id field(units_used) archive_id"
> {code}
> ends up returning results like this:
> {code}
>   {
> "id": "nest.epubarchive.1",
> "archive_id": "urn:isbn:97849D42C5A01",
> "field(units_used) ": 123
>   ^
>   }
> {code}
> A workaround is to use comma separators instead of whitespace
> {code} 
> fl="id,field(units_used),archive_id"
> {code}






[jira] [Commented] (LUCENE-4312) Index format to store position length per position

2019-07-06 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879757#comment-16879757
 ] 

Mike Sokolov commented on LUCENE-4312:
--

Yes, we're compromising precision today when we apply index-time synonyms (and 
other analysis that produces token graphs). I think this would be an awesome 
addition if the cost of indexing position length in postings is not too great, 
and I think you are right -- it will usually be 1, sometimes 0 and rarely > 1, 
so we should be able to encode compactly, and decode efficiently.
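The distribution described (almost always 1, sometimes 0, rarely larger) suggests remapping the value so the common case gets the cheapest code before variable-byte encoding. The scheme below is an illustrative sketch of one such encoding, not a committed design:

```python
def pos_len_to_code(pos_len):
    # Remap so the overwhelmingly common value (1) gets code 0,
    # the occasional 0 gets code 1, and rarer lengths stay small.
    if pos_len == 1:
        return 0
    if pos_len == 0:
        return 1
    return pos_len  # lengths >= 2 keep their value

def code_to_pos_len(code):
    if code == 0:
        return 1
    if code == 1:
        return 0
    return code

def vbyte(code):
    # Standard variable-byte encoding: 7 data bits per byte,
    # high bit set on continuation bytes.
    out = bytearray()
    while code >= 0x80:
        out.append((code & 0x7F) | 0x80)
        code >>= 7
    out.append(code)
    return bytes(out)

lengths = [1, 1, 1, 0, 2, 1]
encoded = b"".join(vbyte(pos_len_to_code(n)) for n in lengths)
print(len(encoded))  # 6 bytes, one per position, mostly zeros
# All codes here fit in one byte, so decoding is byte-per-byte:
assert [code_to_pos_len(b) for b in encoded] == lengths
```

Because the common code is 0, a long run of "position length 1" positions becomes a run of zero bytes, which a run-length or block compression layer could shrink further.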

> Index format to store position length per position
> --
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 6.0
>Reporter: Gang Luo
>Priority: Minor
>  Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs. The indexer ignores 
> PositionLengthAttribute. We need to change the index format (and Codec APIs) to 
> store an additional int position length per position.






[jira] [Commented] (LUCENE-8895) Switch all FSTs to use direct addressing optimization

2019-07-03 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877938#comment-16877938
 ] 

Mike Sokolov commented on LUCENE-8895:
--

Ah yes, thanks! I've now deprecated the other one too.

> Switch all FSTs to use direct addressing optimization
> -
>
> Key: LUCENE-8895
> URL: https://issues.apache.org/jira/browse/LUCENE-8895
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
> Fix For: 8.2
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> See discussion in LUCENE-8781 about turning on array-with-gaps encoding 
> everywhere. Let's conduct any further discussion here so we can use an open 
> issue.
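For context, the two arc layouts under discussion — binary search over a packed arc array versus array-with-gaps direct addressing — differ as sketched below (an illustrative sketch, not Lucene code; the gap sentinel and slot layout are invented for the example):

```python
import bisect

def lookup_binary_search(sorted_labels, targets, label):
    # Packed layout: one entry per real arc, located by binary search.
    i = bisect.bisect_left(sorted_labels, label)
    if i < len(sorted_labels) and sorted_labels[i] == label:
        return targets[i]
    return None

def lookup_direct(min_label, slots, label):
    # Direct layout: slots cover [min_label, max_label]; absent labels
    # hold None -- the "gaps" whose size cost the load factor bounds.
    i = label - min_label
    if 0 <= i < len(slots):
        return slots[i]
    return None

labels, targets = [97, 99, 100], ["a", "c", "d"]
slots = ["a", None, "c", "d"]  # labels 97..100 with a gap at 98
for lbl in (97, 98, 99, 100, 101):
    assert lookup_binary_search(labels, targets, lbl) == lookup_direct(97, slots, lbl)
```

The direct layout turns each arc lookup into one subtraction and one array read, which is where the PKLookup and suggester gains come from; the gaps are the RAM cost debated in LUCENE-8920.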






[jira] [Updated] (LUCENE-8895) Switch all FSTs to use direct addressing optimization

2019-07-02 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8895:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Switch all FSTs to use direct addressing optimization
> -
>
> Key: LUCENE-8895
> URL: https://issues.apache.org/jira/browse/LUCENE-8895
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
> Fix For: 8.2
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> See discussion in LUCENE-8781 about turning on array-with-gaps encoding 
> everywhere. Let's conduct any further discussion here so we can use an open 
> issue.






[jira] [Updated] (LUCENE-8895) Switch all FSTs to use direct addressing optimization

2019-07-02 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8895:
-
Fix Version/s: 8.2

> Switch all FSTs to use direct addressing optimization
> -
>
> Key: LUCENE-8895
> URL: https://issues.apache.org/jira/browse/LUCENE-8895
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
> Fix For: 8.2
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> See discussion in LUCENE-8781 about turning on array-with-gaps encoding 
> everywhere. Let's conduct any further discussion here so we can use an open 
> issue.






[jira] [Resolved] (LUCENE-8781) Explore FST direct array arc encoding

2019-07-02 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov resolved LUCENE-8781.
--
Resolution: Fixed

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't included here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison showed no 
> significant change of the baseline (no opto applied) vs the current master 
> FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvement in 
> some of the queries.
> {noformat}
> TaskQPS base  StdDevQPS opto  StdDev  
>   Pct diff
>   OrHighHigh6.93  (3.0%)6.89  (3.1%)   
> -0.5% (  -6% -5%)
>OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
> -0.5% (  -7% -7%)
> Wildcard8.72  (4.7%)8.69  (4.6%)   
> -0.4% (  -9% -9%)
>   AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
> -0.2% (  -5% -5%)
>OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
> -0.1% (  -5% -5%)
>   AndHighMed   52.23  (4.1%)   52.41  (5.3%)
> 0.3% (  -8% -   10%)
>  MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
> 0.4% (  -6% -8%)
> HighTerm .10  (3.4%) 1116.70  (4.0%)
> 0.5% (  -6% -8%)
>HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
> 1.0% ( -15% -   20%)
>  AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
> 1.2% (  -8% -   12%)
>HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)
> 1.6% ( -19% -   28%)
>  LowTerm 1338.94  (3.3%) 1383.90  (5.5%)
> 3.4% (  -5% -   12%)
> PKLookup  120.45  (2.5%)  130.91  (3.5%)
> 8.7% (   2% -   15%)
> {noformat}
> h4. FST perf tests
> I ran LookupBenchmarkTest to see the impact on the suggesters which make 
> heavy use of FSTs. Some show little or no 

[jira] [Commented] (LUCENE-8781) Explore FST direct array arc encoding

2019-07-02 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877104#comment-16877104
 ] 

Mike Sokolov commented on LUCENE-8781:
--

The extension of this feature to more use cases is tracked in LUCENE-8895, so I 
think I'll close again. You can see the proposed changes in the attached PR 
there.

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't included here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison showed no 
> significant change of the baseline (no opto applied) vs the current master 
> FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvement in 
> some of the queries.
> {noformat}
> TaskQPS base  StdDevQPS opto  StdDev  
>   Pct diff
>   OrHighHigh6.93  (3.0%)6.89  (3.1%)   
> -0.5% (  -6% -5%)
>OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
> -0.5% (  -7% -7%)
> Wildcard8.72  (4.7%)8.69  (4.6%)   
> -0.4% (  -9% -9%)
>   AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
> -0.2% (  -5% -5%)
>OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
> -0.1% (  -5% -5%)
>   AndHighMed   52.23  (4.1%)   52.41  (5.3%)
> 0.3% (  -8% -   10%)
>  MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
> 0.4% (  -6% -8%)
> HighTerm .10  (3.4%) 1116.70  (4.0%)
> 0.5% (  -6% -8%)
>HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
> 1.0% ( -15% -   20%)
>  AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
> 1.2% (  -8% -   12%)
>HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)
> 1.6% ( -19% -   28%)
>  LowTerm 1338.94  (3.3%) 1383.90  (5.5%)
> 3.4% (  -5% -   12%)
> PKLookup  120.45  (2.5%)  130.91  (3.5%)

[jira] [Updated] (LUCENE-8895) Switch all FSTs to use direct addressing optimization

2019-06-30 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8895:
-
Status: Patch Available  (was: Open)

> Switch all FSTs to use direct addressing optimization
> -
>
> Key: LUCENE-8895
> URL: https://issues.apache.org/jira/browse/LUCENE-8895
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See discussion in LUCENE-8781 about turning on array-with-gaps encoding 
> everywhere. Let's conduct any further discussion here so we can use an open 
> issue.






[jira] [Created] (LUCENE-8895) Switch all FSTs to use direct addressing optimization

2019-06-30 Thread Mike Sokolov (JIRA)
Mike Sokolov created LUCENE-8895:


 Summary: Switch all FSTs to use direct addressing optimization
 Key: LUCENE-8895
 URL: https://issues.apache.org/jira/browse/LUCENE-8895
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Mike Sokolov


See discussion in LUCENE-8781 about turning on array-with-gaps encoding 
everywhere. Let's conduct any further discussion here so we can use an open 
issue.






[jira] [Commented] (LUCENE-8781) Explore FST direct array arc encoding

2019-06-29 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875610#comment-16875610
 ] 

Mike Sokolov commented on LUCENE-8781:
--

Well, there is an easy fix for {{blocktreeords}}, but it might cost some 
performance. 

This codec uses the somewhat esoteric feature {{getByOutput}}, which is akin to 
the implementation in {{fst.Util.getByOutput}}. Both of these look up by the 
output of the FST, which only works when the outputs are guaranteed to be 
ordered the same as the inputs. By the way, the Util version is never used 
anywhere - we should probably delete it? At any rate, I had previously "fixed" 
the Util version by having it scan forward over arcs (rather than do a binary 
search). It's not possible to do direct lookup by output.
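To make the constraint concrete (a simplified sketch, not the blocktreeords code): getByOutput works only because outputs are assigned in the same order as input labels, so at each node one can walk the arcs forward until the cumulative output passes the target. With gaps in a direct-addressed arc array, a binary search keyed on output no longer applies cleanly, and the scan-forward fallback looks roughly like this:

```python
def seek_by_output(arcs, target):
    """arcs: slots of (label, output) with outputs ascending, or None for gaps.

    Scan-forward lookup by output: return the last arc whose output does
    not exceed the target. Works even when the arc array contains gap
    slots, at the price of O(n) per node instead of O(log n).
    """
    best = None
    for arc in arcs:
        if arc is None:          # gap slot from direct addressing
            continue
        label, output = arc
        if output > target:
            break                # outputs are ordered, so we can stop early
        best = (label, output)
    return best

arcs = [(97, 0), None, (99, 5), (100, 9)]   # gap slot at label 98
print(seek_by_output(arcs, 7))   # (99, 5): largest output <= 7
```

This is the "probably small" penalty mentioned: the scan is linear per node, but node fan-outs are typically modest.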

I'm inclined to impose this (probably small) penalty in the interest of 
simplifying the code base, but it would be good to hear from people who know 
more about this {{blocktreeords}}. I guess it's an experimental codec (not the 
default) -- is it seeing any use that we know about? What is it designed for? 
[~mikemccand] do you know?

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't included here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison showed no 
> significant change of the baseline (no opto applied) vs the current master 
> FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvement in 
> some of the queries.
> {noformat}
> TaskQPS base  StdDevQPS opto  StdDev  
>   Pct diff
>   OrHighHigh6.93  (3.0%)6.89  (3.1%)   
> -0.5% (  -6% -5%)
>OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
> -0.5% (  -7% -7%)
> Wildcard8.72  (4.7%)8.69  (4.6%)   
> -0.4% (  -9% -9%)
>   AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
> -0.2% (  -5% -5%)
>OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
> -0.1% (  -5% -5%)
>

[jira] [Commented] (LUCENE-8781) Explore FST direct array arc encoding

2019-06-29 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875607#comment-16875607
 ] 

Mike Sokolov commented on LUCENE-8781:
--

Hmm, I found that {{blocktreeords}} codec has some FST-traversal code of its 
own that needs to be upgraded, so this won't be a trivial switchover.

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't included here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison showed no 
> significant change of the baseline (no opto applied) vs the current master 
> FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvement in 
> some of the queries.
> {noformat}
> TaskQPS base  StdDevQPS opto  StdDev  
>   Pct diff
>   OrHighHigh6.93  (3.0%)6.89  (3.1%)   
> -0.5% (  -6% -5%)
>OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
> -0.5% (  -7% -7%)
> Wildcard8.72  (4.7%)8.69  (4.6%)   
> -0.4% (  -9% -9%)
>   AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
> -0.2% (  -5% -5%)
>OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
> -0.1% (  -5% -5%)
>   AndHighMed   52.23  (4.1%)   52.41  (5.3%)
> 0.3% (  -8% -   10%)
>  MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
> 0.4% (  -6% -8%)
> HighTerm .10  (3.4%) 1116.70  (4.0%)
> 0.5% (  -6% -8%)
>HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
> 1.0% ( -15% -   20%)
>  AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
> 1.2% (  -8% -   12%)
>HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)
> 1.6% ( -19% -   28%)
>  LowTerm 1338.94  (3.3%) 1383.90  (5.5%)
> 3.4% (  -5% -   12%)
> PKLookup  120.45  (2.5%)  130.91  (3.5%)
> 8.7% (   2% -   

[jira] [Commented] (LUCENE-8781) Explore FST direct array arc encoding

2019-06-29 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875600#comment-16875600
 ] 

Mike Sokolov commented on LUCENE-8781:
--

Funny you should mention this - I just today tested Kuromoji after enabling 
this and I see a 10% reduction in the time to complete 
{{TestJapaneseTokenizer.testWikipedia}}. This was a case I was worried about, 
but it seems fine, so I agree - we should just turn this on as the default. I'm 
working up a change that will simply remove the parameter and set it always on.

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't included here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison showed no 
> significant change of the baseline (no opto applied) vs the current master 
> FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvement in 
> some of the queries.
> {noformat}
> TaskQPS base  StdDevQPS opto  StdDev  
>   Pct diff
>   OrHighHigh6.93  (3.0%)6.89  (3.1%)   
> -0.5% (  -6% -5%)
>OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
> -0.5% (  -7% -7%)
> Wildcard8.72  (4.7%)8.69  (4.6%)   
> -0.4% (  -9% -9%)
>   AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
> -0.2% (  -5% -5%)
>OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
> -0.1% (  -5% -5%)
>   AndHighMed   52.23  (4.1%)   52.41  (5.3%)
> 0.3% (  -8% -   10%)
>  MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
> 0.4% (  -6% -8%)
> HighTerm .10  (3.4%) 1116.70  (4.0%)
> 0.5% (  -6% -8%)
>HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
> 1.0% ( -15% -   20%)
>  AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
> 1.2% (  -8% -   12%)
>HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)

[jira] [Resolved] (LUCENE-8871) Move Kuromoji DictionaryBuilder tool from src/tools to src/

2019-06-29 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov resolved LUCENE-8871.
--
Resolution: Fixed

> Move Kuromoji DictionaryBuilder tool from src/tools to src/ 
> 
>
> Key: LUCENE-8871
> URL: https://issues.apache.org/jira/browse/LUCENE-8871
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
> Fix For: 8.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently tests in tools directories are not run as part of the normal 
> testing done by {{ant test}} - you have to explicitly run {{test-tools}}, 
> which it seems people don't do (and it might not survive translation to 
> gradle, who knows), so [~rcmuir] suggested we just move the tools into the 
> main source tree (under src/java and src/test)






[jira] [Updated] (LUCENE-8871) Move Kuromoji DictionaryBuilder tool from src/tools to src/

2019-06-29 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8871:
-
Fix Version/s: 8.2

> Move Kuromoji DictionaryBuilder tool from src/tools to src/ 
> 
>
> Key: LUCENE-8871
> URL: https://issues.apache.org/jira/browse/LUCENE-8871
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
> Fix For: 8.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently tests in tools directories are not run as part of the normal 
> testing done by {{ant test}} - you have to explicitly run {{test-tools}}, 
> which it seems people don't do (and it might not survive translation to 
> gradle, who knows), so [~rcmuir] suggested we just move the tools into the 
> main source tree (under src/java and src/test)






[jira] [Commented] (SOLR-13571) Make recent RefGuide rank well in Google

2019-06-27 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874290#comment-16874290
 ] 

Mike Sokolov commented on SOLR-13571:
-

Have we ever tried publishing a site map? Google used to have a feature that 
would read an XML file describing all the pages on the site as a hint to its 
crawler. Also, I wonder if we have ever checked out Google Webmaster Tools for 
the documentation site(s). 

> Make recent RefGuide rank well in Google
> 
>
> Key: SOLR-13571
> URL: https://issues.apache.org/jira/browse/SOLR-13571
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Jan Høydahl
>Priority: Major
>
> Spinoff from SOLR-13548
> The old Confluence ref-guide has a lot of pages pointing to it, and all of 
> that link karma is delegated to the {{/solr/guide/6_6/}} html ref guide, 
> making it often rank top. However we'd want newer content to rank high. See 
> these comments for some first ideas.






[jira] [Commented] (LUCENE-8871) Move Kuromoji DictionaryBuilder tool from src/tools to src/

2019-06-27 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874014#comment-16874014
 ] 

Mike Sokolov commented on LUCENE-8871:
--

I see, thanks for explaining. I was reading the commit backwards, and thought 
that you had made the classes *public* rather than the other way around. I 
think all that remains here now is to back port to 8.x branch. I think that is 
worth doing, and safe

> Move Kuromoji DictionaryBuilder tool from src/tools to src/ 
> 
>
> Key: LUCENE-8871
> URL: https://issues.apache.org/jira/browse/LUCENE-8871
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently tests in tools directories are not run as part of the normal 
> testing done by {{ant test}} - you have to explicitly run {{test-tools}}, 
> which it seems people don't do (and it might not survive translation to 
> gradle, who knows), so [~rcmuir] suggested we just move the tools into the 
> main source tree (under src/java and src/test)






[jira] [Commented] (LUCENE-8871) Move Kuromoji DictionaryBuilder tool from src/tools to src/

2019-06-27 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874009#comment-16874009
 ] 

Mike Sokolov commented on LUCENE-8871:
--

I see what you did there [~jpountz]! Thank you for fixing. I have to say I'm 
really confused why this failed now, yet I am pretty sure I ran precommit 
earlier. I may have been distracted and forgot, but I thought I had done it. In 
principle the visibility changes seem OK to me, but I wonder why they were 
needed. I would have thought these classes were only referenced from their own 
package? I'm not seeing the whole picture - maybe some crosstalk between 
o.a.l.a.ja.dict and o.a.l.a.ja.util?

> Move Kuromoji DictionaryBuilder tool from src/tools to src/ 
> 
>
> Key: LUCENE-8871
> URL: https://issues.apache.org/jira/browse/LUCENE-8871
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently tests in tools directories are not run as part of the normal 
> testing done by {{ant test}} - you have to explicitly run {{test-tools}}, 
> which it seems people don't do (and it might not survive translation to 
> gradle, who knows), so [~rcmuir] suggested we just move the tools into the 
> main source tree (under src/java and src/test)






[jira] [Commented] (LUCENE-8871) Move Kuromoji DictionaryBuilder tool from src/tools to src/

2019-06-27 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874008#comment-16874008
 ] 

Mike Sokolov commented on LUCENE-8871:
--

I see what you did there! Thank you for fixing. I have to say I'm
really confused why this failed now, yet I am pretty sure I ran
precommit earlier. I may have been distracted and forgot, but I
thought I had done it. In principle the visibility changes seem OK to
me, but I wonder why they were needed. I would have thought these
classes were only referenced from their own package? I'm not seeing
the whole picture - maybe some crosstalk between o.a.l.a.ja.dict and
o.a.l.a.ja.util?



> Move Kuromoji DictionaryBuilder tool from src/tools to src/ 
> 
>
> Key: LUCENE-8871
> URL: https://issues.apache.org/jira/browse/LUCENE-8871
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently tests in tools directories are not run as part of the normal 
> testing done by {{ant test}} - you have to explicitly run {{test-tools}}, 
> which it seems people don't do (and it might not survivie translation to 
> gradle, who knows), so [~rcmuir] suggested we just move the tools into the 
> main source tree (under src/java and src/test)






[jira] [Commented] (LUCENE-8871) Move Kuromoji DictionaryBuilder tool from src/tools to src/

2019-06-25 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872790#comment-16872790
 ] 

Mike Sokolov commented on LUCENE-8871:
--

Thanks for reviewing. FYI I will be delayed a bit in pushing since my primary 
laptop died, and I'm traveling, but will get back to this soon.

> Move Kuromoji DictionaryBuilder tool from src/tools to src/ 
> 
>
> Key: LUCENE-8871
> URL: https://issues.apache.org/jira/browse/LUCENE-8871
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently tests in tools directories are not run as part of the normal 
> testing done by {{ant test}} - you have to explicitly run {{test-tools}}, 
> which it seems people don't do (and it might not survive translation to 
> gradle, who knows), so [~rcmuir] suggested we just move the tools into the 
> main source tree (under src/java and src/test)






[jira] [Commented] (LUCENE-8871) Move Kuromoji DictionaryBuilder tool from src/tools to src/

2019-06-24 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871403#comment-16871403
 ] 

Mike Sokolov commented on LUCENE-8871:
--

This has been up for a day, and is I think pretty uncontroversial - just moving 
files, and some code hygiene. Unless there are objections, I'll push this later 
today

> Move Kuromoji DictionaryBuilder tool from src/tools to src/ 
> 
>
> Key: LUCENE-8871
> URL: https://issues.apache.org/jira/browse/LUCENE-8871
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently tests in tools directories are not run as part of the normal 
> testing done by {{ant test}} - you have to explicitly run {{test-tools}}, 
> which it seems people don't do (and it might not survive translation to 
> gradle, who knows), so [~rcmuir] suggested we just move the tools into the 
> main source tree (under src/java and src/test)






[jira] [Commented] (LUCENE-8869) Build kuromoji system dictionary as a separated jar and load it from JapaneseTokenizer at runtime

2019-06-23 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870518#comment-16870518
 ] 

Mike Sokolov commented on LUCENE-8869:
--

[~tomoko] there might be some minor conflicts with LUCENE-8871, since it also 
touches the code that reads the resources, but they should be easy to resolve, 
I think?

 

> Build kuromoji system dictionary as a separated jar and load it from 
> JapaneseTokenizer at runtime
> -
>
> Key: LUCENE-8869
> URL: https://issues.apache.org/jira/browse/LUCENE-8869
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> This is a sub-task for LUCENE-8816.
>  In this issue, I will try to make small but self-contained changes to 
> kuromoji system dictionary.
>  - Make it possible to build a jar that contains (maybe) only dictionary data 
> resource generated by the {{build-dict}} task.
>  -- Maybe a new ant target will be added.
>  - Make it possible to load external dictionary when initializing 
> JapaneseTokenizer.
>  -- Some work is already done on LUCENE-8863
>  - Decouple current system dictionary data (mecab ipadic) from kuromoji 
> itself and use it as default (Possibly it can be done with another issue).
> Also, some refactoring of the directory/source tree structure may be needed.
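The dynamic-load goal above amounts to resolving dictionary resources from the classpath, so a separately built dictionary jar can be swapped in without code changes. A minimal sketch of that resolution shape (hypothetical method and resource names, not the actual Kuromoji API):

```java
import java.io.InputStream;

public class DictionaryLoad {
    // Hypothetical: open a dictionary resource from the classpath, e.g. one
    // shipped in a separately built dictionary jar placed on the classpath.
    static InputStream openResource(Class<?> clazz, String name) {
        InputStream in = clazz.getResourceAsStream(name);
        if (in == null) {
            throw new IllegalStateException("dictionary resource not found: " + name);
        }
        return in;
    }

    public static void main(String[] args) throws Exception {
        // Any resource known to be on the classpath works for the demo; a real
        // dictionary jar would ship binary files alongside the loading class.
        try (InputStream in = openResource(Object.class, "/java/lang/Object.class")) {
            int first = in.read(); // class files start with the 0xCAFEBABE magic
            if (first != 0xCA) {
                throw new AssertionError("unexpected first byte: " + first);
            }
            System.out.println("opened resource, first byte = " + first);
        }
    }
}
```

Because lookup goes through the classpath, replacing the dictionary is just a matter of which jar is present at runtime; nothing in the tokenizer needs to know where the data came from.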






[jira] [Commented] (LUCENE-8870) Support numeric value in Field class

2019-06-22 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870200#comment-16870200
 ] 

Mike Sokolov commented on LUCENE-8870:
--

Personally I find the Field type facade kind of annoying; it imposes an 
artificial type safety: in the end we store the value as an Object and later 
cast it. Callers that also handle values generically, as Objects, then need an 
adapter to detect the type of a value and cast it properly, only to have 
Lucene throw away all the type info and do that dance all over again internally!

Having said that, it's not really relevant to this change, which seems helpful. 
Maybe use {{Objects.requireNonNull}} for the null checks?
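The {{Objects.requireNonNull}} suggestion replaces hand-rolled {{if (x == null) throw ...}} checks in constructors. A minimal illustration with a made-up field class (not Lucene's actual {{Field}}):

```java
import java.util.Objects;

public class NullChecks {
    static final class NumericField {
        final String name;
        final Number value;

        NumericField(String name, Number value) {
            // requireNonNull throws NullPointerException with the given
            // message and returns the argument, so it composes with assignment.
            this.name = Objects.requireNonNull(name, "name must not be null");
            this.value = Objects.requireNonNull(value, "value must not be null");
        }
    }

    public static void main(String[] args) {
        NumericField f = new NumericField("price", 42L);
        System.out.println(f.name + "=" + f.value); // prints price=42
        try {
            new NumericField("price", null);
            throw new AssertionError("expected NullPointerException");
        } catch (NullPointerException expected) {
            System.out.println("rejected null as expected");
        }
    }
}
```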

> Support numeric value in Field class
> 
>
> Key: LUCENE-8870
> URL: https://issues.apache.org/jira/browse/LUCENE-8870
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Namgyu Kim
>Priority: Major
> Attachments: LUCENE-8870.patch
>
>
> I checked the following comment in Field class.
> {code:java}
> // TODO: allow direct construction of int, long, float, double value too..?
> {code}
> We already have some fields like IntPoint and StoredField, but I think it's 
> okay.
> The test cases are set in the TestField class.






[jira] [Resolved] (LUCENE-8863) Improve Kuromoji DictionaryBuilder error handling, and enable loading external dictionary for testing

2019-06-20 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov resolved LUCENE-8863.
--
Resolution: Fixed

> Improve Kuromoji DictionaryBuilder error handling, and enable loading 
> external dictionary for testing 
> --
>
> Key: LUCENE-8863
> URL: https://issues.apache.org/jira/browse/LUCENE-8863
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
> Fix For: 8.2
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but the assertions aren't 
> enabled by default, so an unsuspecting new user doesn't get any benefit from 
> them, so we should upgrade to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!)
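The 12-vs-13-bit limit described above comes down to bit-twiddling: left and right ids packed into a single word. A hypothetical sketch of packing and unpacking two full 13-bit ids (illustrative only, not the actual Kuromoji binary format):

```java
public class IdPacking {
    private static final int MASK_13 = 0x1FFF; // 13 bits: ids 0..8191

    // Pack two 13-bit ids into the low 26 bits of an int, rejecting
    // out-of-range values instead of silently truncating them.
    static int pack(int leftId, int rightId) {
        if (leftId < 0 || leftId > MASK_13 || rightId < 0 || rightId > MASK_13) {
            throw new IllegalArgumentException("id out of 13-bit range");
        }
        return (leftId << 13) | rightId;
    }

    static int leftId(int packed)  { return (packed >>> 13) & MASK_13; }
    static int rightId(int packed) { return packed & MASK_13; }

    public static void main(String[] args) {
        // Both ids exceed the 12-bit limit (4095) that a 12-bit scheme allows.
        int packed = pack(5000, 8191);
        if (leftId(packed) != 5000 || rightId(packed) != 8191) {
            throw new AssertionError("round-trip failed");
        }
        System.out.println("ok"); // prints ok
    }
}
```

With only 12 bits masked off, ids above 4095 would wrap around silently, which is exactly the kind of input error the description argues should surface as a real exception rather than a disabled-by-default assertion.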






[jira] [Updated] (LUCENE-8863) Improve Kuromoji DictionaryBuilder error handling, and enable loading external dictionary for testing

2019-06-20 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8863:
-
Fix Version/s: 8.2

> Improve Kuromoji DictionaryBuilder error handling, and enable loading 
> external dictionary for testing 
> --
>
> Key: LUCENE-8863
> URL: https://issues.apache.org/jira/browse/LUCENE-8863
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
> Fix For: 8.2
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but the assertions aren't 
> enabled by default, so an unsuspecting new user doesn't get any benefit from 
> them, so we should upgrade to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!)






[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-19 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867977#comment-16867977
 ] 

Mike Sokolov commented on LUCENE-8816:
--

LUCENE-8871 opened to cover moving dictionary builder tools into main kuromoji 
source tree, mostly so it gets tested properly.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is rather old and has not been maintained for many 
> years. While it has slowly become obsolete, well-maintained and/or extended 
> dictionaries have risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts have been made in Japan.
> However, the current architecture - a dictionary bundled into the jar - is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dived into the code yet, so I have no idea whether it's easy or 
> difficult at this moment.






[jira] [Created] (LUCENE-8871) Move Kuromoji DictionaryBuilder tool from src/tools to src/

2019-06-19 Thread Mike Sokolov (JIRA)
Mike Sokolov created LUCENE-8871:


 Summary: Move Kuromoji DictionaryBuilder tool from src/tools to 
src/ 
 Key: LUCENE-8871
 URL: https://issues.apache.org/jira/browse/LUCENE-8871
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Mike Sokolov


Currently tests in tools directories are not run as part of the normal testing 
done by {{ant test}} - you have to explicitly run {{test-tools}}, which it 
seems people don't do (and it might not survive translation to gradle, who 
knows), so [~rcmuir] suggested we just move the tools into the main source tree 
(under src/java and src/test)






[jira] [Commented] (LUCENE-8863) Improve Kuromoji DictionaryBuilder error handling, and enable loading external dictionary for testing

2019-06-19 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867970#comment-16867970
 ] 

Mike Sokolov commented on LUCENE-8863:
--

Agreed - I'll edit the description to indicate how we added a constructor here

> Improve Kuromoji DictionaryBuilder error handling, and enable loading 
> external dictionary for testing 
> --
>
> Key: LUCENE-8863
> URL: https://issues.apache.org/jira/browse/LUCENE-8863
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but the assertions aren't 
> enabled by default, so an unsuspecting new user doesn't get any benefit from 
> them, so we should upgrade to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!)






[jira] [Updated] (LUCENE-8863) Improve Kuromoji DictionaryBuilder error handling, and enable loading external dictionary for testing

2019-06-19 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8863:
-
Summary: Improve Kuromoji DictionaryBuilder error handling, and enable 
loading external dictionary for testing   (was: Improve handling of edge cases 
in Kuromoji's DIctionaryBuilder)

> Improve Kuromoji DictionaryBuilder error handling, and enable loading 
> external dictionary for testing 
> --
>
> Key: LUCENE-8863
> URL: https://issues.apache.org/jira/browse/LUCENE-8863
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but the assertions aren't 
> enabled by default, so an unsuspecting new user doesn't get any benefit from 
> them, so we should upgrade to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!)






[jira] [Comment Edited] (LUCENE-8781) Explore FST direct array arc encoding

2019-06-19 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867966#comment-16867966
 ] 

Mike Sokolov edited comment on LUCENE-8781 at 6/19/19 8:09 PM:
---

re-closing after pushing fix that handled missing case (found while testing 
memory codec)


was (Author: sokolov):
re-closing after pushing fix that handled missing case (in memory codec)

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't included here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison showed no 
> significant change between the baseline (no opto applied) and the current 
> master FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvement in 
> some of the queries.
> {noformat}
> TaskQPS base  StdDevQPS opto  StdDev  
>   Pct diff
>   OrHighHigh6.93  (3.0%)6.89  (3.1%)   
> -0.5% (  -6% -5%)
>OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
> -0.5% (  -7% -7%)
> Wildcard8.72  (4.7%)8.69  (4.6%)   
> -0.4% (  -9% -9%)
>   AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
> -0.2% (  -5% -5%)
>OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
> -0.1% (  -5% -5%)
>   AndHighMed   52.23  (4.1%)   52.41  (5.3%)
> 0.3% (  -8% -   10%)
>  MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
> 0.4% (  -6% -8%)
> HighTerm .10  (3.4%) 1116.70  (4.0%)
> 0.5% (  -6% -8%)
>HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
> 1.0% ( -15% -   20%)
>  AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
> 1.2% (  -8% -   12%)
>HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)
> 1.6% ( -19% -   28%)
>  LowTerm 1338.94  (3.3%) 1383.90  (5.5%)
> 3.4% (  -5% -   12%)
>  
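The direct-addressing idea in the quoted description trades space for O(1) arc lookup by label instead of binary search. A simplified, self-contained model (hypothetical names; the density test below is this sketch's reading of the quoted load-factor rule, i.e. use direct encoding when the label range is at most load times the number of arcs):

```java
import java.util.Arrays;

public class DirectArcLookup {
    // Build a direct-addressed table: table[label - minLabel] holds the
    // target node, or -1 where no arc exists for that label.
    static int[] buildDirect(int[] labels, int[] targets, int minLabel, int maxLabel) {
        int[] table = new int[maxLabel - minLabel + 1];
        Arrays.fill(table, -1);
        for (int i = 0; i < labels.length; i++) {
            table[labels[i] - minLabel] = targets[i];
        }
        return table;
    }

    // O(1): index straight into the table by label.
    static int lookupDirect(int[] table, int minLabel, int label) {
        int idx = label - minLabel;
        return (idx < 0 || idx >= table.length) ? -1 : table[idx];
    }

    // O(log n): what the non-direct encoding does over sorted arc labels.
    static int lookupBinary(int[] labels, int[] targets, int label) {
        int i = Arrays.binarySearch(labels, label);
        return i < 0 ? -1 : targets[i];
    }

    public static void main(String[] args) {
        int[] labels = {'a', 'c', 'd', 'f'}; // sorted outgoing-arc labels
        int[] targets = {10, 20, 30, 40};    // hypothetical target nodes
        int load = 4;
        int range = 'f' - 'a';               // 5
        boolean useDirect = range <= load * labels.length; // dense enough?
        int[] table = buildDirect(labels, targets, 'a', 'f');
        if (lookupDirect(table, 'a', 'c') != lookupBinary(labels, targets, 'c')) {
            throw new AssertionError("direct and binary lookup disagree");
        }
        if (lookupDirect(table, 'a', 'b') != -1) {
            throw new AssertionError("missing arc should return -1");
        }
        System.out.println("useDirect=" + useDirect); // prints useDirect=true
    }
}
```

The gaps (here 'b' and 'e') are the space cost; a low load factor limits how sparse a table may be before falling back to binary search, which matches the size-vs-speed trade-off in the measurements above.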

[jira] [Resolved] (LUCENE-8781) Explore FST direct array arc encoding

2019-06-19 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov resolved LUCENE-8781.
--
Resolution: Fixed

re-closing after pushing fix that handled missing case (in memory codec)

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing, which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't include here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison showed no 
> significant change between the baseline (no opto applied) and the current master 
> FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when load * (number of outgoing arcs) was > (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvement in 
> some of the queries.
> {noformat}
> TaskQPS base  StdDevQPS opto  StdDev  
>   Pct diff
>   OrHighHigh6.93  (3.0%)6.89  (3.1%)   
> -0.5% (  -6% -5%)
>OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
> -0.5% (  -7% -7%)
> Wildcard8.72  (4.7%)8.69  (4.6%)   
> -0.4% (  -9% -9%)
>   AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
> -0.2% (  -5% -5%)
>OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
> -0.1% (  -5% -5%)
>   AndHighMed   52.23  (4.1%)   52.41  (5.3%)
> 0.3% (  -8% -   10%)
>  MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
> 0.4% (  -6% -8%)
> HighTerm .10  (3.4%) 1116.70  (4.0%)
> 0.5% (  -6% -8%)
>HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
> 1.0% ( -15% -   20%)
>  AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
> 1.2% (  -8% -   12%)
>HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)
> 1.6% ( -19% -   28%)
>  LowTerm 1338.94  (3.3%) 1383.90  (5.5%)
> 3.4% (  -5% -   12%)
> PKLookup  120.45  (2.5%)  130.91  (3.5%)
> 8.7% (   2% -   15%)
> {noformat}
> h4. FST perf tests
> I ran LookupBenchmarkTest to see the impact on the 

[jira] [Commented] (LUCENE-8863) Improve handling of edge cases in Kuromoji's DictionaryBuilder

2019-06-18 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866918#comment-16866918
 ] 

Mike Sokolov commented on LUCENE-8863:
--

I'll push this in a couple of days if there are no other concerns. @mocobeta 
I think the linked PR is already taking a step towards LUCENE-8816 since it 
allows loading an external system dictionary. Not sure if you saw it, but if 
you have a moment maybe you could check if it is along the lines you were 
planning?

> Improve handling of edge cases in Kuromoji's DictionaryBuilder
> --
>
> Key: LUCENE-8863
> URL: https://issues.apache.org/jira/browse/LUCENE-8863
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but the assertions aren't 
> enabled by default, so an unsuspecting new user doesn't get any benefit from 
> them, so we should upgrade to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8866) Remove ICU dependency of kuromoji tools/test-tools

2019-06-18 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866916#comment-16866916
 ] 

Mike Sokolov commented on LUCENE-8866:
--

+1. If people have more precise normalization requirements, they can encode them 
in their dictionary – I think we can presume this is not noisy user data, and 
that it should already have been cleaned.

> Remove ICU dependency of kuromoji tools/test-tools
> --
>
> Key: LUCENE-8866
> URL: https://issues.apache.org/jira/browse/LUCENE-8866
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-8866.patch
>
>
> The tooling stuff has an off-by-default option to normalize entries, 
> currently using the ICU api.
> But I think since it's off-by-default, and just doing NFKC normalization at 
> dictionary-build-time, it's a better tradeoff to use the JDK here?
> I would rather remove the ICU dependency for the tooling and look at 
> simplifying the build to have less modules (e.g. investigate moving the 
> tooling and tests into src/java and src/tools, so that [~msoko...@gmail.com] 
> new tests in LUCENE-8863 are running by default, dictionary tool is shipped 
> as a commandline tool in the JAR, etc)
> "ant regenerate" should be enough to prevent any chicken-and-eggs in the 
> dictionary construction code, so I don't think we need separate modules to 
> enforce it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8863) Improve handling of edge cases in Kuromoji's DictionaryBuilder

2019-06-17 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865906#comment-16865906
 ] 

Mike Sokolov edited comment on LUCENE-8863 at 6/17/19 7:10 PM:
---

OK, I will check for empty base form and raise an exception, not allow it. 
People can pass * if they want to have no base form. I think it is valid for 
any single one of the several POS fields in the input to be empty. We currently 
join them together with "-" separators unless they are empty, in which case we 
ignore them. I guess we could check if *none* of the POS fields have a 
non-empty value and throw an error in that case.


was (Author: sokolov):
OK, I will check for empty base form and raise an exception, not allow it. 
People can pass '{{*}}' if they want to have no base form. I think it is valid 
for any single one of the several POS fields in the input to be empty. We 
currently join them together with "-" separators unless they are empty, in 
which case we ignore them. I guess we could check if *none* of the POS fields 
have a non-empty value and throw an error in that case.

> Improve handling of edge cases in Kuromoji's DictionaryBuilder
> --
>
> Key: LUCENE-8863
> URL: https://issues.apache.org/jira/browse/LUCENE-8863
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but the assertions aren't 
> enabled by default, so an unsuspecting new user doesn't get any benefit from 
> them, so we should upgrade to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8863) Improve handling of edge cases in Kuromoji's DictionaryBuilder

2019-06-17 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865906#comment-16865906
 ] 

Mike Sokolov edited comment on LUCENE-8863 at 6/17/19 7:08 PM:
---

OK, I will check for empty base form and raise an exception, not allow it. 
People can pass '{{*}}' if they want to have no base form. I think it is valid 
for any single one of the several POS fields in the input to be empty. We 
currently join them together with "-" separators unless they are empty, in 
which case we ignore them. I guess we could check if *none* of the POS fields 
have a non-empty value and throw an error in that case.


was (Author: sokolov):
OK, I will check for empty base form and raise an exception, not allow it. 
People can pass '*' if they want to have no base form. I think it is valid for 
any single one of the several POS fields in the input to be empty. We currently 
join them together with "-" separators unless they are empty, in which case we 
ignore them. I guess we could check if *none* of the POS fields have a 
non-empty value and throw an error in that case.

> Improve handling of edge cases in Kuromoji's DictionaryBuilder
> --
>
> Key: LUCENE-8863
> URL: https://issues.apache.org/jira/browse/LUCENE-8863
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but the assertions aren't 
> enabled by default, so an unsuspecting new user doesn't get any benefit from 
> them, so we should upgrade to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8863) Improve handling of edge cases in Kuromoji's DictionaryBuilder

2019-06-17 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865906#comment-16865906
 ] 

Mike Sokolov commented on LUCENE-8863:
--

OK, I will check for empty base form and raise an exception, not allow it. 
People can pass '*' if they want to have no base form. I think it is valid for 
any single one of the several POS fields in the input to be empty. We currently 
join them together with "-" separators unless they are empty, in which case we 
ignore them. I guess we could check if *none* of the POS fields have a 
non-empty value and throw an error in that case.

> Improve handling of edge cases in Kuromoji's DictionaryBuilder
> --
>
> Key: LUCENE-8863
> URL: https://issues.apache.org/jira/browse/LUCENE-8863
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but the assertions aren't 
> enabled by default, so an unsuspecting new user doesn't get any benefit from 
> them, so we should upgrade to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8863) Improve handling of edge cases in Kuromoji's DictionaryBuilder

2019-06-15 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864841#comment-16864841
 ] 

Mike Sokolov edited comment on LUCENE-8863 at 6/15/19 7:56 PM:
---

{quote}Can we just throw an exception on empty base form? It sounds like a 
missing check in the code. I don't think its good to try to support N different 
ways of doing things when only one is tested (the ipadic way)
{quote}
I agree with the sentiment - be strict, and keep it simple. One thing is we 
already handle empty POS fields by ignoring. EG in the section where it says 
"build up the POS string" we concatenate various POS tokens with "-" as a 
separator, unless they are empty, and then we don't add adjacent separator 
chars. The other thing is – I don't know what dictionaries may already exist? 
Is there an externally-defined standard we should accept? I can certainly 
modify the dictionary I have to have "*," but what about Unidic or dictionaries 
people might get from Sudachi or other neologd providers? If there is some 
common usage that expects empty strings, I think we should support it, and it 
really is kind of natural to express a missing value with an empty string? Are 
there people here who have looked at those?

 

[Here's a link to a preliminary patch|https://github.com/apache/lucene-solr/pull/722] (no tests yet)


was (Author: sokolov):
{quote}Can we just throw an exception on empty base form? It sounds like a 
missing check in the code. I don't think its good to try to support N different 
ways of doing things when only one is tested (the ipadic way)
{quote}
I agree with the sentiment - be strict, and keep it simple. One thing is we 
already handle empty POS fields by ignoring. EG in the section where it says 
"build up the POS string" we concatenate various POS tokens with "-" as a 
separator, unless they are empty, and then we don't add adjacent separator 
chars. The other thing is – I don't know what dictionaries may already exist? 
Is there an externally-defined standard we should accept? I can certainly 
modify the dictionary I have to have "*," but what about Unidic or dictionaries 
people might get from Sudachi or other neologd providers? If there is some 
common usage that expects empty strings, I think we should support it, and it 
really is kind of natural to express a missing value with an empty string? Are 
there people here who have looked at those?

> Improve handling of edge cases in Kuromoji's DictionaryBuilder
> --
>
> Key: LUCENE-8863
> URL: https://issues.apache.org/jira/browse/LUCENE-8863
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but the assertions aren't 
> enabled by default, so an unsuspecting new user doesn't get any benefit from 
> them, so we should upgrade to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8863) Improve handling of edge cases in Kuromoji's DictionaryBuilder

2019-06-15 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864841#comment-16864841
 ] 

Mike Sokolov commented on LUCENE-8863:
--

{quote}Can we just throw an exception on empty base form? It sounds like a 
missing check in the code. I don't think its good to try to support N different 
ways of doing things when only one is tested (the ipadic way)
{quote}
I agree with the sentiment - be strict, and keep it simple. One thing is we 
already handle empty POS fields by ignoring. EG in the section where it says 
"build up the POS string" we concatenate various POS tokens with "-" as a 
separator, unless they are empty, and then we don't add adjacent separator 
chars. The other thing is – I don't know what dictionaries may already exist? 
Is there an externally-defined standard we should accept? I can certainly 
modify the dictionary I have to have "*," but what about Unidic or dictionaries 
people might get from Sudachi or other neologd providers? If there is some 
common usage that expects empty strings, I think we should support it, and it 
really is kind of natural to express a missing value with an empty string? Are 
there people here who have looked at those?

> Improve handling of edge cases in Kuromoji's DictionaryBuilder
> --
>
> Key: LUCENE-8863
> URL: https://issues.apache.org/jira/browse/LUCENE-8863
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but the assertions aren't 
> enabled by default, so an unsuspecting new user doesn't get any benefit from 
> them, so we should upgrade to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8863) Improve handling of edge cases in Kuromoji's DictionaryBuilder

2019-06-15 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864701#comment-16864701
 ] 

Mike Sokolov commented on LUCENE-8863:
--

I'll submit a patch soon. My initial idea was to maintain LUCENE-8816 as an 
overall tracking/planning issue and then have smaller issues, one for each 
patch. I'm used to a style of one issue tracking a fairly small amount of work; 
perhaps one or two self-contained patches.

> Improve handling of edge cases in Kuromoji's DictionaryBuilder
> --
>
> Key: LUCENE-8863
> URL: https://issues.apache.org/jira/browse/LUCENE-8863
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but the assertions aren't 
> enabled by default, so an unsuspecting new user doesn't get any benefit from 
> them, so we should upgrade to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-15 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864672#comment-16864672
 ] 

Mike Sokolov commented on LUCENE-8816:
--

I opened LUCENE-8863 to cover some small, but blocking, issues I uncovered 
while loading a custom dictionary.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries have risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dived into the code yet, so have no idea whether it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8863) Improve handling of edge cases in Kuromoji's DictionaryBuilder

2019-06-15 Thread Mike Sokolov (JIRA)
Mike Sokolov created LUCENE-8863:


 Summary: Improve handling of edge cases in Kuromoji's 
DictionaryBuilder
 Key: LUCENE-8863
 URL: https://issues.apache.org/jira/browse/LUCENE-8863
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Mike Sokolov
Assignee: Mike Sokolov


While building a custom Kuromoji system dictionary, I discovered a few issues.

First, the dictionary encoding has room for 13-bit (left and right) ids, but 
really only supports 12 bits since this was all that was needed for the IPADIC 
dictionary that ships with Kuromoji. The good news is we can easily add support 
by fixing the bit-twiddling math.
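The kind of 13-bit packing at issue can be sketched like this (illustrative only; the field layout and names are assumptions, not the real Kuromoji encoding):

```java
// Illustrative sketch: packing two 13-bit connection ids into one int,
// with explicit range checks so an oversized id fails loudly instead of
// being silently truncated by the bit math.
class IdPacker {
  private static final int ID_BITS = 13;
  private static final int MAX_ID = (1 << ID_BITS) - 1; // 8191

  public static int pack(int leftId, int rightId) {
    if (leftId < 0 || leftId > MAX_ID || rightId < 0 || rightId > MAX_ID) {
      throw new IllegalArgumentException(
          "id out of 13-bit range: " + leftId + "," + rightId);
    }
    return (leftId << ID_BITS) | rightId;
  }

  public static int leftId(int packed)  { return packed >>> ID_BITS; }
  public static int rightId(int packed) { return packed & MAX_ID; }
}
```

With 13-bit fields the maximum id is 8191; a mask or shift that only accounts for 12 bits would corrupt any id above 4095, which is the class of bug described here.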

Second, the dictionary builder has a number of assertions that help uncover 
problems in the input (like these overlarge ids), but the assertions aren't 
enabled by default, so an unsuspecting new user doesn't get any benefit from 
them, so we should upgrade to "real" exceptions.

Finally, we want to handle the case of empty base forms differently. Kuromoji 
does stemming by substituting a base form for a word when there is a base form 
in the dictionary. Missing base forms are expected to be supplied as {{*}}, but 
if a dictionary provides an empty string base form, we would end up stripping 
that token completely. Since there is no possible meaning for an empty base 
form (and the dictionary builder already treats {{*}} and empty strings as 
equivalent in a number of other cases), I think we should simply ignore empty 
base forms (rather than replacing words with empty strings when tokenizing!)
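The proposed base-form rule can be sketched as follows (a hypothetical helper; the real builder and tokenizer code paths differ):

```java
// Hedged sketch of the rule described above: "*" means "no base form",
// and an empty string is treated the same way, so the surface form is
// kept rather than the token being replaced by an empty string.
class BaseFormRule {
  public static String resolve(String surface, String baseForm) {
    if (baseForm == null || baseForm.isEmpty() || "*".equals(baseForm)) {
      return surface; // no stemming information: keep the word as-is
    }
    return baseForm;  // stem to the dictionary-supplied base form
  }
}
```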



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8781) Explore FST direct array arc encoding

2019-06-15 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864668#comment-16864668
 ] 

Mike Sokolov commented on LUCENE-8781:
--

Thanks for testing, [~dsmiley], you definitely found a bug. The issue is that 
when adding the array-with-gaps encoding, I did not update the 
{{org.apache.lucene.util.fst.Util}} class, specifically its {{readCeilArc}}. It's only 
used in a couple of places, and I just overlooked it. I'm posting a patch that 
fixes this issue [here|https://github.com/apache/lucene-solr/pull/721].

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't included here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison showed no 
> significant change of the baseline (no opto applied) vs the current master 
> FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvement in 
> some of the queries.
> {noformat}
> TaskQPS base  StdDevQPS opto  StdDev  
>   Pct diff
>   OrHighHigh6.93  (3.0%)6.89  (3.1%)   
> -0.5% (  -6% -5%)
>OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
> -0.5% (  -7% -7%)
> Wildcard8.72  (4.7%)8.69  (4.6%)   
> -0.4% (  -9% -9%)
>   AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
> -0.2% (  -5% -5%)
>OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
> -0.1% (  -5% -5%)
>   AndHighMed   52.23  (4.1%)   52.41  (5.3%)
> 0.3% (  -8% -   10%)
>  MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
> 0.4% (  -6% -8%)
> HighTerm .10  (3.4%) 1116.70  (4.0%)
> 0.5% (  -6% -8%)
>HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
> 1.0% ( -15% -   20%)
>  AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
> 1.2% (  -8% -   12%)
>HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)   

[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-11 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861622#comment-16861622
 ] 

Mike Sokolov commented on LUCENE-8816:
--

Thanks Robert, yeah I understand this was built for a single dictionary, not a 
general-purpose tool, and hardening is required to enable wider usage.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is rather old and has not been maintained for 
> many years. While it has slowly become obsolete, well-maintained and/or 
> extended dictionaries have risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). Several attempts/projects/efforts 
> have been made in Japan to use them with Kuromoji.
> However, the current architecture - a jar with the dictionary bundled in - is 
> essentially incompatible with the idea of switching the system dictionary, 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel it's a good time to re-think the current architecture.
> This would also be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>  * Provide a learner or language model (that's up to users and should be 
> outside the scope).
> I have not dived into the code yet, so I have no idea whether it's easy or 
> difficult at this moment.






[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-11 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861609#comment-16861609
 ] 

Mike Sokolov commented on LUCENE-8816:
--

I see that in {{BinaryDictionaryWriter}} we restrict incoming leftID (and 
rightID) to be < 4096 because we are going to pack into a 16-bit short with 3 
flag bits. However it seems we have room for one more bit (since 2^(16-3) == 
8192). Am I missing something? Do we use that other bit somewhere? I see eg 
that in {{BinaryDictionary}} when we decode, we >>> 3 to get back the ids, so I 
think it should be OK to allow ids up to 8191. [~rcmuir], do you know why it is 
currently limited to 4096? Also, I think it would make sense to change the 
asserts there to IllegalArgumentExceptions so they are raised whenever the 
tool is run, since we would get garbage if this limit is exceeded and (I 
think) nothing else will catch it.
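The packing being discussed can be sketched like this. It is a toy illustration, not {{BinaryDictionaryWriter}}'s actual code: an ID and 3 flag bits share one 16-bit short, so IDs up to 2^13 - 1 = 8191 fit, and an IllegalArgumentException replaces the assert so the range check also fires when assertions are disabled.

```java
public class ConnectionIdPacking {
    // 16-bit short = 13 bits of id + 3 flag bits, so ids up to 8191 fit.
    static final int MAX_ID = (1 << 13) - 1; // 8191

    static short pack(int id, int flags) {
        // IllegalArgumentException instead of assert: fails even without -ea
        if (id < 0 || id > MAX_ID) {
            throw new IllegalArgumentException("id out of range: " + id);
        }
        return (short) ((id << 3) | (flags & 0b111));
    }

    static int unpackId(short packed) {
        // mask to treat the short as unsigned before shifting out the flags
        return (packed & 0xFFFF) >>> 3;
    }

    public static void main(String[] args) {
        short packed = pack(8191, 0b101);
        System.out.println(unpackId(packed)); // round-trips the maximum id
        try {
            pack(8192, 0); // one past the limit
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```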

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is rather old and has not been maintained for 
> many years. While it has slowly become obsolete, well-maintained and/or 
> extended dictionaries have risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). Several attempts/projects/efforts 
> have been made in Japan to use them with Kuromoji.
> However, the current architecture - a jar with the dictionary bundled in - is 
> essentially incompatible with the idea of switching the system dictionary, 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel it's a good time to re-think the current architecture.
> This would also be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and the encoded system dictionary.
>  * Implement a dynamic dictionary load mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>  * Provide a learner or language model (that's up to users and should be 
> outside the scope).
> I have not dived into the code yet, so I have no idea whether it's easy or 
> difficult at this moment.






[jira] [Commented] (LUCENE-8791) Add CollectorRescorer

2019-06-10 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860382#comment-16860382
 ] 

Mike Sokolov commented on LUCENE-8791:
--

bq. We distribute total number of results we are looking from matching across 
segments evenly plus some static number for overhead

I think this is the same pro-rated idea from LUCENE-8681; when the documents 
are randomly distributed among segments, the prediction can be quite accurate. 
In the case of a time-series index, though (or any index where the 
distribution among segments is correlated with the rank), this approach to 
early termination is not directly applicable.
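The pro-rating idea can be sketched as follows. The helper is hypothetical (the actual patch distributes counts differently in detail): give each segment a share of the requested top-N proportional to its size, plus a static overhead. This is accurate when matches are uniformly distributed, but not when rank correlates with segment order, as in a time-series index.

```java
import java.util.Arrays;

public class ProRatedCollection {
    // Hypothetical sketch: per-segment hit budgets proportional to segment
    // size, plus a static overhead to absorb deviations from uniformity.
    static int[] perSegmentBudget(int topN, int[] segmentSizes, int overhead) {
        long totalDocs = 0;
        for (int size : segmentSizes) {
            totalDocs += size;
        }
        int[] budgets = new int[segmentSizes.length];
        for (int i = 0; i < segmentSizes.length; i++) {
            // ceil so rounding never starves a segment of its fair share
            budgets[i] = (int) Math.ceil((double) topN * segmentSizes[i] / totalDocs) + overhead;
        }
        return budgets;
    }

    public static void main(String[] args) {
        int[] budgets = perSegmentBudget(100, new int[] {500_000, 300_000, 200_000}, 10);
        System.out.println(Arrays.toString(budgets)); // [60, 40, 30]
    }
}
```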

> Add CollectorRescorer
> -
>
> Key: LUCENE-8791
> URL: https://issues.apache.org/jira/browse/LUCENE-8791
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Elbek Kamoliddinov
>Priority: Major
> Attachments: LUCENE-8791.patch, LUCENE-8791.patch, LUCENE-8791.patch, 
> LUCENE-8791.patch, LUCENE-8791.patch
>
>
> This is another implementation of query rescorer api (LUCENE-5489). It adds 
> rescoring functionality based on provided CollectorManager. 
>  






[jira] [Commented] (LUCENE-8781) Explore FST direct array arc encoding

2019-06-08 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859227#comment-16859227
 ] 

Mike Sokolov commented on LUCENE-8781:
--

Got it, thanks. Yeah this was a tiny change, doesn't seem worth all the 
ceremony; I plan to just push after precommit, no PR, so why open an issue?

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't included here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison showed no 
> significant change of the baseline (no opto applied) vs the current master 
> FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvement in 
> some of the queries.
> {noformat}
> TaskQPS base  StdDevQPS opto  StdDev  
>   Pct diff
>   OrHighHigh6.93  (3.0%)6.89  (3.1%)   
> -0.5% (  -6% -5%)
>OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
> -0.5% (  -7% -7%)
> Wildcard8.72  (4.7%)8.69  (4.6%)   
> -0.4% (  -9% -9%)
>   AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
> -0.2% (  -5% -5%)
>OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
> -0.1% (  -5% -5%)
>   AndHighMed   52.23  (4.1%)   52.41  (5.3%)
> 0.3% (  -8% -   10%)
>  MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
> 0.4% (  -6% -8%)
> HighTerm .10  (3.4%) 1116.70  (4.0%)
> 0.5% (  -6% -8%)
>HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
> 1.0% ( -15% -   20%)
>  AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
> 1.2% (  -8% -   12%)
>HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)
> 1.6% ( -19% -   28%)
>  LowTerm 1338.94  (3.3%) 1383.90  (5.5%)
> 3.4% (  -5% -   12%)
> PKLookup  120.45  (2.5%)  130.91  (3.5%)
> 8.7% (   2% -  

[jira] [Updated] (LUCENE-8844) Bump FST Version (to 7)

2019-06-08 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8844:
-
Summary: Bump FST Version (to 7)  (was: Bump FST Version)

> Bump FST Version (to 7)
> ---
>
> Key: LUCENE-8844
> URL: https://issues.apache.org/jira/browse/LUCENE-8844
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
>
> In LUCENE-8781, we changed the FST encoding but did not bump the version 
> number we write in its header. The change was backwards-compatible (new 
> readers can still read old FSTs), but not forwards-compatible: older readers 
> would exhibit strange behavior if they attempted to read one of these newer 
> format FSTs.
> It would be much better if readers could catch such errors and notify in a 
> sensible way, and we have version checking that does that; we just need to 
> increase the VERSION_CURRENT constant.
> Also, ~[~dsmiley] points out the CHANGES.txt entries for LUCENE-8781 should 
> be moved to the 8.2.0 section since that change was backported. I think we 
> can clean that up at the same time since it's version-related.






[jira] [Updated] (LUCENE-8844) Bump FST Version

2019-06-08 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8844:
-
Description: 
In LUCENE-8781, we changed the FST encoding but did not bump the version number 
we write in its header. The change was backwards-compatible (new readers can 
still read old FSTs), but not forwards-compatible: older readers would exhibit 
strange behavior if they attempted to read one of these newer format FSTs.

It would be much better if readers could catch such errors and notify in a 
sensible way, and we have version checking that does that; we just need to 
increase the VERSION_CURRENT constant.

Also, ~[~dsmiley] points out the CHANGES.txt entries for LUCENE-8781 should be 
moved to the 8.2.0 section since that change was backported. I think we can 
clean that up at the same time since it's version-related.

  was:
In LUCENE-8781, we changed the FST encoding but not bump the version number we 
write in its header. The change was backwards-compatible (new readers can still 
read old FSTs), but not forwards-compatible: older readers would exhibit 
strange behavior if they attempted to read one of these newer format FSTs.

It would be much better if readers could catch such errors and notify in a 
sensible way, and we have version checking that does that; we just need to 
increase the VERSION_CURRENT constant.

Also, ~[~dsmiley] points out the CHANGES.txt entries for LUCENE-8781 should be 
moved to the 8.2.0 section since that change was backported. I think we can 
clean that up at the same time since it's version-related.


> Bump FST Version
> 
>
> Key: LUCENE-8844
> URL: https://issues.apache.org/jira/browse/LUCENE-8844
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
>
> In LUCENE-8781, we changed the FST encoding but did not bump the version 
> number we write in its header. The change was backwards-compatible (new 
> readers can still read old FSTs), but not forwards-compatible: older readers 
> would exhibit strange behavior if they attempted to read one of these newer 
> format FSTs.
> It would be much better if readers could catch such errors and notify in a 
> sensible way, and we have version checking that does that; we just need to 
> increase the VERSION_CURRENT constant.
> Also, ~[~dsmiley] points out the CHANGES.txt entries for LUCENE-8781 should 
> be moved to the 8.2.0 section since that change was backported. I think we 
> can clean that up at the same time since it's version-related.






[jira] [Assigned] (LUCENE-8844) Bump FST Version

2019-06-08 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov reassigned LUCENE-8844:


Assignee: Mike Sokolov

> Bump FST Version
> 
>
> Key: LUCENE-8844
> URL: https://issues.apache.org/jira/browse/LUCENE-8844
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Mike Sokolov
>Assignee: Mike Sokolov
>Priority: Major
>
> In LUCENE-8781, we changed the FST encoding but not bump the version number 
> we write in its header. The change was backwards-compatible (new readers can 
> still read old FSTs), but not forwards-compatible: older readers would 
> exhibit strange behavior if they attempted to read one of these newer format 
> FSTs.
> It would be much better if readers could catch such errors and notify in a 
> sensible way, and we have version checking that does that; we just need to 
> increase the VERSION_CURRENT constant.
> Also, ~[~dsmiley] points out the CHANGES.txt entries for LUCENE-8781 should 
> be moved to the 8.2.0 section since that change was backported. I think we 
> can clean that up at the same time since it's version-related.






[jira] [Created] (LUCENE-8844) Bump FST Version

2019-06-08 Thread Mike Sokolov (JIRA)
Mike Sokolov created LUCENE-8844:


 Summary: Bump FST Version
 Key: LUCENE-8844
 URL: https://issues.apache.org/jira/browse/LUCENE-8844
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Mike Sokolov


In LUCENE-8781, we changed the FST encoding but not bump the version number we 
write in its header. The change was backwards-compatible (new readers can still 
read old FSTs), but not forwards-compatible: older readers would exhibit 
strange behavior if they attempted to read one of these newer format FSTs.

It would be much better if readers could catch such errors and notify in a 
sensible way, and we have version checking that does that; we just need to 
increase the VERSION_CURRENT constant.

Also, ~[~dsmiley] points out the CHANGES.txt entries for LUCENE-8781 should be 
moved to the 8.2.0 section since that change was backported. I think we can 
clean that up at the same time since it's version-related.






[jira] [Commented] (LUCENE-8781) Explore FST direct array arc encoding

2019-06-08 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859220#comment-16859220
 ] 

Mike Sokolov commented on LUCENE-8781:
--

OK, I see we write a version header and then check it for compatibility when 
reading. I think in the spirit of full disclosure, failing early, no surprises, 
etc, we should increase the version. That way older readers will know to fail 
fast when they come across a newer-version FST. I'll open a new issue for this 
and for fixing the CHANGES.txt.
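The fail-fast behavior being proposed can be sketched as follows. The constants and method are illustrative, not the actual FST code: a reader checks the header version against its supported range and throws immediately, rather than mis-parsing bytes written in a newer format.

```java
public class FstVersionCheck {
    // Illustrative constants: an older reader's supported version range.
    static final int VERSION_MIN = 5;
    static final int VERSION_CURRENT = 6;

    static int checkVersion(int headerVersion) {
        // reject anything outside [min, current] instead of reading garbage
        if (headerVersion < VERSION_MIN || headerVersion > VERSION_CURRENT) {
            throw new IllegalStateException(
                "unsupported FST version: " + headerVersion
                + " (supported: " + VERSION_MIN + ".." + VERSION_CURRENT + ")");
        }
        return headerVersion;
    }

    public static void main(String[] args) {
        System.out.println(checkVersion(6)); // within range: accepted
        try {
            // what an old reader sees once VERSION_CURRENT has been bumped
            checkVersion(7);
        } catch (IllegalStateException e) {
            System.out.println("fail fast: " + e.getMessage());
        }
    }
}
```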


> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't included here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison of showed no 
> significant change of the baseline (no opto applied) vs the current master 
> FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvement in 
> some of the queries.
> {noformat}
> TaskQPS base  StdDevQPS opto  StdDev  
>   Pct diff
>   OrHighHigh6.93  (3.0%)6.89  (3.1%)   
> -0.5% (  -6% -5%)
>OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
> -0.5% (  -7% -7%)
> Wildcard8.72  (4.7%)8.69  (4.6%)   
> -0.4% (  -9% -9%)
>   AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
> -0.2% (  -5% -5%)
>OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
> -0.1% (  -5% -5%)
>   AndHighMed   52.23  (4.1%)   52.41  (5.3%)
> 0.3% (  -8% -   10%)
>  MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
> 0.4% (  -6% -8%)
> HighTerm .10  (3.4%) 1116.70  (4.0%)
> 0.5% (  -6% -8%)
>HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
> 1.0% ( -15% -   20%)
>  AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
> 1.2% (  -8% -   12%)
>HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)
> 1.6% ( -19% -   28%)

[jira] [Commented] (LUCENE-8781) Explore FST direct array arc encoding

2019-06-06 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858079#comment-16858079
 ] 

Mike Sokolov commented on LUCENE-8781:
--

I think I did not understand how to edit CHANGES.txt correctly. I can 
address that, sure.

With this change you read "old" indexes and write "new" indexes. It is true 
that once you upgrade and write a "new" index, you can no longer read it with 
"old" code. So e.g. an index written with 8.2.0 could not be read by 8.1.0, but 
vice-versa is fine. I think that is backwards-compatible but not 
forwards-compatible.

I did not enable it everywhere since I did not test it everywhere. There were 
some cases where I saw substantial size increases, but no performance 
improvement; e.g. see AnalyzingSuggester above. But as you say, at the 4x 
setting those did not grow much, so perhaps it would be safe to enable 
unconditionally. I'd like to test e.g. Kuromoji and Nori to see if it helps 
there. I'm a little concerned about those, since the packing of those chars is 
likely to be much sparser than ASCII or even UTF-8 Latin chars. I don't know; 
maybe those character sets are in dense-enough blocks that it will help.

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while
> implementing which are more difficult to chase down when uncovered by the 
> randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't included here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison showed no 
> significant change of the baseline (no opto applied) vs the current master 
> FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used;
> namely when the number of outgoing arcs was > load * (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvement in 
> some of the queries.
> {noformat}
> TaskQPS base  StdDevQPS opto  StdDev  
>   Pct diff
>   OrHighHigh6.93  (3.0%)6.89  (3.1%)   
> -0.5% (  -6% -5%)
>OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
> -0.5% (  -7% -7%)
> Wildcard8.72  (4.7%)8.69  (4.6%)   
> -0.4% (  -9% -9%)
>   AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
> -0.2% (  -5% -5%)
>OrHighLow  241.41 

[jira] [Comment Edited] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849731#comment-16849731
 ] 

Mike Sokolov edited comment on LUCENE-8816 at 5/28/19 1:41 PM:
---

What if we changed the various dictionary classes to load-on-demand from a 
configurable classpath-directory, rather than from a single built-in one? If we 
do that, then users can supply a separate jar containing only the model files, 
and reference it when initializing the JapaneseAnalyzer.

Also - about testing: I think we can test using the built-in dictionary. Do we 
need to unit test dictionaries that we don't provide? Or are you anticipating 
providing multiple dictionaries as part of the Lucene distro itself? I think 
both have merit: exposing the ability to bring your own dictionary, and 
providing better dictionaries.
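The load-on-demand idea above can be sketched as follows. This is a minimal illustration, not Kuromoji's actual loading code: the class name and resource path are hypothetical, and it only shows the ClassLoader-based resolution that would let a separate dictionary jar supply the model files.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: resolve dictionary resources through a caller-supplied
// ClassLoader instead of a single built-in location, so users can ship the
// model files in their own jar. Names and paths here are illustrative only.
public class DictionaryResourceLoader {

    // Open a dictionary resource from whatever classpath the caller supplies,
    // e.g. the context ClassLoader of an application bundling a dictionary jar.
    public static InputStream open(ClassLoader loader, String resourcePath) {
        InputStream in = loader.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IllegalArgumentException(
                "dictionary resource not found on classpath: " + resourcePath);
        }
        return in;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        // Hypothetical resource name; no such jar is on this classpath.
        try (InputStream in = open(loader, "com/example/dict/TokenInfoDictionary$buffer.dat")) {
            System.out.println("dictionary found");
        } catch (IllegalArgumentException e) {
            System.out.println("no custom dictionary: falling back to built-in");
        }
    }
}
```

An analyzer constructor could accept such a loader and fall back to the bundled dictionary when the lookup fails, which keeps the default behavior unchanged.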


was (Author: sokolov):
What if we changed the various dictionary classes to load-on-demand from a 
configurable classpath-directory, rather than from a single built-in one? If we 
do that, then users can supply a separate jar containing only the model files, 
and reference it when initializing the JapaneseAnalyzer. 

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I was inspired by this mailing-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is quite old and has not been maintained for many 
> years. While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries have risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, several 
> attempts/projects/efforts have been made in Japan.
> However, the current architecture - a dictionary bundled into the jar - is 
> essentially incompatible with the idea of switching the system dictionary, 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is actually a 
> natural idea, and I feel that it's a good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dived into the code yet, so I have no idea whether it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849731#comment-16849731
 ] 

Mike Sokolov commented on LUCENE-8816:
--

What if we changed the various dictionary classes to load-on-demand from a 
configurable classpath-directory, rather than from a single built-in one? If we 
do that, then users can supply a separate jar containing only the model files, 
and reference it when initializing the JapaneseAnalyzer. 

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-8781) Explore FST direct array arc encoding

2019-05-27 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov resolved LUCENE-8781.
--
Resolution: Fixed

Pushed to 8.x (and 7.x, although it seems there will be no future 7.x releases)

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> This issue is for exploring an alternate FST encoding of Arcs as full-sized 
> arrays so Arcs are addressed directly by label, avoiding binary search that 
> we use today for arrays of Arcs. PR: 
> https://github.com/apache/lucene-solr/pull/657
> h3. Testing
> ant test passes. I added some unit tests that were helpful in uncovering bugs 
> while implementing, which are more difficult to chase down when uncovered by 
> the randomized testing we already do. They don't really test anything new; 
> they're just more focused.
> I'm not sure why, but ant precommit failed for me with:
> {noformat}
>  ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls 
> failed while scanning class 
> 'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
> (SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
> info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
> referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
> {noformat}
> I also got Test2BFST running (it was originally timing out due to excessive 
> calls to ramBytesUsage(), which seems to have gotten slow), and it passed; 
> that change isn't included here.
> h4. Micro-benchmark
> I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
> conditions. 
> h5. English words
> A test of looking up existing words in a dictionary of ~17 English words 
> shows improvements; the numbers listed are % change in FST size, time to look 
> up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
> strings that are not in the dict. The comparison is against the current 
> codebase with the optimization disabled. A separate comparison showed no 
> significant change between the baseline (no opto applied) and the current 
> master FST impl with no code changes applied.
> ||  load=2||   load=4 ||  load=16 ||
> | +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |
> The "load factor" used for those measurements controls when direct array arc 
> encoding is used; namely when load * (number of outgoing arcs) exceeded 
> (max label - min label).
> h5. sequential and random terms
> The same test, with terms being a sequence of integers as strings shows a 
> larger improvement, around 20% (load=4). This is presumably the best case for 
> this delta, where every Arc is encoded as a direct lookup.
> When random lowercase ASCII strings are used, a smaller improvement of around 
> 4% is seen.
> h4. luceneutil
> Testing w/luceneutil (wikimediumall) we see improvements mostly in the 
> PKLookup case. Other results seem noisy, with perhaps a small improvement in 
> some of the queries.
> {noformat}
>                     Task    QPS base  StdDev    QPS opto  StdDev      Pct diff
>               OrHighHigh        6.93  (3.0%)       6.89  (3.1%)   -0.5% (  -6% -    5%)
>                OrHighMed       45.15  (3.9%)      44.92  (3.5%)   -0.5% (  -7% -    7%)
>                 Wildcard        8.72  (4.7%)       8.69  (4.6%)   -0.4% (  -9% -    9%)
>               AndHighLow      274.11  (2.6%)     273.58  (3.1%)   -0.2% (  -5% -    5%)
>                OrHighLow      241.41  (1.9%)     241.11  (3.5%)   -0.1% (  -5% -    5%)
>               AndHighMed       52.23  (4.1%)      52.41  (5.3%)    0.3% (  -8% -   10%)
>                  MedTerm     1026.24  (3.1%)    1030.52  (4.3%)    0.4% (  -6% -    8%)
>                 HighTerm         .10  (3.4%)    1116.70  (4.0%)    0.5% (  -6% -    8%)
>    HighTermDayOfYearSort       14.59  (8.2%)      14.73  (9.3%)    1.0% ( -15% -   20%)
>              AndHighHigh       13.45  (6.2%)      13.61  (4.4%)    1.2% (  -8% -   12%)
>        HighTermMonthSort       63.09 (12.5%)      64.13 (10.9%)    1.6% ( -19% -   28%)
>                  LowTerm     1338.94  (3.3%)    1383.90  (5.5%)    3.4% (  -5% -   12%)
>                 PKLookup      120.45  (2.5%)     130.91  (3.5%)    8.7% (   2% -   15%)
> {noformat}
> h4. FST perf tests
> I ran LookupBenchmarkTest to see the impact 
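The direct-addressing decision described in the issue can be sketched as a one-line heuristic. This is an illustration under the stated load-factor rule, not Lucene's actual FST internals; the class and method names are made up for this sketch.

```java
// Sketch of the arc-encoding decision: use a direct-addressed array when the
// outgoing labels are dense enough relative to the arc count, otherwise fall
// back to an arc array searched by binary search. Illustrative names only.
public class ArcEncodingHeuristic {

    // Direct addressing pays off when load * numArcs exceeds the label range,
    // i.e. most slots in a label-indexed array would be filled.
    public static boolean useDirectArray(int numArcs, int minLabel, int maxLabel, int load) {
        return (long) load * numArcs > (maxLabel - minLabel);
    }

    public static void main(String[] args) {
        // Dense node: 5 arcs for labels 'a'..'e' (range 4) qualifies at load=2.
        System.out.println(useDirectArray(5, 'a', 'e', 2));  // true
        // Sparse node: 3 arcs spread over the full byte range does not.
        System.out.println(useDirectArray(3, 0, 255, 2));    // false
    }
}
```

A higher load factor makes the condition easier to satisfy, which matches the measurements above: larger FSTs (more nodes direct-addressed) in exchange for faster lookups.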

[jira] [Updated] (LUCENE-8781) Explore FST direct array arc encoding

2019-05-26 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8781:
-
Fix Version/s: (was: 8.x)
   8.2

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>

[jira] [Updated] (LUCENE-8781) Explore FST direct array arc encoding

2019-05-26 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8781:
-
Fix Version/s: 8.x

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: 8.x, master (9.0)
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>

[jira] [Reopened] (LUCENE-8781) Explore FST direct array arc encoding

2019-05-26 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov reopened LUCENE-8781:
--

reopening to track backporting this improvement to 8.x and 7.x

> Explore FST direct array arc encoding 
> --
>
> Key: LUCENE-8781
> URL: https://issues.apache.org/jira/browse/LUCENE-8781
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: FST-2-4.png, FST-6-9.png, FST-size.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>

[jira] [Commented] (LUCENE-4012) Make all query classes serializable, and provide a query parser to consume them

2019-05-19 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843445#comment-16843445
 ] 

Mike Sokolov commented on LUCENE-4012:
--

I want to hijack this issue to be about making Query serializable by any means 
necessary. The idea of using Jackson seemed like it could be problematic since 
it tends to expose implementation details (constructor signatures, e.g.), but 
the idea of query serialization is powerful, and we should have it in our bag 
of tricks. A whole class of optimizations stems from analysis of query logs, 
and in order to treat queries as data we need a persistent form for them (not 
just in-memory Java Query objects).

It seems like we have a good angle of attack since LUCENE-3041 landed, adding 
a QueryVisitor. My thought is that each query parser could potentially come 
with a serializer that serializes queries into its language, since not every 
parser can represent every query type. Or maybe the XML query parser is truly 
general and handles everything, so there is no need for any other flavor? I'm 
not sure, though; I seem to recall it has some gaps as well.

I worked up a POC that serializes combinations of Boolean and TermQuery into a 
form that is parseable by the classic query parser, and I think it can be 
extended pretty easily to cover most query types. I have a question here: to 
get it to work, it seemed as if I needed to make BooleanQuery.visit call 
getSubVisitor for every clause (rather than once for each occur-value). This 
broke a single test in TestQueryVisitor, though, which asserts something about 
the sequence of these calls, and I'm not sure if that assertion is an 
invariant of the QueryVisitor contract, or whether it is simply a byproduct of 
the implementation. [~romseygeek] can you shed some light? I can post a WIP PR 
if that would help clarify.
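The serialization idea can be illustrated with a toy visitor walking a tiny query tree and emitting classic query-parser syntax. To stay self-contained this deliberately does NOT use Lucene's actual Query or QueryVisitor classes; everything here is a simplified stand-in.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of serializing a query tree into classic query-parser syntax.
// These classes are simplified stand-ins, not Lucene's Query/QueryVisitor API.
public class QuerySerializerSketch {

    public interface Query { void accept(StringBuilder out); }

    // field:text leaf, e.g. title:lucene
    public record TermQuery(String field, String text) implements Query {
        public void accept(StringBuilder out) {
            out.append(field).append(':').append(text);
        }
    }

    // occur is '+' (MUST), '-' (MUST_NOT), or ' ' (SHOULD, unprefixed)
    public record Clause(char occur, Query query) {}

    public record BooleanQuery(List<Clause> clauses) implements Query {
        public void accept(StringBuilder out) {
            out.append('(');
            for (int i = 0; i < clauses.size(); i++) {
                if (i > 0) out.append(' ');
                Clause c = clauses.get(i);
                if (c.occur() != ' ') out.append(c.occur());
                c.query().accept(out);
            }
            out.append(')');
        }
    }

    public static String serialize(Query q) {
        StringBuilder sb = new StringBuilder();
        q.accept(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Clause> clauses = new ArrayList<>();
        clauses.add(new Clause('+', new TermQuery("title", "lucene")));
        clauses.add(new Clause('-', new TermQuery("body", "solr")));
        System.out.println(serialize(new BooleanQuery(clauses)));
        // prints: (+title:lucene -body:solr)
    }
}
```

As the comment notes, a per-clause traversal like this is exactly why the real POC needed getSubVisitor called per clause rather than once per occur value: the serializer must preserve clause order, not just group by occur.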

> Make all query classes serializable, and provide a query parser to consume 
> them
> ---
>
> Key: LUCENE-4012
> URL: https://issues.apache.org/jira/browse/LUCENE-4012
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/queryparser
>Affects Versions: 4.0-ALPHA
>Reporter: Benson Margulies
>Priority: Major
> Attachments: bq.patch
>
>
> I started off on LUCENE-4004 wanting to use DisjunctionMaxQuery via a parser. 
> However, this wasn't really because I thought that human beans should be 
> improvisationally composing such things. My real goal was to concoct a query 
> tree over *here*, and then serialize it to send to Solr over *there*. 
> It occurs to me that if the XML parser is pretty good for this, JSON would be 
> better. It further occurs to me that the query classes may already all work 
> with Jackson, and, if they don't, the required tweaks will be quite small. By 
> allowing Jackson to write out class names as needed, you get the ability to 
> serialize *any* query, so long as the other side has the classes on the 
> classpath. A trifle verbose, but not as verbose as XML, and furthermore 
> squishable (though not in a URL) via SMILE or BSON.
> So, the goal of this JIRA is to accumulate tweaks to the query classes to 
> make them more 'bean pattern'. An alternative would be Jackson annotations. 
> However, I suspect that folks would be happier to minimize the level of 
> coupling here; in the extreme, the trivial parser could live in contrib if no 
> one wants a dependency, even optional, on Jackson itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4012) Make all query classes serializable, and provide a query parser to consume them

2019-05-19 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-4012:
-
Summary: Make all query classes serializable, and provide a query parser to 
consume them  (was: Make all query classes serializable with Jackson, and 
provide a trivial query parser to consume them)

> Make all query classes serializable, and provide a query parser to consume 
> them
> ---
>
> Key: LUCENE-4012
> URL: https://issues.apache.org/jira/browse/LUCENE-4012
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/queryparser
>Affects Versions: 4.0-ALPHA
>Reporter: Benson Margulies
>Priority: Major
> Attachments: bq.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8798) Autogenerated ID for LeafReaderContexts Within An IndexSearcher

2019-05-13 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838538#comment-16838538
 ] 

Mike Sokolov commented on LUCENE-8798:
--

I think what confused me was that the link to the other JIRA seems to have a 
typo - it links to an ancient unrelated issue, but I guess you meant to link 
to one of your recent ones?

> Autogenerated ID for LeafReaderContexts Within An IndexSearcher
> ---
>
> Key: LUCENE-8798
> URL: https://issues.apache.org/jira/browse/LUCENE-8798
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Minor
>
> It would be good to be able to uniquely identify LeafReaderContext objects 
> associated within a single IndexSearcher. This would allow storing of 
> metadata around segments, such as demonstrated in 
> https://issues.apache.org/jira/browse/LUCENE-879
> The ID will be unique across the IndexSearcher instance and will make no 
> guarantees of any semantic value outside the instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8798) Autogenerated ID for LeafReaderContexts Within An IndexSearcher

2019-05-13 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16838497#comment-16838497
 ] 

Mike Sokolov commented on LUCENE-8798:
--

[~atris] I glanced at the issue you referenced, but I don't see how it relates 
to this. Could you sketch out a use case where this would be better than using 
the ordinals that we use today to identify segments? Would these be persisted? 

> Autogenerated ID for LeafReaderContexts Within An IndexSearcher
> ---
>
> Key: LUCENE-8798
> URL: https://issues.apache.org/jira/browse/LUCENE-8798
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Atri Sharma
>Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8780) Improve ByteBufferGuard in Java 11

2019-04-28 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16828826#comment-16828826
 ] 

Mike Sokolov commented on LUCENE-8780:
--

I don't have a good theory, but I was curious so I ran a few tests, and one 
thing I saw is that if you limit to a single searcher thread, you see only the 
negative side of this distribution, or at least it becomes more negative.

> Improve ByteBufferGuard in Java 11
> --
>
> Key: LUCENE-8780
> URL: https://issues.apache.org/jira/browse/LUCENE-8780
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/store
>Affects Versions: master (9.0)
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
>  Labels: Java11
> Attachments: LUCENE-8780.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In LUCENE-7409 we added {{ByteBufferGuard}} to protect MMapDirectory from 
> crashing the JVM with SIGSEGV when you close and unmap the mmapped buffers of 
> an IndexInput, while another thread is accessing it.
> The idea was to do a volatile write access to flush the caches (to trigger a 
> full fence) and set a non-volatile boolean to true. All accesses would check 
> the boolean and stop the caller from accessing the underlying ByteBuffer. 
> This worked most of the time, until the JVM optimized away the plain read 
> access to the boolean (you can easily see this after some runtime of our 
> by-default ignored testcase).
> With master on Java 11, we can improve the whole thing. Using VarHandles you 
> can use any access type when reading or writing the boolean. After reading 
> Doug Lea's explanation and some 
> testing, I was no longer able to crash my JDK (even after running for minutes 
> unmapping bytebuffers).
> The approach is the same: we do a full-fenced write (standard volatile write) 
> when we unmap, then we yield the thread (to finish in-flight reads in other 
> threads) and then unmap all byte buffers.
> On the test side (read access), instead of using a plain read, we use the new 
> "opaque read". Opaque reads are the same as plain reads, there are only 
> different order requirements. Actually the main difference is explained by 
> Doug like this: "For example in constructions in which the only modification 
> of some variable x is for one thread to write in Opaque (or stronger) mode, 
> X.setOpaque(this, 1), any other thread spinning in 
> while(X.getOpaque(this)!=1){} will eventually terminate. Note that this 
> guarantee does NOT hold in Plain mode, in which spin loops may (and usually 
> do) infinitely loop -- they are not required to notice that a write ever 
> occurred in another thread if it was not seen on first encounter." - And 
> that's what we want to have: we don't want to do volatile reads, but we want 
> to prevent the compiler from optimizing away our read to the boolean. So we 
> want it to "eventually" see the change. By the much stronger volatile write, 
> the cache effects should be visible even faster (like in our Java 8 approach, 
> just now we improved our read side).
> The new code is much slimmer (theoretically we could also use an AtomicBoolean 
> for that and use the new method {{getOpaque()}}, but I wanted to prevent 
> extra method calls, so I used a VarHandle directly).
> It's setup like this:
> - The underlying boolean field is a private member (with unused 
> SuppressWarnings, as it's unused by the java compiler), marked as volatile 
> (that's the recommendation, but in reality it does not matter at all).
> - We create a VarHandle to access this boolean; we never access it directly 
> (this is why the volatile marking does not affect us).
> - We use VarHandle.setVolatile() to change our "invalidated" boolean to 
> "true", so enforcing a full fence
> - On the read side we use VarHandle.getOpaque() instead of VarHandle.get() 
> (like in our old code for Java 8).
> I had to tune our test a bit, as the VarHandles make it take longer until it 
> actually crashes (as optimizations jump in later). I also used a random for 
> the reads to prevent the optimizer from removing all the bytebuffer reads. 
> When we commit this, we can disable the test again (it takes approx 50 secs 
> on my machine).
> I'd still like to see the differences between the plain read and the opaque 
> read in production, so maybe [~mikemccand] or [~rcmuir] can do a comparison 
> with nightly benchmarker?
> Have fun, maybe [~dweiss] has some ideas, too.
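As an illustration of the access modes described above, here is a minimal stand-alone sketch of the guard pattern using plain JDK VarHandles. This is not Lucene's actual ByteBufferGuard; the class and method names are invented for illustration:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch of the pattern: a full-fenced volatile write on invalidation, and an
// opaque read on the hot path that the JIT cannot optimize away entirely.
public class GuardSketch {
  @SuppressWarnings("unused")
  private volatile boolean invalidated; // accessed only through the VarHandle

  private static final VarHandle INVALIDATED;
  static {
    try {
      INVALIDATED = MethodHandles.lookup()
          .findVarHandle(GuardSketch.class, "invalidated", boolean.class);
    } catch (ReflectiveOperationException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  void invalidate() {
    INVALIDATED.setVolatile(this, true); // full fence
    Thread.yield();                      // let in-flight reads finish
  }

  void ensureValid() {
    if ((boolean) INVALIDATED.getOpaque(this)) { // opaque, not plain, read
      throw new IllegalStateException("already closed");
    }
  }

  public static void main(String[] args) {
    GuardSketch g = new GuardSketch();
    g.ensureValid();          // passes before invalidation
    g.invalidate();
    boolean threw = false;
    try { g.ensureValid(); } catch (IllegalStateException e) { threw = true; }
    System.out.println(threw); // true
  }
}
```

The opaque read guarantees the spin-loop/eventual-visibility property quoted from Doug Lea, without paying for a volatile read on every access.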






[jira] [Updated] (LUCENE-8781) Explore FST direct array arc encoding

2019-04-27 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8781:
-
Description: 
This issue is for exploring an alternate FST encoding of Arcs as full-sized 
arrays so Arcs are addressed directly by label, avoiding binary search that we 
use today for arrays of Arcs. PR: https://github.com/apache/lucene-solr/pull/657

h3. Testing

ant test passes. I added some unit tests that were helpful in uncovering bugs 
while implementing; such bugs are more difficult to chase down when uncovered 
by the randomized testing we already do. They don't really test anything new; 
they're just more focused.

I'm not sure why, but ant precommit failed for me with:

{noformat}
 ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls failed 
while scanning class 
'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
(SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
{noformat}

I also got Test2BFST running (it was originally timing out due to excessive 
calls to ramBytesUsage(), which seems to have gotten slow), and it passed; that 
change isn't included here.

h4. Micro-benchmark

I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
conditions. 

h5. English words

A test of looking up existing words in a dictionary of ~17 English words 
shows improvements; the numbers listed are % change in FST size, time to look 
up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
strings that are not in the dict. The comparison is against the current 
codebase with the optimization disabled. A separate comparison showed no 
significant change between the baseline (no opto applied) and the current 
master FST impl with no code changes applied.

||  load=2||   load=4 ||  load=16 ||
| +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |

The "load factor" used for those measurements controls when direct array arc 
encoding is used;
namely when the number of outgoing arcs was > load * (max label - min label).

h5. sequential and random terms

The same test, with terms being a sequence of integers as strings shows a 
larger improvement, around 20% (load=4). This is presumably the best case for 
this delta, where every Arc is encoded as a direct lookup.

When random lowercase ASCII strings are used, a smaller improvement of around 
4% is seen.
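For readers unfamiliar with the trade-off being measured, here is a self-contained sketch (not Lucene's FST code; all names are illustrative) of the two arc-lookup strategies: binary search over sorted arc labels vs. a directly addressed slot table.

```java
import java.util.Arrays;

// Two ways to find the outgoing arc for a label: binary search among the
// present labels, or a single array index into a table of size
// (maxLabel - minLabel + 1) where absent labels hold a sentinel.
public class ArcLookupSketch {
  static final int MISSING = -1;

  // Baseline: binary search among the sorted labels.
  static int binarySearchArc(int[] sortedLabels, int label) {
    int i = Arrays.binarySearch(sortedLabels, label);
    return i >= 0 ? i : MISSING;
  }

  // Direct addressing: one array index, no search. Worth the extra space
  // when the arcs are dense enough relative to the label range.
  static int directArc(int[] slotTable, int minLabel, int label) {
    int slot = label - minLabel;
    if (slot < 0 || slot >= slotTable.length) return MISSING;
    return slotTable[slot];
  }

  public static void main(String[] args) {
    int[] labels = {'a', 'c', 'd', 'g'};            // sorted outgoing arcs
    int min = 'a', max = 'g';
    int[] table = new int[max - min + 1];
    Arrays.fill(table, MISSING);
    for (int i = 0; i < labels.length; i++) table[labels[i] - min] = i;

    System.out.println(binarySearchArc(labels, 'd')); // 2
    System.out.println(directArc(table, min, 'd'));   // 2
    System.out.println(directArc(table, min, 'b'));   // -1 (no such arc)
  }
}
```

The size increases reported above correspond to the sentinel-filled slots for absent labels; the lookup-time wins come from replacing the search loop with one subtraction and one array read.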

h4. luceneutil

Testing w/luceneutil (wikimediumall) we see improvements mostly in the PKLookup 
case. Other results seem noisy, with perhaps a small improvement in some of the 
queries.

{noformat}
TaskQPS base  StdDevQPS opto  StdDev
Pct diff
  OrHighHigh6.93  (3.0%)6.89  (3.1%)   
-0.5% (  -6% -5%)
   OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
-0.5% (  -7% -7%)
Wildcard8.72  (4.7%)8.69  (4.6%)   
-0.4% (  -9% -9%)
  AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
-0.2% (  -5% -5%)
   OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
-0.1% (  -5% -5%)
  AndHighMed   52.23  (4.1%)   52.41  (5.3%)
0.3% (  -8% -   10%)
 MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
0.4% (  -6% -8%)
HighTerm .10  (3.4%) 1116.70  (4.0%)
0.5% (  -6% -8%)
   HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
1.0% ( -15% -   20%)
 AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
1.2% (  -8% -   12%)
   HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)
1.6% ( -19% -   28%)
 LowTerm 1338.94  (3.3%) 1383.90  (5.5%)
3.4% (  -5% -   12%)
PKLookup  120.45  (2.5%)  130.91  (3.5%)
8.7% (   2% -   15%)
{noformat}

h4. FST perf tests

I ran LookupBenchmarkTest to see the impact on the suggesters which make heavy 
use of FSTs. Some show little or no improvement, but in some cases there are 
substantial gains.

!FST-2-4.png!
!FST-6-9.png!
!FST-size.png!

  chart TK

h3. API change / implementation notes

The only change in the public FST API is that the Builder constructor now takes 
an additional
boolean ("useDirectArcAddressing") controlling whether or not this new 
optimization is
applied. However, FST's internal details are not really hidden, so in practice 
any change to its
encoding can have ripple effects in other classes.

This is because the FST decoding is repeated in several places in the code 
base, sometimes with
subtle variations: eg FST, FSTEnum, and fst.Util have very similar, but not 
shared, traversal 

[jira] [Updated] (LUCENE-8781) Explore FST direct array arc encoding

2019-04-27 Thread Mike Sokolov (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-8781:
-
Description: 
This issue is for exploring an alternate FST encoding of Arcs as full-sized 
arrays so Arcs are addressed directly by label, avoiding binary search that we 
use today for arrays of Arcs.

h3. Testing

ant test passes. I added some unit tests that were helpful in uncovering bugs 
while implementing; such bugs are more difficult to chase down when uncovered 
by the randomized testing we already do. They don't really test anything new; 
they're just more focused.

I'm not sure why, but ant precommit failed for me with:

{noformat}
 ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls failed 
while scanning class 
'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
(SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
{noformat}

I also got Test2BFST running (it was originally timing out due to excessive 
calls to ramBytesUsage(), which seems to have gotten slow), and it passed; that 
change isn't included here.

h4. Micro-benchmark

I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
conditions. 

h5. English words

A test of looking up existing words in a dictionary of ~17 English words 
shows improvements; the numbers listed are % change in FST size, time to look 
up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
strings that are not in the dict. The comparison is against the current 
codebase with the optimization disabled. A separate comparison showed no 
significant change between the baseline (no opto applied) and the current 
master FST impl with no code changes applied.

||  load=2||   load=4 ||  load=16 ||
| +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |

The "load factor" used for those measurements controls when direct array arc 
encoding is used;
namely when the number of outgoing arcs was > load * (max label - min label).

h5. sequential and random terms

The same test, with terms being a sequence of integers as strings shows a 
larger improvement, around 20% (load=4). This is presumably the best case for 
this delta, where every Arc is encoded as a direct lookup.

When random lowercase ASCII strings are used, a smaller improvement of around 
4% is seen.

h4. luceneutil

Testing w/luceneutil (wikimediumall) we see improvements mostly in the PKLookup 
case. Other results seem noisy, with perhaps a small improvement in some of the 
queries.

{noformat}
TaskQPS base  StdDevQPS opto  StdDev
Pct diff
  OrHighHigh6.93  (3.0%)6.89  (3.1%)   
-0.5% (  -6% -5%)
   OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
-0.5% (  -7% -7%)
Wildcard8.72  (4.7%)8.69  (4.6%)   
-0.4% (  -9% -9%)
  AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
-0.2% (  -5% -5%)
   OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
-0.1% (  -5% -5%)
  AndHighMed   52.23  (4.1%)   52.41  (5.3%)
0.3% (  -8% -   10%)
 MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
0.4% (  -6% -8%)
HighTerm .10  (3.4%) 1116.70  (4.0%)
0.5% (  -6% -8%)
   HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
1.0% ( -15% -   20%)
 AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
1.2% (  -8% -   12%)
   HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)
1.6% ( -19% -   28%)
 LowTerm 1338.94  (3.3%) 1383.90  (5.5%)
3.4% (  -5% -   12%)
PKLookup  120.45  (2.5%)  130.91  (3.5%)
8.7% (   2% -   15%)
{noformat}

h4. FST perf tests

I ran LookupBenchmarkTest to see the impact on the suggesters which make heavy 
use of FSTs. Some show little or no improvement, but in some cases there are 
substantial gains.

!FST-2-4.png!
!FST-6-9.png!
!FST-size.png!

  chart TK

h3. API change / implementation notes

The only change in the public FST API is that the Builder constructor now takes 
an additional
boolean ("useDirectArcAddressing") controlling whether or not this new 
optimization is
applied. However, FST's internal details are not really hidden, so in practice 
any change to its
encoding can have ripple effects in other classes.

This is because the FST decoding is repeated in several places in the code 
base, sometimes with
subtle variations: eg FST, FSTEnum, and fst.Util have very similar, but not 
shared, traversal code,
and there are a few other places this same or 

[jira] [Created] (LUCENE-8781) Explore FST direct array arc encoding

2019-04-27 Thread Mike Sokolov (JIRA)
Mike Sokolov created LUCENE-8781:


 Summary: Explore FST direct array arc encoding 
 Key: LUCENE-8781
 URL: https://issues.apache.org/jira/browse/LUCENE-8781
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Mike Sokolov


This issue is for exploring an alternate FST encoding of Arcs as full-sized 
arrays so Arcs are addressed directly by label, avoiding binary search that we 
use today for arrays of Arcs.

h3. Testing

ant test passes. I added some unit tests that were helpful in uncovering bugs 
while implementing; such bugs are more difficult to chase down when uncovered 
by the randomized testing we already do. They don't really test anything new; 
they're just more focused.

I'm not sure why, but ant precommit failed for me with:

{noformat}
 ...lucene-solr/solr/common-build.xml:536: Check for forbidden API calls failed 
while scanning class 
'org.apache.solr.metrics.reporters.SolrGangliaReporterTest' 
(SolrGangliaReporterTest.java): java.lang.ClassNotFoundException: 
info.ganglia.gmetric4j.gmetric.GMetric (while looking up details about 
referenced class 'info.ganglia.gmetric4j.gmetric.GMetric')
{noformat}

I also got Test2BFST running (it was originally timing out due to excessive 
calls to ramBytesUsage(), which seems to have gotten slow), and it passed; that 
change isn't included here.

h4. Micro-benchmark

I timed lookups in FST via FSTEnum.seekExact in a unit test under various 
conditions. 

h5. English words

A test of looking up existing words in a dictionary of ~17 English words 
shows improvements; the numbers listed are % change in FST size, time to look 
up (FSTEnum.seekExact) words that are in the dict, and time to look up random 
strings that are not in the dict. The comparison is against the current 
codebase with the optimization disabled. A separate comparison showed no 
significant change between the baseline (no opto applied) and the current 
master FST impl with no code changes applied.

||  load=2||   load=4 ||  load=16 ||
| +4, -6, -7  | +18, -11, -8 | +22, -11.5, -7 |

The "load factor" used for those measurements controls when direct array arc 
encoding is used;
namely when the number of outgoing arcs was > load * (max label - min label).

h5. sequential and random terms

The same test, with terms being a sequence of integers as strings shows a 
larger improvement, around 20% (load=4). This is presumably the best case for 
this delta, where every Arc is encoded as a direct lookup.

When random lowercase ASCII strings are used, a smaller improvement of around 
4% is seen.

h4. luceneutil

Testing w/luceneutil (wikimediumall) we see improvements mostly in the PKLookup 
case. Other results seem noisy, with perhaps a small improvement in some of the 
queries.

{noformat}
TaskQPS base  StdDevQPS opto  StdDev
Pct diff
  OrHighHigh6.93  (3.0%)6.89  (3.1%)   
-0.5% (  -6% -5%)
   OrHighMed   45.15  (3.9%)   44.92  (3.5%)   
-0.5% (  -7% -7%)
Wildcard8.72  (4.7%)8.69  (4.6%)   
-0.4% (  -9% -9%)
  AndHighLow  274.11  (2.6%)  273.58  (3.1%)   
-0.2% (  -5% -5%)
   OrHighLow  241.41  (1.9%)  241.11  (3.5%)   
-0.1% (  -5% -5%)
  AndHighMed   52.23  (4.1%)   52.41  (5.3%)
0.3% (  -8% -   10%)
 MedTerm 1026.24  (3.1%) 1030.52  (4.3%)
0.4% (  -6% -8%)
HighTerm .10  (3.4%) 1116.70  (4.0%)
0.5% (  -6% -8%)
   HighTermDayOfYearSort   14.59  (8.2%)   14.73  (9.3%)
1.0% ( -15% -   20%)
 AndHighHigh   13.45  (6.2%)   13.61  (4.4%)
1.2% (  -8% -   12%)
   HighTermMonthSort   63.09 (12.5%)   64.13 (10.9%)
1.6% ( -19% -   28%)
 LowTerm 1338.94  (3.3%) 1383.90  (5.5%)
3.4% (  -5% -   12%)
PKLookup  120.45  (2.5%)  130.91  (3.5%)
8.7% (   2% -   15%)
{noformat}

h4. FST perf tests

I ran LookupBenchmarkTest to see the impact on the suggesters which make heavy 
use of FSTs. Some show little or no improvement, but in some cases there are 
substantial gains.

  chart TK

h3. API change / implementation notes

The only change in the public FST API is that the Builder constructor now takes 
an additional
boolean ("useDirectArcAddressing") controlling whether or not this new 
optimization is
applied. However, FST's internal details are not really hidden, so in practice 
any change to its
encoding can have ripple effects in other classes.

This is because the FST decoding is repeated in several places in the code 
base, sometimes with
subtle variations: eg FST, FSTEnum, and fst.Util have very similar, but 

[jira] [Commented] (LUCENE-8681) Prorated early termination

2019-04-15 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818266#comment-16818266
 ] 

Mike Sokolov commented on LUCENE-8681:
--

I updated the PR with a new patch that changes the API for creating collectors 
that can early terminate to use an enum to see what it would look like. Is this 
what you had in mind [~rcmuir]? For example {{TopFieldCollector.create(int 
numHits, int countingThreshold)}} becomes  {{TopFieldCollector.create(int 
numHits, TerminationStrategy terminationStrategy)}}, and so on. This masks the 
complexity a bit, and adds an explanatory label at the cost of a small loss of 
flexibility (you can no longer specify exactly how many hits should be 
counted; you just get the choice of counting the number of results, or 
counting up to 1000). The change mostly impacts tests, and some internal calls 
in IndexSearcher. It certainly simplifies the calling API for the pro-rating, 
since it no longer must explain what the thresholding parameter is all about, 
so I think that's an improvement.
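A minimal sketch of what such an enum-based API could look like. The class and method here are illustrative stand-ins, not the actual patch; only the TerminationStrategy name and the 1000-hit counting choice come from the comment above:

```java
// Hypothetical sketch: an enum replaces the raw int countingThreshold
// parameter, trading fine-grained control for a self-explanatory API.
public class TerminationSketch {
  enum TerminationStrategy {
    EXACT_HIT_COUNT,       // count every hit; never early-terminate counting
    APPROXIMATE_HIT_COUNT  // stop counting after a fixed threshold (1000)
  }

  // Translate the enum back into the threshold the collector needs.
  static int countingThreshold(int numHits, TerminationStrategy strategy) {
    switch (strategy) {
      case EXACT_HIT_COUNT: return Integer.MAX_VALUE;
      case APPROXIMATE_HIT_COUNT: return Math.max(numHits, 1000);
      default: throw new AssertionError();
    }
  }

  public static void main(String[] args) {
    System.out.println(countingThreshold(10,
        TerminationStrategy.APPROXIMATE_HIT_COUNT)); // 1000
    System.out.println(countingThreshold(10,
        TerminationStrategy.EXACT_HIT_COUNT) == Integer.MAX_VALUE); // true
  }
}
```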

> Prorated early termination
> --
>
> Key: LUCENE-8681
> URL: https://issues.apache.org/jira/browse/LUCENE-8681
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCollector.create(Sort, int, FieldDoc, int 
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, eg a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level 
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviation to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.
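A hedged sketch of the per-segment budget the description outlines, assuming the binomial special case of the multinomial distribution and the 3-standard-deviation bound mentioned above. The names and exact formula are illustrative, not the patch's code:

```java
// Collect roughly K * (segmentSize / totalSize) hits from each segment, plus
// a few standard deviations of the binomial distribution as an error bound,
// instead of collecting the full K from every segment.
public class ProratedBudgetSketch {
  static int leafBudget(int k, long leafDocs, long totalDocs, double stdDevs) {
    double p = (double) leafDocs / totalDocs;      // fraction of docs here
    double mean = k * p;                           // expected top-K share
    double sigma = Math.sqrt(k * p * (1 - p));     // binomial std deviation
    return (int) Math.min(k, Math.ceil(mean + stdDevs * sigma));
  }

  public static void main(String[] args) {
    // A 10M-doc index split 50/30/20 across three segments, K = 100:
    long total = 10_000_000;
    System.out.println(leafBudget(100, 5_000_000, total, 3)); // 65
    System.out.println(leafBudget(100, 3_000_000, total, 3)); // 44
    System.out.println(leafBudget(100, 2_000_000, total, 3)); // 32
  }
}
```

Here the three segments together collect 141 documents instead of 300, which is where the speedup comes from; raising stdDevs toward MAX_INT recovers the exact behavior, as the description notes.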






[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809124#comment-16809124
 ] 

Mike Sokolov commented on LUCENE-8753:
--

The behavior I'm referring to isn't a problem with the benchmark _per se_, it's 
an intrinsic feature of trying to optimize blocks of things that work better 
when you have a bunch of similar things right next to each other sharing a 
common prefix. The PKLookup test is very sensitive to such optimizations (or 
failures to optimize) because it indexes a consecutive block of terms: a00, 
a01, a02, etc. and at the same time, this nice consecutive property can be 
disturbed by index fragmentation, especially when each term is indexed for 
only a single document. 

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the attached pdf explains the technique visually, in more detail)
>  The principle is to split the list of terms into blocks and use a FST to 
> access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) to compare the terms and 
> select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more 
> compact and speed up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, 
> comparing UniformSplit with BlockTree. Find it in the first comment and also 
> attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree does.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we 
> have already exercised this PostingsFormat extensibility to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
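The block-selection rule described above (target average block size, an allowed ~10% delta, cut at the term with the minimal distinguishing prefix) can be sketched roughly as follows. This is an illustration under assumptions, not the actual UniformSplit code:

```java
import java.util.ArrayList;
import java.util.List;

// Split a sorted term list into blocks: around each target boundary, scan a
// +/-delta window and cut at the term whose distinguishing prefix (vs. its
// predecessor) is shortest, so the FST key for the block stays small.
public class BlockSplitSketch {
  static int distinguishingPrefixLen(String prev, String term) {
    int i = 0;
    int max = Math.min(prev.length(), term.length());
    while (i < max && prev.charAt(i) == term.charAt(i)) i++;
    return i + 1; // shared prefix plus the first differing byte
  }

  static List<Integer> pickBlockStarts(List<String> terms, int target) {
    int delta = Math.max(1, target / 10); // the ~10% variation window
    List<Integer> starts = new ArrayList<>();
    int blockStart = 0;
    while (blockStart + target < terms.size()) {
      int best = blockStart + target, bestLen = Integer.MAX_VALUE;
      for (int i = blockStart + target - delta;
           i <= Math.min(terms.size() - 1, blockStart + target + delta); i++) {
        int len = distinguishingPrefixLen(terms.get(i - 1), terms.get(i));
        if (len < bestLen) { bestLen = len; best = i; }
      }
      starts.add(best);
      blockStart = best;
    }
    return starts;
  }

  public static void main(String[] args) {
    List<String> terms = List.of(
        "aa", "ab", "ac", "ad", "ba", "bb", "bc", "bd", "ca", "cb");
    // Cuts land where the leading byte changes, not strictly every 4 terms:
    System.out.println(pickBlockStarts(terms, 4)); // [4, 8]
  }
}
```

The FST then maps each block's distinguishing prefix to the block's file offset and is consulted with a seek-floor pattern, rather than being a full prefix trie over all terms.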






[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808860#comment-16808860
 ] 

Mike Sokolov commented on LUCENE-8753:
--

I've been working on some other FST-related changes, and running these 
benchmarks a bunch. PKLookup performance varies wildly (for me)  depending on 
just how the terms are distributed among the segments. The PK's are a sequence 
(with no gaps) of base36 identifiers, and I see different performance 
characteristics in a small index vs a larger one (wikimediumall). In general, 
force merge gets me consistent results, though that may not be predictive of 
reality...

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the attached pdf explains the technique visually, in more detail)
>  The principle is to split the list of terms into blocks and use a FST to 
> access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) to compare the terms and 
> select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more 
> compact and speed up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, 
> comparing UniformSplit with BlockTree. Find it in the first comment and also 
> attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree does.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we 
> have already exercised this PostingsFormat extensibility to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley






[jira] [Commented] (LUCENE-8750) Implement setMissingValue for numeric ValueSource sortFields

2019-04-02 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807816#comment-16807816
 ] 

Mike Sokolov commented on LUCENE-8750:
--

Here's a PR: https://github.com/apache/lucene-solr/pull/631

> Implement setMissingValue for numeric ValueSource sortFields
> 
>
> Key: LUCENE-8750
> URL: https://issues.apache.org/jira/browse/LUCENE-8750
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently have setMissingValue for SortFields based on concrete numeric 
> fields, but not for SortFields derived from LongValuesSource and 
> DoubleValuesSource.
> This issue is for implementing 
> LongValuesSource.LongValuesSortField.setMissingValue and 
> DoubleValuesSource.DoubleValuesSortField.setMissingValue.






[jira] [Created] (LUCENE-8750) Implement setMissingValue for numeric ValueSource sortFields

2019-04-02 Thread Mike Sokolov (JIRA)
Mike Sokolov created LUCENE-8750:


 Summary: Implement setMissingValue for numeric ValueSource 
sortFields
 Key: LUCENE-8750
 URL: https://issues.apache.org/jira/browse/LUCENE-8750
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Mike Sokolov


We currently have setMissingValue for SortFields based on concrete numeric 
fields, but not for SortFields derived from LongValuesSource and 
DoubleValuesSource.

This issue is for implementing 
LongValuesSource.LongValuesSortField.setMissingValue and 
DoubleValuesSource.DoubleValuesSortField.setMissingValue.
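Since the Lucene classes themselves aren't reproduced here, the effect of a missing value on sort order can be illustrated with a plain-Java stand-in. This is not Lucene code; the comparator and names are invented for illustration:

```java
import java.util.Arrays;
import java.util.Comparator;

// Documents with no value for the sort field compare as if they had the
// configured substitute, instead of an arbitrary hard-coded default.
public class MissingValueSketch {
  static Comparator<Long> withMissingValue(long missing) {
    return Comparator.comparingLong(v -> v == null ? missing : v.longValue());
  }

  public static void main(String[] args) {
    Long[] docs = {5L, null, 1L};
    // Substitute Long.MAX_VALUE so missing values sort last in ascending order:
    Arrays.sort(docs, withMissingValue(Long.MAX_VALUE));
    System.out.println(Arrays.toString(docs)); // [1, 5, null]
  }
}
```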






[jira] [Commented] (LUCENE-8700) Enable concurrent flushing when no indexing is in progress

2019-02-19 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772305#comment-16772305
 ] 

Mike Sokolov commented on LUCENE-8700:
--

Pull request for this issue: https://github.com/apache/lucene-solr/pull/580

> Enable concurrent flushing when no indexing is in progress
> --
>
> Key: LUCENE-8700
> URL: https://issues.apache.org/jira/browse/LUCENE-8700
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As discussed on the mailing list, this is for adding an IndexWriter.yield() 
> method that callers can use to enable concurrent flushing. 






[jira] [Created] (LUCENE-8700) Enable concurrent flushing when no indexing is in progress

2019-02-19 Thread Mike Sokolov (JIRA)
Mike Sokolov created LUCENE-8700:


 Summary: Enable concurrent flushing when no indexing is in progress
 Key: LUCENE-8700
 URL: https://issues.apache.org/jira/browse/LUCENE-8700
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Mike Sokolov


As discussed on the mailing list, this is for adding an IndexWriter.yield() method
that callers can use to enable concurrent flushing.






[jira] [Commented] (LUCENE-8681) Prorated early termination

2019-02-19 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771986#comment-16771986
 ] 

Mike Sokolov commented on LUCENE-8681:
--

I posted [a new PR|https://github.com/apache/lucene-solr/pull/579] that (I 
think) addresses your comments, [~rcmuir]. I added a test, rebased on master, 
and it passes precommit for me.

> Prorated early termination
> --
>
> Key: LUCENE-8681
> URL: https://issues.apache.org/jira/browse/LUCENE-8681
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Mike Sokolov
>Priority: Major
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCollector.create(Sort, int, FieldDoc, int
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, eg a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level 
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviation to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.
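The proration described in that issue summary can be sketched numerically. The following is an illustrative sketch, not the patch's actual code: each segment's budget is its expected share of the top K plus z standard deviations of the binomial distribution governing how many of the true top K land in that segment.

```python
import math

def prorated_budgets(segment_sizes, k, z=3.0):
    """Illustrative per-segment budgets for prorated early termination.

    The count of true top-k docs landing in a segment of n docs (out of
    N total) is Binomial(k, p) with p = n / N, so collecting the mean
    plus z standard deviations bounds the chance of missing any of them.
    """
    total = sum(segment_sizes)
    budgets = []
    for n in segment_sizes:
        p = n / total
        mean = k * p
        std = math.sqrt(k * p * (1 - p))
        # we never need more than k docs from any single segment
        budgets.append(min(k, math.ceil(mean + z * std)))
    return budgets

# Three segments holding 50%, 30% and 20% of a million docs, top 100:
print(prorated_budgets([500_000, 300_000, 200_000], k=100))  # → [65, 44, 32]
```

With these (hypothetical) segment sizes the collector gathers 141 documents in total rather than the 300 a naive per-segment top-k would, at a bounded risk of ranking error controlled by z.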






[jira] [Commented] (LUCENE-8681) Prorated early termination

2019-02-15 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769854#comment-16769854
 ] 

Mike Sokolov commented on LUCENE-8681:
--

bq. ... doMaxScore and trackTotalHits (did you mean totalHitsThreshold?) aren't 
parameters in master.

Oops - I was looking at an old branch; yes, it's much cleaner now. Although there's
still some javadoc hanging around that refers to doMaxScore.

I'll add the new static createManager with a test soon. I could clean up the
IndexSearcher javadocs here too; although it's not really related, it seems
trivial enough.

I think I see your point about the enum - it's unusual to want to set a 
specific value for these things. Maybe you'd set totalHitsThreshold to be some 
multiple of numHits? But that could certainly remain an expert API, while 
exposing the ability to either use the default (1000) or insist on precise 
counts. I don't know if people are missing that ability or not.

> Prorated early termination
> --
>
> Key: LUCENE-8681
> URL: https://issues.apache.org/jira/browse/LUCENE-8681
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Mike Sokolov
>Priority: Major
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCollector.create(Sort, int, FieldDoc, int
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, eg a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level 
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviation to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.






[jira] [Commented] (LUCENE-8681) Prorated early termination

2019-02-15 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769698#comment-16769698
 ] 

Mike Sokolov commented on LUCENE-8681:
--

There are a bunch of different ways to provide for opt-in here. The most
focused would be to just require users to call {{IndexSearcher.search(Query,
CollectorManager)}}. That's currently the only way to invoke concurrent
collection. We could provide a convenient {{CollectorManager}} via a static
method in {{TopFieldCollector}}. I think that's probably enough for this issue?

I thought about how to make this easier for users by pushing up to higher-level 
APIs.  There's not an obvious right way, but here's my 2c. Following the 
current API one would  add yet more overrides of {{IndexSearcher.search}} and 
{{IndexSearcher.searchAfter}} providing the ability to supply a threshold or 
boolean to enable this feature. I see that there is no such convenience 
available for {{trackTotalHits}}, and I suspect folks felt there were simply 
too many overrides already? It certainly seems that way to me. When I see an 
API getting a great many parameters and overloads with different sets of them, 
I want to introduce a class to hold them (we don't have optional or
default args in Java). IndexSearcher could take a SearchConfig object that
would just be a simple struct holding its various options (sort, numHits,
doDocScores, doMaxScore, trackTotalHits, proratedEarlyTerminationThreshold,
etc.). That would make the search()/searchAfter() methods have simpler signatures
(eventually). Having an object class to hold options (like IndexWriterConfig)
gives a nice centralized way to provide documentation. Also, in a search UI one 
often varies pagination and sorting via a different code path than the core 
Query, so it feels natural to me to use a different abstraction to track those 
things.

> Prorated early termination
> --
>
> Key: LUCENE-8681
> URL: https://issues.apache.org/jira/browse/LUCENE-8681
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Mike Sokolov
>Priority: Major
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCollector.create(Sort, int, FieldDoc, int
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, eg a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level 
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviation to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.




[jira] [Commented] (LUCENE-8681) Prorated early termination

2019-02-13 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767360#comment-16767360
 ] 

Mike Sokolov commented on LUCENE-8681:
--

Yes, I guess it would be necessary to pass a CollectorManager in order to make use
of the multithreading. I'm traveling and not able to focus on this for a couple
of days, but I'll think about it and come back with a proposal soon.

> Prorated early termination
> --
>
> Key: LUCENE-8681
> URL: https://issues.apache.org/jira/browse/LUCENE-8681
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Mike Sokolov
>Priority: Major
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCollector.create(Sort, int, FieldDoc, int
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, eg a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level 
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviation to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.






[jira] [Commented] (LUCENE-8681) Prorated early termination

2019-02-12 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766112#comment-16766112
 ] 

Mike Sokolov commented on LUCENE-8681:
--

bq. so from my perspective, api change is not really crazy here to allow code 
to opt-in.

OK, I agree. How about something like:

{{public static TopFieldCollector createProrated(Sort sort, int numHits, 
FieldDoc after, int totalHitsThreshold, double stddev)}}

or even just

{{public static TopFieldCollector createProrated(Sort sort, int numHits, 
FieldDoc after, int totalHitsThreshold)}}

As far as unit tests go, you are right that there is always the possibility of a
random failure. My approach is to try to write tests where the probability is
vanishingly small and put up with a tiny amount of noise in the test results, 
eg using a high value of stddev. If they are tests *of this feature*, comment 
the test so future devs can understand the likelihood of a random failure. 
Maybe we would not want to randomly swap in this collector implementation to 
just any old test. If it can be isolated to tests of early termination, then 
perhaps it's not too unreasonable to place some constraints on those?

> Prorated early termination
> --
>
> Key: LUCENE-8681
> URL: https://issues.apache.org/jira/browse/LUCENE-8681
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Mike Sokolov
>Priority: Major
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCollector.create(Sort, int, FieldDoc, int
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, eg a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level 
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviation to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.






[jira] [Commented] (SOLR-13233) SpellCheckCollator ignores stacked tokens

2019-02-10 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764410#comment-16764410
 ] 

Mike Sokolov commented on SOLR-13233:
-

I wonder if SpellCheckCollator should just ignore all stacked tokens, both the
original and the injected ones. If there is a synonym for something, it was
probably spelled correctly. And as far as WordDelimiterGraphFilter goes, it is not
going to reassemble the original token, I think. I mean, for input like
"five-figgered", what is the best we can hope for? I doubt it would ever come up
with "five-fingered" even if we improved the mapping of original vs. injected tokens.

> SpellCheckCollator ignores stacked tokens
> -
>
> Key: SOLR-13233
> URL: https://issues.apache.org/jira/browse/SOLR-13233
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alan Woodward
>Priority: Major
>
> When building collations, SpellCheckCollator ignores any tokens with a 
> position increment of 0, assuming that they've been injected and may 
> therefore have incorrect offsets (injected terms generally keep the offsets 
> of the terms they're replacing, as they don't themselves appear anywhere in 
> the original source).  However, this assumption is not necessarily correct - 
> for example, WordDelimiterGraphFilter emits stacked tokens *before* the 
> original token, because it needs to iterate through all stacked tokens to 
> correctly set the original token's position length.






[jira] [Commented] (LUCENE-8681) Prorated early termination

2019-02-09 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764163#comment-16764163
 ] 

Mike Sokolov commented on LUCENE-8681:
--

I hope I'm not reading this the right way (?!? :), but I do agree this is 
potentially inexact. The point of the math above is that we can bound the 
inexactness. The current API assumes the user is OK with some inexact counting, 
and this just builds on that. I can see how we might want to provide more 
explicit control, but whether or not we do that is a somewhat independent 
question.

The assumption this patch makes is that documents are distributed among 
segments in a uniform random way. As I think about it, perhaps there are cases 
where there would be a correlation - eg if the field is a timestamp, recent 
documents could very well be grouped together in a single segment. Although 
multi-threaded indexing and merging will tend to spread them around, there 
could still be a correlation.

> Prorated early termination
> --
>
> Key: LUCENE-8681
> URL: https://issues.apache.org/jira/browse/LUCENE-8681
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Mike Sokolov
>Priority: Major
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCollector.create(Sort, int, FieldDoc, int
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, eg a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level 
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviation to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.






[jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-02-07 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762692#comment-16762692
 ] 

Mike Sokolov commented on LUCENE-8635:
--

[~akjain] that's strange yeah -- this patch was supposed to avoid kicking in 
for PK fields right?

> Lazy loading Lucene FST offheap using mmap
> --
>
> Key: LUCENE-8635
> URL: https://issues.apache.org/jira/browse/LUCENE-8635
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/FSTs
> Environment: I used below setup for es_rally tests:
> single node i3.xlarge running ES 6.5
> es_rally was running on another i3.xlarge instance
>Reporter: Ankit Jain
>Priority: Major
> Attachments: fst-offheap-ra-rev.patch, fst-offheap-rev.patch, 
> offheap.patch, optional_offheap_ra.patch, ra.patch, rally_benchmark.xlsx
>
>
> Currently, FST loads all the terms into heap memory during index open. This 
> causes frequent JVM OOM issues if the term size gets big. A better way of 
> doing this will be to lazily load FST using mmap. That ensures only the 
> required terms get loaded into memory.
>  
> Lucene can expose API for providing list of fields to load terms offheap. I'm 
> planning to take following approach for this:
>  # Add a boolean property fstOffHeap in FieldInfo
>  # Pass list of offheap fields to lucene during index open (ALL can be 
> special keyword for loading ALL fields offheap)
>  # Initialize the fstOffHeap property during lucene index open
>  # FieldReader invokes default FST constructor or OffHeap constructor based 
> on fstOffHeap field
>  
> I created a patch (that loads all fields offheap), did some benchmarks using 
> es_rally and results look good.






[jira] [Comment Edited] (LUCENE-8681) Prorated early termination

2019-02-07 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762656#comment-16762656
 ] 

Mike Sokolov edited comment on LUCENE-8681 at 2/7/19 1:28 PM:
--

bq. However I wonder if this could be implemented directly in a custom 
CollectorManager, currently the CollectorManager creates a Collector for each 
leaf if the executor is not null and merge all the 

Yes, that's totally do-able, but I think doing this here has value since it is 
a good default for anyone performing multithreaded collection with 
approximation enabled, and doesn't really impact single-threaded collection in 
a significant way.

Just to be a little more concrete about the scales here. With a "safety margin" 
of 3 std deviations from the mean, you can expect to see ranking errors (eg 
some result whose true rank is > N in the returned top N) in 1/740 queries (see 
https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule and note that 
this is a 1-sided distribution -- it's OK if a segment has *fewer* top N 
results than we expect).  With 5 standard deviations, the frequency frpos to 
around 1 in 3 million. 


was (Author: sokolov):
bq. However I wonder if this could be implemented directly in a custom 
CollectorManager, currently the CollectorManager creates a Collector for each 
leaf if the executor is not null and merge all the 

Yes, that's totally do-able, but I think doing this here has value since it is 
a good default for anyone performing multithreaded collection with 
approximation enabled, and doesn't really impact single-threaded collection in 
a significant way.

> Prorated early termination
> --
>
> Key: LUCENE-8681
> URL: https://issues.apache.org/jira/browse/LUCENE-8681
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Mike Sokolov
>Priority: Major
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCollector.create(Sort, int, FieldDoc, int
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, eg a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level 
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviation to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8681) Prorated early termination

2019-02-07 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762656#comment-16762656
 ] 

Mike Sokolov commented on LUCENE-8681:
--

bq. However I wonder if this could be implemented directly in a custom 
CollectorManager, currently the CollectorManager creates a Collector for each 
leaf if the executor is not null and merge all the 

Yes, that's totally do-able, but I think doing this here has value since it is 
a good default for anyone performing multithreaded collection with 
approximation enabled, and doesn't really impact single-threaded collection in 
a significant way.

> Prorated early termination
> --
>
> Key: LUCENE-8681
> URL: https://issues.apache.org/jira/browse/LUCENE-8681
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Mike Sokolov
>Priority: Major
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCollector.create(Sort, int, FieldDoc, int 
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, e.g. a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of Integer.MAX_VALUE would ensure that 
> no leaf-level termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviations to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.
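The prorated budget described above can be sketched as follows. This is a hypothetical illustration, not the actual code from the patch: the class and method names are invented, and it models the placement of each true top-K hit in a leaf as a binomial draw with probability equal to the leaf's share of the index, padding the proportional share by a configurable number of standard deviations (the patch uses 3).

```java
// Hypothetical sketch of a prorated per-leaf hit budget; the names here are
// illustrative and do not correspond to the actual Lucene API in the patch.
public class ProratedBudget {
    /**
     * Number of hits to collect from a leaf holding leafMaxDoc of totalMaxDoc
     * documents, when topK hits are requested globally. Each top-K hit lands in
     * this leaf with probability p = leafMaxDoc / totalMaxDoc; the proportional
     * share topK * p is padded by zScore binomial standard deviations.
     */
    static int leafBudget(int topK, int leafMaxDoc, int totalMaxDoc, double zScore) {
        double p = (double) leafMaxDoc / totalMaxDoc;   // leaf's share of the index
        double mean = topK * p;                         // expected top-K docs in this leaf
        double stddev = Math.sqrt(topK * p * (1 - p));  // binomial standard deviation
        int budget = (int) Math.ceil(mean + zScore * stddev);
        return Math.min(topK, budget);                  // never collect more than topK
    }
}
```

With zScore = 3, a leaf holding half of the index would collect 65 of a requested 100 hits instead of all 100; a very large zScore recovers full per-leaf collection, matching the "no leaf-level termination" setting discussed above.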



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


