[jira] [Created] (LUCENE-9942) CLONE - Proper ASCII folding of Danish/Norwegian characters Ø, Å

2021-04-28 Thread Jacob Lauritzen (Jira)
Jacob Lauritzen created LUCENE-9942:
---

 Summary: CLONE - Proper ASCII folding of Danish/Norwegian 
characters Ø, Å
 Key: LUCENE-9942
 URL: https://issues.apache.org/jira/browse/LUCENE-9942
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Jacob Lauritzen


The current version of ASCIIFoldingFilter maps Å, å to A, a and Ø, ø to O, o,
which I believe is incorrect.

Å was added by Norway in 1917, and by Denmark in 1948, as a replacement for
Aa (which is mapped to aa in the ASCIIFoldingFilter). Aa is still used in a lot
of names (for example, the second-largest city in Denmark was originally named
Aarhus, renamed to Århus in 1948, and renamed back to Aarhus in 2010 for
internationalization purposes).

The story of Ø is similar. It is equivalent to Œ (which is mapped to oe), not
to ö (which is mapped to o), and is generally rendered as oe in ASCII text.

The third Danish character Æ is already properly mapped to AE.
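
For illustration only (not part of the original report): until the mapping tables change, an
application that wants the folding proposed here can rewrite the characters with a
MappingCharFilter ahead of its analysis chain. MappingCharFilter and NormalizeCharMap are
existing Lucene analysis classes; the specific mappings below are this issue's proposal, not
current ASCIIFoldingFilter behavior.

```java
import java.io.Reader;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class DanishFoldingSketch {
  // Fold Å/å to Aa/aa and Ø/ø to Oe/oe before tokenization, instead of relying on
  // ASCIIFoldingFilter's single-letter mapping.
  static Reader wrapWithDanishFolding(Reader input) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("Å", "Aa");
    builder.add("å", "aa");
    builder.add("Ø", "Oe");
    builder.add("ø", "oe");
    return new MappingCharFilter(builder.build(), input);
  }
}
```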






[jira] [Commented] (LUCENE-9939) Proper ASCII folding of Danish/Norwegian characters Ø, Å

2021-04-28 Thread Jacob Lauritzen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334562#comment-17334562
 ] 

Jacob Lauritzen commented on LUCENE-9939:
-

Understood, thanks.

> Proper ASCII folding of Danish/Norwegian characters Ø, Å
> 
>
> Key: LUCENE-9939
> URL: https://issues.apache.org/jira/browse/LUCENE-9939
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Jacob Lauritzen
>Priority: Minor
>  Labels: easyfix, patch, patch-available
> Attachments: LUCENE-9939.patch
>
>
> The current version of ASCIIFoldingFilter maps Å, å to A, a and Ø, ø to
> O, o, which I believe is incorrect.
> Å was added by Norway in 1917, and by Denmark in 1948, as a replacement for
> Aa (which is mapped to aa in the ASCIIFoldingFilter). Aa is still used in a
> lot of names (for example, the second-largest city in Denmark was originally
> named Aarhus, renamed to Århus in 1948, and renamed back to Aarhus in 2010 for
> internationalization purposes).
> The story of Ø is similar. It is equivalent to Œ (which is mapped to oe), not
> to ö (which is mapped to o), and is generally rendered as oe in ASCII text.
> The third Danish character Æ is already properly mapped to AE.






[jira] [Updated] (LUCENE-9939) Proper ASCII folding of Danish/Norwegian characters Ø, Å

2021-04-28 Thread Jacob Lauritzen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Lauritzen updated LUCENE-9939:

Status: Open  (was: Patch Available)

> Proper ASCII folding of Danish/Norwegian characters Ø, Å
> 
>
> Key: LUCENE-9939
> URL: https://issues.apache.org/jira/browse/LUCENE-9939
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Jacob Lauritzen
>Priority: Minor
>  Labels: easyfix, patch, patch-available
> Attachments: LUCENE-9939.patch
>
>
> The current version of ASCIIFoldingFilter maps Å, å to A, a and Ø, ø to
> O, o, which I believe is incorrect.
> Å was added by Norway in 1917, and by Denmark in 1948, as a replacement for
> Aa (which is mapped to aa in the ASCIIFoldingFilter). Aa is still used in a
> lot of names (for example, the second-largest city in Denmark was originally
> named Aarhus, renamed to Århus in 1948, and renamed back to Aarhus in 2010 for
> internationalization purposes).
> The story of Ø is similar. It is equivalent to Œ (which is mapped to oe), not
> to ö (which is mapped to o), and is generally rendered as oe in ASCII text.
> The third Danish character Æ is already properly mapped to AE.






[jira] [Resolved] (LUCENE-9942) CLONE - Proper ASCII folding of Danish/Norwegian characters Ø, Å

2021-04-28 Thread Jacob Lauritzen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Lauritzen resolved LUCENE-9942.
-
Resolution: Duplicate

> CLONE - Proper ASCII folding of Danish/Norwegian characters Ø, Å
> 
>
> Key: LUCENE-9942
> URL: https://issues.apache.org/jira/browse/LUCENE-9942
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Jacob Lauritzen
>Priority: Minor
>  Labels: easyfix, patch, patch-available
>
> The current version of ASCIIFoldingFilter maps Å, å to A, a and Ø, ø to
> O, o, which I believe is incorrect.
> Å was added by Norway in 1917, and by Denmark in 1948, as a replacement for
> Aa (which is mapped to aa in the ASCIIFoldingFilter). Aa is still used in a
> lot of names (for example, the second-largest city in Denmark was originally
> named Aarhus, renamed to Århus in 1948, and renamed back to Aarhus in 2010 for
> internationalization purposes).
> The story of Ø is similar. It is equivalent to Œ (which is mapped to oe), not
> to ö (which is mapped to o), and is generally rendered as oe in ASCII text.
> The third Danish character Æ is already properly mapped to AE.






[jira] [Resolved] (LUCENE-9939) Proper ASCII folding of Danish/Norwegian characters Ø, Å

2021-04-28 Thread Jacob Lauritzen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Lauritzen resolved LUCENE-9939.
-
Resolution: Won't Fix

> Proper ASCII folding of Danish/Norwegian characters Ø, Å
> 
>
> Key: LUCENE-9939
> URL: https://issues.apache.org/jira/browse/LUCENE-9939
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Jacob Lauritzen
>Priority: Minor
>  Labels: easyfix, patch, patch-available
> Attachments: LUCENE-9939.patch
>
>
> The current version of ASCIIFoldingFilter maps Å, å to A, a and Ø, ø to
> O, o, which I believe is incorrect.
> Å was added by Norway in 1917, and by Denmark in 1948, as a replacement for
> Aa (which is mapped to aa in the ASCIIFoldingFilter). Aa is still used in a
> lot of names (for example, the second-largest city in Denmark was originally
> named Aarhus, renamed to Århus in 1948, and renamed back to Aarhus in 2010 for
> internationalization purposes).
> The story of Ø is similar. It is equivalent to Œ (which is mapped to oe), not
> to ö (which is mapped to o), and is generally rendered as oe in ASCII text.
> The third Danish character Æ is already properly mapped to AE.






[GitHub] [lucene] romseygeek opened a new pull request #109: LUCENE-9930: Only load Ukrainian morfologik dictionary once per JVM

2021-04-28 Thread GitBox


romseygeek opened a new pull request #109:
URL: https://github.com/apache/lucene/pull/109


   The UkrainianMorfologikAnalyzer was reloading its dictionary every
   time it created a new TokenStreamComponents, which meant that
   while the analyzer was open it would hold onto one copy of the
   dictionary per thread.
   
   This commit loads the dictionary in a lazy static initializer, alongside
   its stopword set.  It also makes the normalizer charmap a singleton
   so that we do not rebuild the same immutable object on every call
   to initReader.
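
   As a rough illustration of the lazy-initializer idea (a sketch only, not the actual
   patch; the holder class name and resource path below are hypothetical):
   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import morfologik.stemming.Dictionary;

   final class UkrainianDictionaryHolder {
     private UkrainianDictionaryHolder() {}

     // The holder class is only initialized on first use, and the JVM guarantees the
     // static initializer runs at most once, so there is one Dictionary per JVM.
     static final Dictionary DICTIONARY = load();

     private static Dictionary load() {
       try {
         // Hypothetical resource path; the real analyzer resolves its own bundled resource.
         return Dictionary.read(UkrainianDictionaryHolder.class.getResource("ukrainian.dict"));
       } catch (IOException e) {
         throw new UncheckedIOException(e);
       }
     }
   }
   ```
   Each new TokenStreamComponents can then reference UkrainianDictionaryHolder.DICTIONARY
   instead of re-reading the dictionary files.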





[GitHub] [lucene] romseygeek opened a new pull request #110: LUCENE-9940: DisjunctionMaxQuery shouldn't depend on disjunct order for equals checks

2021-04-28 Thread GitBox


romseygeek opened a new pull request #110:
URL: https://github.com/apache/lucene/pull/110


   DisjunctionMaxQuery stores its disjuncts in a `Query[]`, and uses
   `Arrays.equals()` for comparisons in its `equals()` implementation.
   This means that the order in which disjuncts are added to the query
   matters for equality checks.
   
   This commit changes DMQ to instead store its disjuncts in a Multiset,
   meaning that ordering no longer matters.  The `getDisjuncts()`
   method now returns a `Collection` rather than a `List`, and
   some tests are changed to use query equality checks rather than 
   iterating over disjuncts and expecting a particular order.
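
   As a minimal sketch of the order-insensitive comparison (illustrative only; the PR
   itself uses Lucene's internal Multiset rather than the count map shown here):
   ```java
   import java.util.Arrays;
   import java.util.List;
   import java.util.Map;
   import java.util.function.Function;
   import java.util.stream.Collectors;

   final class DisjunctEqualitySketch {
     // Two disjunct lists are equal as multisets when every element occurs the same
     // number of times in both, regardless of insertion order.
     static <T> boolean equalAsMultisets(List<T> left, List<T> right) {
       Map<T, Long> leftCounts =
           left.stream().collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
       Map<T, Long> rightCounts =
           right.stream().collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
       return leftCounts.equals(rightCounts);
     }

     public static void main(String[] args) {
       System.out.println(equalAsMultisets(Arrays.asList("a", "b"), Arrays.asList("b", "a"))); // true
       System.out.println(equalAsMultisets(Arrays.asList("a", "a"), Arrays.asList("a", "b"))); // false
     }
   }
   ```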





[GitHub] [lucene] neoremind edited a comment on pull request #91: LUCENE-9932: Performance improvement for BKD index building

2021-04-28 Thread GitBox


neoremind edited a comment on pull request #91:
URL: https://github.com/apache/lucene/pull/91#issuecomment-827678981


   Note: the benchmark below is based on the TimSort fallback sorter; I will re-run the test case later.
   
   I spent some time running a real-world benchmark. The `IndexWriter` speedup is what we expected: faster than the main branch, with total elapsed time (including adding docs, building the index, and merging) decreased by about 20%. If we only consider `flush_time`, the speedup is even more obvious; the time cost drops by about 40%-50%.
   
   1) Run [IndexAndSearchOpenStreetMaps1D.java](https://github.com/neoremind/luceneutil/blob/master/src/main/perf/IndexAndSearchOpenStreetMaps1D.java) against the two branches and record the [log](https://github.com/neoremind/luceneutil/tree/master/log/OpenStreetMaps).
   _Note: the query stage is commented out, and some of the code was modified to adapt to the latest Lucene main branch._
   
   main branch:
   ```
   # egrep "flush time|sec to build index" open-street-maps.log
   DWPT 0 [2021-04-27T11:33:04.518908Z; main]: flush time 17284.537739 msec
   DWPT 0 [2021-04-27T11:33:37.888449Z; main]: flush time 12039.476885 msec
   72.49147722 sec to build index
   ```
   PR branch:
   ```
   #egrep "flush time|sec to build index" open-street-maps-optimized.log
   DWPT 0 [2021-04-27T11:35:00.619683Z; main]: flush time 9313.007647 msec
   DWPT 0 [2021-04-27T11:35:29.575254Z; main]: flush time 8631.820226 msec
   59.252797133 sec to build index
   ```
   
   2) Furthermore, I came up with the idea of using TPC-H LINEITEM to verify. I have a 10GB TPC-H dataset and developed a new test case that imports the first 5 INT fields, which is more typical of real-world data.
   
   Run [IndexAndSearchTpcHLineItem.java](https://github.com/neoremind/luceneutil/blob/master/src/main/perf/IndexAndSearchTpcHLineItem.java) against the two branches and record the [log](https://github.com/neoremind/luceneutil/tree/master/log/TPC-H-LINEITEM).
   
   main branch:
   ```
   egrep "flush time|sec to build index" tpch-lineitem.log
   DWPT 0 [2021-04-27T11:17:25.329006Z; main]: flush time 13850.23328 msec
   DWPT 0 [2021-04-27T11:17:50.289370Z; main]: flush time 12228.723665 msec
   DWPT 0 [2021-04-27T11:18:15.546002Z; main]: flush time 12537.085005 msec
   DWPT 0 [2021-04-27T11:18:40.140413Z; main]: flush time 11819.225223 msec
   DWPT 0 [2021-04-27T11:19:04.850989Z; main]: flush time 12004.380921 msec
   DWPT 0 [2021-04-27T11:19:29.435183Z; main]: flush time 11850.273453 msec
   DWPT 0 [2021-04-27T11:19:54.016951Z; main]: flush time 11882.316067 msec
   DWPT 0 [2021-04-27T11:20:18.932727Z; main]: flush time 12223.151464 msec
   DWPT 0 [2021-04-27T11:20:43.522117Z; main]: flush time 11871.276323 msec
   DWPT 0 [2021-04-27T11:20:52.060300Z; main]: flush time 3422.434221 msec
   271.188917715 sec to build index
   ```
   PR branch:
   ```
egrep "flush time|sec to build index" tpch-lineitem-optimized.log
   DWPT 0 [2021-04-27T11:24:00.362128Z; main]: flush time 7573.05091 msec
   DWPT 0 [2021-04-27T11:24:19.498948Z; main]: flush time 7355.376016 msec
   DWPT 0 [2021-04-27T11:24:38.602117Z; main]: flush time 7287.306154 msec
   DWPT 0 [2021-04-27T11:24:57.541930Z; main]: flush time 7227.514396 msec
   DWPT 0 [2021-04-27T11:25:16.474158Z; main]: flush time 7236.208865 msec
   DWPT 0 [2021-04-27T11:25:35.339855Z; main]: flush time 7152.876269 msec
   DWPT 0 [2021-04-27T11:25:54.10Z; main]: flush time 7080.405571 msec
   DWPT 0 [2021-04-27T11:26:12.985489Z; main]: flush time 7188.012278 msec
   DWPT 0 [2021-04-27T11:26:31.857053Z; main]: flush time 7176.303704 msec
   DWPT 0 [2021-04-27T11:26:38.838771Z; main]: flush time 2185.742347 msec
   213.175509249 sec to build index
   ```
   
   For the benchmark commands, please refer to [my document](https://github.com/neoremind/luceneutil/tree/master/command). 
   
   Test environment:
   ```
   CPU: 
   Architecture:  x86_64
   CPU op-mode(s):32-bit, 64-bit
   Byte Order:Little Endian
   CPU(s):32
   On-line CPU(s) list:   0-31
   Thread(s) per core:2
   Core(s) per socket:16
   Socket(s): 1
   NUMA node(s):  1
   Vendor ID: GenuineIntel
   CPU family:6
   Model: 85
   Model name:Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
   Stepping:  4
   CPU MHz:   2500.000
   BogoMIPS:  5000.00
   Hypervisor vendor: KVM
   Virtualization type:   full
   L1d cache: 32K
   L1i cache: 32K
   L2 cache:  1024K
   L3 cache:  33792K
   NUMA node0 CPU(s): 0-31
   
   Memory: 
   $cat /proc/meminfo
   MemTotal:   65703704 kB
   
   Disk: SATA 
   $fdisk -l | grep Disk
   Disk /dev/vdb: 35184.4 GB, 35184372088832 bytes, 68719476736 sectors
   
   OS: 
   Linux 4.19.57-15.1.al7.x86_64
   
   JDK:
   openjdk version "11.0.11" 2021-04-20 LTS
   OpenJDK Runtime Environment 18.9 (buil

[GitHub] [lucene] neoremind edited a comment on pull request #91: LUCENE-9932: Performance improvement for BKD index building

2021-04-28 Thread GitBox


neoremind edited a comment on pull request #91:
URL: https://github.com/apache/lucene/pull/91#issuecomment-827678981


   Note: the benchmark below is based on the TimSort fallback sorter; I will re-run the test case later to see if it is even better.
   
   I spent some time running a real-world benchmark. The `IndexWriter` speedup is what we expected: faster than the main branch, with total elapsed time (including adding docs, building the index, and merging) decreased by about 20%. If we only consider `flush_time`, the speedup is even more obvious; the time cost drops by about 40%-50%.
   
   1) Run [IndexAndSearchOpenStreetMaps1D.java](https://github.com/neoremind/luceneutil/blob/master/src/main/perf/IndexAndSearchOpenStreetMaps1D.java) against the two branches and record the [log](https://github.com/neoremind/luceneutil/tree/master/log/OpenStreetMaps).
   _Note: the query stage is commented out, and some of the code was modified to adapt to the latest Lucene main branch._
   
   main branch:
   ```
   # egrep "flush time|sec to build index" open-street-maps.log
   DWPT 0 [2021-04-27T11:33:04.518908Z; main]: flush time 17284.537739 msec
   DWPT 0 [2021-04-27T11:33:37.888449Z; main]: flush time 12039.476885 msec
   72.49147722 sec to build index
   ```
   PR branch:
   ```
   #egrep "flush time|sec to build index" open-street-maps-optimized.log
   DWPT 0 [2021-04-27T11:35:00.619683Z; main]: flush time 9313.007647 msec
   DWPT 0 [2021-04-27T11:35:29.575254Z; main]: flush time 8631.820226 msec
   59.252797133 sec to build index
   ```
   
   2) Furthermore, I came up with the idea of using TPC-H LINEITEM to verify. I have a 10GB TPC-H dataset and developed a new test case that imports the first 5 INT fields, which is more typical of real-world data.
   
   Run [IndexAndSearchTpcHLineItem.java](https://github.com/neoremind/luceneutil/blob/master/src/main/perf/IndexAndSearchTpcHLineItem.java) against the two branches and record the [log](https://github.com/neoremind/luceneutil/tree/master/log/TPC-H-LINEITEM).
   
   main branch:
   ```
   egrep "flush time|sec to build index" tpch-lineitem.log
   DWPT 0 [2021-04-27T11:17:25.329006Z; main]: flush time 13850.23328 msec
   DWPT 0 [2021-04-27T11:17:50.289370Z; main]: flush time 12228.723665 msec
   DWPT 0 [2021-04-27T11:18:15.546002Z; main]: flush time 12537.085005 msec
   DWPT 0 [2021-04-27T11:18:40.140413Z; main]: flush time 11819.225223 msec
   DWPT 0 [2021-04-27T11:19:04.850989Z; main]: flush time 12004.380921 msec
   DWPT 0 [2021-04-27T11:19:29.435183Z; main]: flush time 11850.273453 msec
   DWPT 0 [2021-04-27T11:19:54.016951Z; main]: flush time 11882.316067 msec
   DWPT 0 [2021-04-27T11:20:18.932727Z; main]: flush time 12223.151464 msec
   DWPT 0 [2021-04-27T11:20:43.522117Z; main]: flush time 11871.276323 msec
   DWPT 0 [2021-04-27T11:20:52.060300Z; main]: flush time 3422.434221 msec
   271.188917715 sec to build index
   ```
   PR branch:
   ```
egrep "flush time|sec to build index" tpch-lineitem-optimized.log
   DWPT 0 [2021-04-27T11:24:00.362128Z; main]: flush time 7573.05091 msec
   DWPT 0 [2021-04-27T11:24:19.498948Z; main]: flush time 7355.376016 msec
   DWPT 0 [2021-04-27T11:24:38.602117Z; main]: flush time 7287.306154 msec
   DWPT 0 [2021-04-27T11:24:57.541930Z; main]: flush time 7227.514396 msec
   DWPT 0 [2021-04-27T11:25:16.474158Z; main]: flush time 7236.208865 msec
   DWPT 0 [2021-04-27T11:25:35.339855Z; main]: flush time 7152.876269 msec
   DWPT 0 [2021-04-27T11:25:54.10Z; main]: flush time 7080.405571 msec
   DWPT 0 [2021-04-27T11:26:12.985489Z; main]: flush time 7188.012278 msec
   DWPT 0 [2021-04-27T11:26:31.857053Z; main]: flush time 7176.303704 msec
   DWPT 0 [2021-04-27T11:26:38.838771Z; main]: flush time 2185.742347 msec
   213.175509249 sec to build index
   ```
   
   For the benchmark commands, please refer to [my document](https://github.com/neoremind/luceneutil/tree/master/command). 
   
   Test environment:
   ```
   CPU: 
   Architecture:  x86_64
   CPU op-mode(s):32-bit, 64-bit
   Byte Order:Little Endian
   CPU(s):32
   On-line CPU(s) list:   0-31
   Thread(s) per core:2
   Core(s) per socket:16
   Socket(s): 1
   NUMA node(s):  1
   Vendor ID: GenuineIntel
   CPU family:6
   Model: 85
   Model name:Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
   Stepping:  4
   CPU MHz:   2500.000
   BogoMIPS:  5000.00
   Hypervisor vendor: KVM
   Virtualization type:   full
   L1d cache: 32K
   L1i cache: 32K
   L2 cache:  1024K
   L3 cache:  33792K
   NUMA node0 CPU(s): 0-31
   
   Memory: 
   $cat /proc/meminfo
   MemTotal:   65703704 kB
   
   Disk: SATA 
   $fdisk -l | grep Disk
   Disk /dev/vdb: 35184.4 GB, 35184372088832 bytes, 68719476736 sectors
   
   OS: 
   Linux 4.19.57-15.1.al7.x86_64
   
   JDK:
   openjdk version "11.0.11" 2021-04-20 LTS
   OpenJDK Ru

[GitHub] [lucene] msokolov closed pull request #106: LUCENE-9905: rename VectorValues.SearchStrategy to VectorValues.SimilarityFunction

2021-04-28 Thread GitBox


msokolov closed pull request #106:
URL: https://github.com/apache/lucene/pull/106


   





[GitHub] [lucene] msokolov commented on pull request #106: LUCENE-9905: rename VectorValues.SearchStrategy to VectorValues.SimilarityFunction

2021-04-28 Thread GitBox


msokolov commented on pull request #106:
URL: https://github.com/apache/lucene/pull/106#issuecomment-828422200


   I merged this separately





[GitHub] [lucene] romseygeek merged pull request #109: LUCENE-9930: Only load Ukrainian morfologik dictionary once per JVM

2021-04-28 Thread GitBox


romseygeek merged pull request #109:
URL: https://github.com/apache/lucene/pull/109


   





[jira] [Commented] (LUCENE-9930) UkrainianMorfologikAnalyzer reloads its Dictionary for every new TokenStreamComponents instance

2021-04-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334710#comment-17334710
 ] 

ASF subversion and git services commented on LUCENE-9930:
-

Commit 90d363ece7116954c530a74a014487fcbdee7610 in lucene's branch 
refs/heads/main from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=90d363e ]

LUCENE-9930: Only load Ukrainian morfologik dictionary once per JVM (#109)

The UkrainianMorfologikAnalyzer was reloading its dictionary every
time it created a new TokenStreamComponents, which meant that
while the analyzer was open it would hold onto one copy of the
dictionary per thread.

This commit loads the dictionary in a lazy static initializer, alongside
its stopword set. It also makes the normalizer charmap a singleton
so that we do not rebuild the same immutable object on every call
to initReader.

> UkrainianMorfologikAnalyzer reloads its Dictionary for every new 
> TokenStreamComponents instance
> ---
>
> Key: LUCENE-9930
> URL: https://issues.apache.org/jira/browse/LUCENE-9930
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Large static data structures should be loaded in Analyzer constructors and 
> shared between threads, but the UkrainianMorfologikAnalyzer is loading its 
> dictionary in `createComponents`, which means it is reloaded and stored on 
> every new analysis thread.  If you have a large dictionary and highly 
> concurrent indexing then this can lead to you running out of memory as 
> multiple copies of the dictionary are held in thread locals.






[GitHub] [lucene] iverase commented on a change in pull request #107: LUCENE-9047: Move the Directory APIs to be little endian (take 2)

2021-04-28 Thread GitBox


iverase commented on a change in pull request #107:
URL: https://github.com/apache/lucene/pull/107#discussion_r622242751



##
File path: lucene/core/src/java/org/apache/lucene/codecs/CodecUtil.java
##
@@ -640,6 +640,33 @@ static void writeCRC(IndexOutput output) throws 
IOException {
   throw new IllegalStateException(
   "Illegal CRC-32 checksum: " + value + " (resource=" + output + ")");
 }
-output.writeLong(value);
+writeLong(output, value);
+  }
+
+  /** write int value on header / footer */
+  public static void writeInt(DataOutput out, int i) throws IOException {

Review comment:
   I renamed the methods to writeBE / readBE to make it more explicit.
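
   For illustration, a sketch of what a big-endian helper on top of a little-endian
   DataOutput could boil down to (an assumption about the general approach, not the
   actual patch, which may write the bytes individually):
   ```java
   import java.io.IOException;
   import org.apache.lucene.store.DataOutput;

   final class BigEndianHeaderSketch {
     // Keep header/footer values big-endian on top of a little-endian DataOutput
     // by reversing the byte order before delegating.
     static void writeBEInt(DataOutput out, int i) throws IOException {
       out.writeInt(Integer.reverseBytes(i));
     }

     static void writeBELong(DataOutput out, long l) throws IOException {
       out.writeLong(Long.reverseBytes(l));
     }
   }
   ```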







[GitHub] [lucene] iverase commented on a change in pull request #107: LUCENE-9047: Move the Directory APIs to be little endian (take 2)

2021-04-28 Thread GitBox


iverase commented on a change in pull request #107:
URL: https://github.com/apache/lucene/pull/107#discussion_r622243643



##
File path: 
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/packed/LegacyDirectWriter.java
##
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.backward_codecs.packed;
+
+import java.io.EOFException;
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.util.packed.PackedInts;
+
+/**
+ * Class for writing packed integers to be directly read from Directory. 
Integers can be read
+ * on-the-fly via {@link LegacyDirectReader}.
+ *
+ * Unlike PackedInts, it optimizes for read i/o operations and supports 
> 2B values. Example
+ * usage:
+ *
+ * 

Review comment:
   This is a copy/paste from DirectWriter, so I would prefer to change it in a follow-up PR.







[GitHub] [lucene] iverase commented on a change in pull request #107: LUCENE-9047: Move the Directory APIs to be little endian (take 2)

2021-04-28 Thread GitBox


iverase commented on a change in pull request #107:
URL: https://github.com/apache/lucene/pull/107#discussion_r622244082



##
File path: 
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/packed/LegacyDirectReader.java
##
@@ -0,0 +1,368 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.backward_codecs.packed;
+
+import java.io.IOException;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.LongValues;
+
+/**
+ * Retrieves an instance previously written by {@link LegacyDirectWriter}
+ *
+ * Example usage:
+ *
+ * 
+ *   int bitsPerValue = 100;
+ *   IndexInput in = dir.openInput("packed", IOContext.DEFAULT);
+ *   LongValues values = 
LegacyDirectReader.getInstance(in.randomAccessSlice(start, end), bitsPerValue);
+ *   for (int i = 0; i < numValues; i++) {
+ * long value = values.get(i);
+ *   }
+ * 
+ *
+ * @see LegacyDirectWriter
+ */
+public class LegacyDirectReader {
+
+  private LegacyDirectReader() {
+// no instances
+  }
+
+  /**
+   * Retrieves an instance from the specified slice written decoding {@code 
bitsPerValue} for each
+   * value
+   */
+  public static LongValues getInstance(RandomAccessInput slice, int 
bitsPerValue) {
+return getInstance(slice, bitsPerValue, 0);
+  }
+
+  /**
+   * Retrieves an instance from the specified {@code offset} of the given 
slice decoding {@code
+   * bitsPerValue} for each value
+   */
+  public static LongValues getInstance(RandomAccessInput slice, int 
bitsPerValue, long offset) {
+switch (bitsPerValue) {
+  case 1:
+return new DirectPackedReader1(slice, offset);
+  case 2:
+return new DirectPackedReader2(slice, offset);
+  case 4:
+return new DirectPackedReader4(slice, offset);
+  case 8:
+return new DirectPackedReader8(slice, offset);
+  case 12:
+return new DirectPackedReader12(slice, offset);
+  case 16:
+return new DirectPackedReader16(slice, offset);
+  case 20:
+return new DirectPackedReader20(slice, offset);
+  case 24:
+return new DirectPackedReader24(slice, offset);
+  case 28:
+return new DirectPackedReader28(slice, offset);
+  case 32:
+return new DirectPackedReader32(slice, offset);
+  case 40:
+return new DirectPackedReader40(slice, offset);
+  case 48:
+return new DirectPackedReader48(slice, offset);
+  case 56:
+return new DirectPackedReader56(slice, offset);
+  case 64:
+return new DirectPackedReader64(slice, offset);
+  default:
+throw new IllegalArgumentException("unsupported bitsPerValue: " + 
bitsPerValue);
+}
+  }
+
+  static final class DirectPackedReader1 extends LongValues {
+final RandomAccessInput in;
+final long offset;
+
+DirectPackedReader1(RandomAccessInput in, long offset) {
+  this.in = in;
+  this.offset = offset;
+}
+
+@Override
+public long get(long index) {
+  try {
+int shift = 7 - (int) (index & 7);
+return (in.readByte(offset + (index >>> 3)) >>> shift) & 0x1;
+  } catch (IOException e) {
+throw new RuntimeException(e);
+  }
+}
+  }
+
+  static final class DirectPackedReader2 extends LongValues {
+final RandomAccessInput in;
+final long offset;
+
+DirectPackedReader2(RandomAccessInput in, long offset) {
+  this.in = in;
+  this.offset = offset;
+}
+
+@Override
+public long get(long index) {
+  try {
+int shift = (3 - (int) (index & 3)) << 1;
+return (in.readByte(offset + (index >>> 2)) >>> shift) & 0x3;
+  } catch (IOException e) {
+throw new RuntimeException(e);
+  }
+}
+  }
+
+  static final class DirectPackedReader4 extends LongValues {
+final RandomAccessInput in;
+final long offset;
+
+DirectPackedReader4(RandomAccessInput in, long offset) {
+  this.in = in;
+  this.offset = offset;
+}
+
+@Override
+public long get(long index) {
+  try {
+int shift = (int) ((index + 1) & 1) << 2;
+return (in.readByte(offset + (index >>> 1)) >>> shift) & 0xF;
+  } catch (IOException e) {
+  

[GitHub] [lucene] iverase commented on pull request #107: LUCENE-9047: Move the Directory APIs to be little endian (take 2)

2021-04-28 Thread GitBox


iverase commented on pull request #107:
URL: https://github.com/apache/lucene/pull/107#issuecomment-828513235


   I will set the PR to ready for review, as feedback has been positive so far. I want to stress that the most interesting part is how to deal with reading `segment.gen`. This file does not belong to a codec, so we open it blind, without knowing the endianness. Therefore the approach I have taken is to keep writing that file in big-endian order, as we have been doing until now.
   
   In addition, since these files have a codec header / footer, all headers / footers will be written using BE order. 
   
   Maybe @rmuir and @mikemccand have an opinion here.





[jira] [Created] (LUCENE-9943) DOC: Fix spelling(camelCase it like GitHub )

2021-04-28 Thread AYUSHMAN SINGH CHAUHAN (Jira)
AYUSHMAN SINGH CHAUHAN created LUCENE-9943:
--

 Summary:  DOC: Fix spelling(camelCase it like GitHub )
 Key: LUCENE-9943
 URL: https://issues.apache.org/jira/browse/LUCENE-9943
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/other
Affects Versions: 8.8.1
Reporter: AYUSHMAN SINGH CHAUHAN


docs update => spelling: github






[GitHub] [lucene] ayushman17 opened a new pull request #111: LUCENE-9943 DOC: Fix spelling (camelCase it like GitHub)

2021-04-28 Thread GitBox


ayushman17 opened a new pull request #111:
URL: https://github.com/apache/lucene/pull/111


   
   
   
   # Description
   
   Please provide a short description of the changes you're making with this 
pull request.
   
   - [x] Docs have been updated 
   
   # Solution
   
   Please provide a short description of the approach taken to implement your 
solution.
   
   docs update => spelling: github
   
   # Tests
   
   Please describe the tests you've developed or run to confirm this patch 
implements the feature or solves the problem.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   
   ### Other information:
   Signed-off-by: Ayushman Singh Chauhan 





[GitHub] [lucene] gus-asf merged pull request #111: LUCENE-9943 DOC: Fix spelling (camelCase it like GitHub)

2021-04-28 Thread GitBox


gus-asf merged pull request #111:
URL: https://github.com/apache/lucene/pull/111


   





[jira] [Resolved] (LUCENE-9943) DOC: Fix spelling(camelCase it like GitHub )

2021-04-28 Thread Gus Heck (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gus Heck resolved LUCENE-9943.
--
Fix Version/s: 9.0
   Resolution: Fixed

Thanks  :)

>  DOC: Fix spelling(camelCase it like GitHub )
> -
>
> Key: LUCENE-9943
> URL: https://issues.apache.org/jira/browse/LUCENE-9943
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/other
>Affects Versions: 8.8.1
>Reporter: AYUSHMAN SINGH CHAUHAN
>Priority: Minor
>  Labels: documentation
> Fix For: 9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> docs update => spelling: github






[GitHub] [lucene] rmuir commented on pull request #107: LUCENE-9047: Move the Directory APIs to be little endian (take 2)

2021-04-28 Thread GitBox


rmuir commented on pull request #107:
URL: https://github.com/apache/lucene/pull/107#issuecomment-828654863


   > I will set the PR to ready for review, as feedback has been positive so far. I want to stress that the most interesting part is how to deal with reading `segment.gen`. This file does not belong to a codec, so we open it blind, without knowing the endianness. Therefore the approach I have taken is to keep writing that file in big-endian order, as we have been doing until now.
   
   Agreed, let's try to land the current patch! I think it's fine that `segments_N` and codec headers just stay big-endian, as the code for this is nicely self-contained.





[GitHub] [lucene] ayushman17 commented on pull request #111: LUCENE-9943 DOC: Fix spelling (camelCase it like GitHub)

2021-04-28 Thread GitBox


ayushman17 commented on pull request #111:
URL: https://github.com/apache/lucene/pull/111#issuecomment-828658851


   Thanks!
   





[GitHub] [lucene] neoremind edited a comment on pull request #91: LUCENE-9932: Performance improvement for BKD index building

2021-04-28 Thread GitBox


neoremind edited a comment on pull request #91:
URL: https://github.com/apache/lucene/pull/91#issuecomment-827678981


   I spent some time running a real-world benchmark. The `IndexWriter` speedup is what we expected: faster than the main branch, with total elapsed time (including adding docs, building the index, and merging) decreased by about 20%. If we only consider `flush_time`, the speedup is even more obvious; the time cost drops by about 40%-50%.
   
   1) Run [IndexAndSearchOpenStreetMaps1D.java](https://github.com/neoremind/luceneutil/blob/master/src/main/perf/IndexAndSearchOpenStreetMaps1D.java) against the two branches and record the [log](https://github.com/neoremind/luceneutil/tree/master/log/OpenStreetMaps).
   _Note: the query stage is commented out, and some of the code was modified to adapt to the latest Lucene main branch._
   
   main branch:
   ```
   $ egrep "flush time|sec to build index" open-street-maps.log
   DWPT 0 [2021-04-27T11:33:04.518908Z; main]: flush time 17284.537739 msec
   DWPT 0 [2021-04-27T11:33:37.888449Z; main]: flush time 12039.476885 msec
   72.49147722 sec to build index
   ```
   PR branch:
   ```
   $ egrep "flush time|sec to build index" open-street-maps-optimized.log
   DWPT 0 [2021-04-28T18:06:57.931560Z; main]: flush time 9483.778536 msec
   DWPT 0 [2021-04-28T18:07:26.493593Z; main]: flush time 8145.392875 msec
   59.176608435 sec to build index
   ```
   
   2) Furthermore, I came up with the idea of using TPC-H LINEITEM to verify. I have a 10GB TPC-H dataset and developed a new test case that imports the first 5 INT fields, which is more typical of real-world data.
   
   Run [IndexAndSearchTpcHLineItem.java](https://github.com/neoremind/luceneutil/blob/master/src/main/perf/IndexAndSearchTpcHLineItem.java) against the two branches and record the [log](https://github.com/neoremind/luceneutil/tree/master/log/TPC-H-LINEITEM).
   
   main branch:
   ```
   $ egrep "flush time|sec to build index" tpch-lineitem.log
   DWPT 0 [2021-04-27T11:17:25.329006Z; main]: flush time 13850.23328 msec
   DWPT 0 [2021-04-27T11:17:50.289370Z; main]: flush time 12228.723665 msec
   DWPT 0 [2021-04-27T11:18:15.546002Z; main]: flush time 12537.085005 msec
   DWPT 0 [2021-04-27T11:18:40.140413Z; main]: flush time 11819.225223 msec
   DWPT 0 [2021-04-27T11:19:04.850989Z; main]: flush time 12004.380921 msec
   DWPT 0 [2021-04-27T11:19:29.435183Z; main]: flush time 11850.273453 msec
   DWPT 0 [2021-04-27T11:19:54.016951Z; main]: flush time 11882.316067 msec
   DWPT 0 [2021-04-27T11:20:18.932727Z; main]: flush time 12223.151464 msec
   DWPT 0 [2021-04-27T11:20:43.522117Z; main]: flush time 11871.276323 msec
   DWPT 0 [2021-04-27T11:20:52.060300Z; main]: flush time 3422.434221 msec
   271.188917715 sec to build index
   ```
   PR branch:
   ```
   $ egrep "flush time|sec to build index" tpch-lineitem-optimized.log
   DWPT 0 [2021-04-28T18:09:17.063371Z; main]: flush time 7547.521814 msec
   DWPT 0 [2021-04-28T18:09:36.070457Z; main]: flush time 7226.72845 msec
   DWPT 0 [2021-04-28T18:09:55.085997Z; main]: flush time 7275.426344 msec
   DWPT 0 [2021-04-28T18:10:13.928021Z; main]: flush time 7140.31387 msec
   DWPT 0 [2021-04-28T18:10:32.788150Z; main]: flush time 7173.103266 msec
   DWPT 0 [2021-04-28T18:10:51.830926Z; main]: flush time 7371.514576 msec
   DWPT 0 [2021-04-28T18:11:10.644303Z; main]: flush time 7132.407293 msec
   DWPT 0 [2021-04-28T18:11:29.586830Z; main]: flush time 7150.281669 msec
   DWPT 0 [2021-04-28T18:11:48.268161Z; main]: flush time 7009.686475 msec
   DWPT 0 [2021-04-28T18:11:55.172851Z; main]: flush time 2115.221804 msec
   213.240120432 sec to build index
   ```
   
   For the benchmark commands, please refer to [my document](https://github.com/neoremind/luceneutil/tree/master/command). 
   
   Test environment:
   ```
   CPU: 
   Architecture:  x86_64
   CPU op-mode(s):32-bit, 64-bit
   Byte Order:Little Endian
   CPU(s):32
   On-line CPU(s) list:   0-31
   Thread(s) per core:2
   Core(s) per socket:16
   Socket(s): 1
   NUMA node(s):  1
   Vendor ID: GenuineIntel
   CPU family:6
   Model: 85
   Model name:Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
   Stepping:  4
   CPU MHz:   2500.000
   BogoMIPS:  5000.00
   Hypervisor vendor: KVM
   Virtualization type:   full
   L1d cache: 32K
   L1i cache: 32K
   L2 cache:  1024K
   L3 cache:  33792K
   NUMA node0 CPU(s): 0-31
   
   Memory: 
   $cat /proc/meminfo
   MemTotal:   65703704 kB
   
   Disk: SATA 
   $fdisk -l | grep Disk
   Disk /dev/vdb: 35184.4 GB, 35184372088832 bytes, 68719476736 sectors
   
   OS: 
   Linux 4.19.57-15.1.al7.x86_64
   
   JDK:
   openjdk version "11.0.11" 2021-04-20 LTS
   OpenJDK Runtime Environment 18.9 (build 11.0.11+9-LTS)
   OpenJDK 64-Bit Server VM 18.9 (build 11.0.11+9-LTS, mixed mode, sharing)
  

[GitHub] [lucene-solr] gus-asf opened a new pull request #2485: LUCENE-9572 Backport from 9.0

2021-04-28 Thread GitBox


gus-asf opened a new pull request #2485:
URL: https://github.com/apache/lucene-solr/pull/2485


   





[GitHub] [lucene-solr] gus-asf merged pull request #2485: LUCENE-9572 Backport from 9.0

2021-04-28 Thread GitBox


gus-asf merged pull request #2485:
URL: https://github.com/apache/lucene-solr/pull/2485


   





[jira] [Commented] (LUCENE-9572) Allow TypeAsSynonymFilter to propagate selected flags and Ignore some types

2021-04-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335078#comment-17335078
 ] 

ASF subversion and git services commented on LUCENE-9572:
-

Commit d7e4be6fc621a95165eb32d1d6f02bea4f5010ef in lucene-solr's branch 
refs/heads/branch_8x from Gus Heck
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d7e4be6 ]

LUCENE-9572 Backport from 9.0 (#2485)



> Allow TypeAsSynonymFilter to propagate selected flags and Ignore some types
> ---
>
> Key: LUCENE-9572
> URL: https://issues.apache.org/jira/browse/LUCENE-9572
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis, modules/test-framework
>Reporter: Gus Heck
>Assignee: Gus Heck
>Priority: Major
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> (Breaking this off of SOLR-14597 for independent review)
> TypeAsSynonymFilter converts type attributes into synonyms. In some cases the 
> original token may already have had flags set on it, and it may be useful to 
> propagate some or all of those flags to the synonym we are generating. This 
> ticket provides that ability and allows the user to specify a bitmask that 
> selects which flags are retained.
> Additionally, there may be some set of types that should not be converted to 
> synonyms, and this change allows the user to specify a comma-separated list 
> of types to ignore (the most common case will be to ignore the common default 
> type of 'word', I suspect).
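
A small sketch of the two knobs described above (illustrative only; the names are
hypothetical, not the actual TypeAsSynonymFilter fields):

```java
import java.util.Set;

final class TypeAsSynonymSketch {
  // Only the bits selected by the user-supplied mask carry over to the generated synonym.
  static int propagatedFlags(int originalFlags, int flagsMask) {
    return originalFlags & flagsMask;
  }

  // Types on the ignore list (e.g. the common default type "word") produce no synonym.
  static boolean shouldEmitSynonym(String type, Set<String> ignoredTypes) {
    return !ignoredTypes.contains(type);
  }
}
```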






[jira] [Commented] (LUCENE-9572) Allow TypeAsSynonymFilter to propagate selected flags and Ignore some types

2021-04-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335086#comment-17335086
 ] 

ASF subversion and git services commented on LUCENE-9572:
-

Commit 142779fa92ab19f6e7d05e9a8901fec28a585d08 in lucene's branch 
refs/heads/LUCENE-9572.changes.txt from Gus Heck
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=142779f ]

LUCENE-9572 adjust changes entry


> Allow TypeAsSynonymFilter to propagate selected flags and Ignore some types
> ---
>
> Key: LUCENE-9572
> URL: https://issues.apache.org/jira/browse/LUCENE-9572
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis, modules/test-framework
>Reporter: Gus Heck
>Assignee: Gus Heck
>Priority: Major
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> (Breaking this off of SOLR-14597 for independent review)
> TypeAsSynonymFilter converts type attributes into synonyms. In some cases the 
> original token may already have had flags set on it, and it may be useful to 
> propagate some or all of those flags to the synonym we are generating. This 
> ticket provides that ability and allows the user to specify a bitmask that 
> selects which flags are retained.
> Additionally, there may be some set of types that should not be converted to 
> synonyms, and this change allows the user to specify a comma-separated list 
> of types to ignore (the most common case will be to ignore the common default 
> type of 'word', I suspect).






[GitHub] [lucene] gus-asf opened a new pull request #112: LUCENE-9572 adjust changes entry

2021-04-28 Thread GitBox


gus-asf opened a new pull request #112:
URL: https://github.com/apache/lucene/pull/112


   





[GitHub] [lucene] gautamworah96 commented on a change in pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification

2021-04-28 Thread GitBox


gautamworah96 commented on a change in pull request #108:
URL: https://github.com/apache/lucene/pull/108#discussion_r622691393



##
File path: gradle/verification-metadata.xml
##
@@ -0,0 +1,2198 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<verification-metadata xmlns="https://schema.gradle.org/dependency-verification" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://schema.gradle.org/dependency-verification https://schema.gradle.org/dependency-verification/dependency-verification-1.0.xsd">
+   <configuration>
+      <verify-metadata>true</verify-metadata>
+      <verify-signatures>false</verify-signatures>
+   </configuration>
+   <components>
+      [per-component checksum entries elided; the XML element markup was stripped in the archived message]
Review comment:
   I tried a few things that did not work.
   In general, the docs say:
   > In order to provide the strongest security level possible, dependency verification is enabled globally
   > For this purpose, Gradle provides an API which allows disabling dependency verification on some specific configurations.
   
   Which is the opposite of what we want, i.e., to enable it for just one configuration?
   
   There is also a [way](https://docs.gradle.org/current/userguide/resolution_rules.html#sec:disabling_resolution_transitive_dependencies) to disable transitive resolution for a particular configuration, but that will need some more tinkering. Not sure how that works.
   
   







[GitHub] [lucene] gautamworah96 commented on a change in pull request #108: LUCENE-9897 Change dependency checking mechanism to use gradle checksum verification

2021-04-28 Thread GitBox


gautamworah96 commented on a change in pull request #108:
URL: https://github.com/apache/lucene/pull/108#discussion_r622691393



##
File path: gradle/verification-metadata.xml
##
@@ -0,0 +1,2198 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<verification-metadata xmlns="https://schema.gradle.org/dependency-verification" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://schema.gradle.org/dependency-verification https://schema.gradle.org/dependency-verification/dependency-verification-1.0.xsd">
+   <configuration>
+      <verify-metadata>true</verify-metadata>
+      <verify-signatures>false</verify-signatures>
+   </configuration>
+   <components>
+      [per-component checksum entries elided; the XML element markup was stripped in the archived message]
Review comment:
   I tried a few things that did not work.
   In general, the docs say:
   > In order to provide the strongest security level possible, dependency verification is enabled globally
   > For this purpose, Gradle provides an API which allows disabling dependency verification on some specific configurations.
   
   Which is the opposite of what we want, i.e., to enable it for just one configuration?
   
   There is also a [way](https://docs.gradle.org/current/userguide/resolution_rules.html#sec:disabling_resolution_transitive_dependencies) to disable transitive resolution for a particular configuration, but that too did not work. It needs some more tinkering...
   
   







[GitHub] [lucene] zacharymorn opened a new pull request #113: LUCENE-9335: [Discussion Only] Implement BMM with BulkScorer interface

2021-04-28 Thread GitBox


zacharymorn opened a new pull request #113:
URL: https://github.com/apache/lucene/pull/113


   Implement the BMM algorithm from "Optimizing Top-k Document Retrieval Strategies for Block-Max Indexes" by Dimopoulos, Nepomnyachiy and Suel, using the BulkScorer interface.
   
   This BMM implementation passes all existing tests run by `./gradlew check`, as well as the luceneutil benchmark.
   
   
   
   
   
   





[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-04-28 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335125#comment-17335125
 ] 

Zach Chen commented on LUCENE-9335:
---

I've implemented the above strategy and opened a new PR for it: 
[https://github.com/apache/lucene/pull/113]. I was using a _BulkScorer_ on top 
of a collection of _Scorers_, though, instead of a _BulkScorer_ on top of a 
collection of _BulkScorers_ like _BooleanScorer_, and I hope the difference is 
due to the algorithms rather than me misunderstanding the intended usage 
of the BulkScorer interface :D. The result from the benchmark util still shows it's 
slower than _WANDScorer_ for 2-clause queries, especially for the OrHighHigh task.

During the implementation of this BulkScorer I also realized there were some 
issues with the other PR I published earlier, so I'll fix them next and see if 
that gives us a better result.

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.






[GitHub] [lucene] gus-asf merged pull request #112: LUCENE-9572 adjust changes entry

2021-04-28 Thread GitBox


gus-asf merged pull request #112:
URL: https://github.com/apache/lucene/pull/112


   





[jira] [Commented] (LUCENE-9572) Allow TypeAsSynonymFilter to propagate selected flags and Ignore some types

2021-04-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335140#comment-17335140
 ] 

ASF subversion and git services commented on LUCENE-9572:
-

Commit 043ed3a91f74246fbc2e4a1a8fea38cb61d7d68a in lucene's branch 
refs/heads/main from Gus Heck
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=043ed3a ]

LUCENE-9572 adjust changes entry (#112)



> Allow TypeAsSynonymFilter to propagate selected flags and Ignore some types
> ---
>
> Key: LUCENE-9572
> URL: https://issues.apache.org/jira/browse/LUCENE-9572
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis, modules/test-framework
>Reporter: Gus Heck
>Assignee: Gus Heck
>Priority: Major
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> (Breaking this off of SOLR-14597 for independent review)
> TypeAsSynonymFilter converts type attributes into synonyms. In some cases the 
> original token may already have had flags set on it, and it may be useful to 
> propagate some or all of those flags to the synonym we are generating. This 
> ticket provides that ability and allows the user to specify a bitmask that 
> selects which flags are retained.
> Additionally, there may be some set of types that should not be converted to 
> synonyms, and this change allows the user to specify a comma-separated list 
> of types to ignore (the most common case will be to ignore the common default 
> type of 'word', I suspect).


