[jira] [Commented] (LUCENE-8982) Make NativeUnixDirectory pure java now that direct IO is possible

2020-11-06 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227746#comment-17227746
 ] 

Zach Chen commented on LUCENE-8982:
---

[~mdrob] It does seem *ant build-native-unix* is no longer available, and 
running *./gradlew helpAnt* doesn't help either. I'm not super familiar with 
cpp development, but to facilitate this I've done some research and, after some 
trial and error, come up with a separate PR to build the cpp code (decoupling 
the two changes might also help with benchmarking): 
[https://github.com/apache/lucene-solr/pull/2068]. Could you please take a 
look and let me know if this is a good direction to take?

Please note that in the PR, I created a new native module to host 
*NativePosixUtil.cpp*, as there's a compatibility issue between the cpp-library 
and java-library Gradle plugins. For details, please see 
[https://github.com/gradle/gradle-native/issues/352#issuecomment-461724948].

> Make NativeUnixDirectory pure java now that direct IO is possible
> -
>
> Key: LUCENE-8982
> URL: https://issues.apache.org/jira/browse/LUCENE-8982
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/misc
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> {{NativeUnixDirectory}} is a {{Directory}} implementation that uses direct IO 
> to write newly merged segments.  Direct IO bypasses the kernel's buffer cache 
> and write cache, making merge writes "invisible" to the kernel, though the 
> reads for merging the N segments are still going through the kernel.
> But today, {{NativeUnixDirectory}} uses a small JNI wrapper to access the 
> {{O_DIRECT}} flag to {{open}} ... since JDK9 we can now pass that flag in 
> pure java code, so we should now fix {{NativeUnixDirectory}} to not use JNI 
> anymore.
> We should also run some more realistic benchmarks seeing if this option 
> really helps nodes that are doing concurrent indexing (merging) and searching.
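
A minimal sketch of the pure-java approach the issue describes, assuming JDK 10+ (where {{com.sun.nio.file.ExtendedOpenOption.DIRECT}} exists) and an OS/file system that supports direct IO; the file name and sizes below are illustrative, not taken from the patch:
{code:java}
import com.sun.nio.file.ExtendedOpenOption;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class DirectIOWriteSketch {
  public static void main(String[] args) throws Exception {
    Path path = Paths.get("segment.tmp"); // illustrative output file
    // Direct IO requires the buffer address, file position and transfer size
    // to be aligned to the file store's block size.
    int blockSize = Math.toIntExact(
        Files.getFileStore(path.toAbsolutePath().getParent()).getBlockSize());
    try (FileChannel ch = FileChannel.open(path,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE,
        ExtendedOpenOption.DIRECT)) { // bypasses the OS page cache, no JNI needed
      ByteBuffer buf = ByteBuffer.allocateDirect(blockSize * 2).alignedSlice(blockSize);
      buf.limit(blockSize); // write exactly one block-aligned chunk
      ch.write(buf);
    }
  }
}
{code}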






[GitHub] [lucene-solr] zacharymorn opened a new pull request #2068: LUCENE-8982: Separate out native code to another module to allow cpp build with gradle

2020-11-06 Thread GitBox


zacharymorn opened a new pull request #2068:
URL: https://github.com/apache/lucene-solr/pull/2068


   # Description
   When working on https://issues.apache.org/jira/browse/LUCENE-8982, we 
realized that there may no longer be a way to build the native cpp code due to 
the migration away from ant. This patch separates out the native code into 
another module to allow building the cpp code with gradle.
   
   # Solution
   Create a new lucene module for cpp code compilation.
   
   # Tests
   Pending
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   






[GitHub] [lucene-solr] zacharymorn commented on a change in pull request #2052: LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory

2020-11-06 Thread GitBox


zacharymorn commented on a change in pull request #2052:
URL: https://github.com/apache/lucene-solr/pull/2052#discussion_r519135013



##
File path: lucene/misc/src/java/org/apache/lucene/store/DirectIODirectory.java
##
@@ -66,45 +66,32 @@
  *
  * @lucene.experimental
  */
-public class NativeUnixDirectory extends FSDirectory {
+public class DirectIODirectory extends FSDirectory {
 
   // TODO: this is OS dependent, but likely 512 is the LCD
   private final static long ALIGN = 512;
   private final static long ALIGN_NOT_MASK = ~(ALIGN-1);
-  
-  /** Default buffer size before writing to disk (256 KB);
-   *  larger means less IO load but more RAM and direct
-   *  buffer storage space consumed during merging. */
-
-  public final static int DEFAULT_MERGE_BUFFER_SIZE = 262144;
 
   /** Default min expected merge size before direct IO is
*  used (10 MB): */
   public final static long DEFAULT_MIN_BYTES_DIRECT = 10*1024*1024;
 
-  private final int mergeBufferSize;
   private final long minBytesDirect;
   private final Directory delegate;
 
   /** Create a new NIOFSDirectory for the named location.
* 
* @param path the path of the directory
-   * @param lockFactory to use
-   * @param mergeBufferSize Size of buffer to use for
-   *merging.  See {@link #DEFAULT_MERGE_BUFFER_SIZE}.
* @param minBytesDirect Merges, or files to be opened for
*   reading, smaller than this will
*   not use direct IO.  See {@link
*   #DEFAULT_MIN_BYTES_DIRECT}
+   * @param lockFactory to use
* @param delegate fallback Directory for non-merges
* @throws IOException If there is a low-level I/O error
*/
-  public NativeUnixDirectory(Path path, int mergeBufferSize, long 
minBytesDirect, LockFactory lockFactory, Directory delegate) throws IOException 
{
+  public DirectIODirectory(Path path, long minBytesDirect, LockFactory 
lockFactory, Directory delegate) throws IOException {
 super(path, lockFactory);
-if ((mergeBufferSize & ALIGN) != 0) {
-  throw new IllegalArgumentException("mergeBufferSize must be 0 mod " + 
ALIGN + " (got: " + mergeBufferSize + ")");
-}
-this.mergeBufferSize = mergeBufferSize;

Review comment:
   This section was removed since `mergeBufferSize` is no longer used in the 
constructors for `DirectIODirectory$DirectIOIndexInput` and 
`DirectIODirectory$DirectIOIndexOutput` (it was previously passed in to fill 
the `bufferSize` constructor parameter), as that is now replaced by the dynamic 
block size allocation:
   
   ```
   blockSize = Math.toIntExact(Files.getFileStore(path).getBlockSize());
   bufferSize = Math.addExact(blockSize, blockSize - 1);
   ```
   
   I assume the new bufferSize, being roughly 2x the dynamic blockSize, will 
still provide that buffering?








[GitHub] [lucene-solr] zacharymorn commented on a change in pull request #2052: LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory

2020-11-06 Thread GitBox


zacharymorn commented on a change in pull request #2052:
URL: https://github.com/apache/lucene-solr/pull/2052#discussion_r519134822



##
File path: lucene/CHANGES.txt
##
@@ -156,6 +156,9 @@ Improvements
 * LUCENE-9531: Consolidated CharStream and FastCharStream classes: these have 
been moved
   from each query parser package to org.apache.lucene.queryparser.charstream 
(Dawid Weiss).
 
+* LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO 
flag,

Review comment:
   Yes, according to https://bugs.openjdk.java.net/browse/JDK-8189192, 
`com.sun.nio.file.ExtendedOpenOption.DIRECT` was added in JDK 10, so this is a 
Lucene 9.0-only feature.








[GitHub] [lucene-solr] zacharymorn commented on a change in pull request #2052: LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory

2020-11-06 Thread GitBox


zacharymorn commented on a change in pull request #2052:
URL: https://github.com/apache/lucene-solr/pull/2052#discussion_r519134971



##
File path: lucene/misc/src/java/org/apache/lucene/store/DirectIODirectory.java
##
@@ -164,15 +149,16 @@ public IndexOutput createOutput(String name, IOContext 
context) throws IOExcepti
 private long fileLength;
 private boolean isOpen;
 
-public NativeUnixIndexOutput(Path path, String name, int bufferSize) 
throws IOException {
-  super("NativeUnixIndexOutput(path=\"" + path.toString() + "\")", name);
-  //this.path = path;
-  final FileDescriptor fd = NativePosixUtil.open_direct(path.toString(), 
false);
-  fos = new FileOutputStream(fd);
-  //fos = new FileOutputStream(path);
-  channel = fos.getChannel();
-  buffer = ByteBuffer.allocateDirect(bufferSize);
-  this.bufferSize = bufferSize;
+  @SuppressForbidden(reason = "com.sun.nio.file.ExtendedOpenOption: Direct I/O 
with FileChannel requires the use of internal proprietary API 
ExtendedOpenOption.DIRECT")
+  public DirectIOIndexOutput(Path path, String name) throws IOException {
+  super("DirectIOIndexOutput(path=\"" + path.toString() + "\")", name);
+
+  int blockSize = Math.toIntExact(Files.getFileStore(path).getBlockSize());
+  bufferSize = Math.addExact(blockSize, blockSize - 1);
+  buffer = ByteBuffer.allocateDirect(bufferSize).alignedSlice(blockSize);

Review comment:
   `ByteBuffer.alignedSlice` was added in JDK 9, which was released in 2017, 
so I guess the code just hasn't been updated to take advantage of the new API?
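
   A tiny illustration of what `alignedSlice` provides here (a sketch only, assuming JDK 9+ and a direct buffer; the block size is a typical value, whereas the PR derives it from `Files.getFileStore(path).getBlockSize()`):

   ```java
   import java.nio.ByteBuffer;

   class AlignedSliceDemo {
     public static void main(String[] args) {
       int blockSize = 4096; // illustrative; the PR queries the file store for this
       ByteBuffer raw = ByteBuffer.allocateDirect(blockSize + blockSize - 1);
       ByteBuffer aligned = raw.alignedSlice(blockSize);
       // The slice's start address is a multiple of blockSize, which is what
       // direct IO needs; alignmentOffset reports the remainder (0 here).
       System.out.println(aligned.alignmentOffset(0, blockSize)); // prints 0
     }
   }
   ```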








[GitHub] [lucene-solr] zacharymorn commented on pull request #2052: LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory

2020-11-06 Thread GitBox


zacharymorn commented on pull request #2052:
URL: https://github.com/apache/lucene-solr/pull/2052#issuecomment-723400516


   > Thank you @zacharymorn! It's so nice we can move this to pure java now -- 
this makes it much easier and lower risk for Lucene users to test.
   > 
   > I suspect for Lucene applications that do both indexing and searching on a 
single box (e.g. Elasticsearch), this might be a big improvement in preventing 
large merges from hurting concurrent queries.
   
   No problem @mikemccand ! I'm glad to be able to contribute as well. 
   
   One thing I noticed while working on this, also related to your comment: it 
seems Lucene core doesn't use this facility via any configuration at all (there 
is no reference to `NativeUnixDirectory`). So I'm wondering if it makes sense 
to have a follow-up task to add some kind of configuration (through the merge 
policy?) to use it as well once the changes stabilize?
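
   For reference, a hedged sketch of what such wiring could look like, based on the constructor shown in this PR's diff (argument order and defaults may still change before merge; the index path is illustrative):

   ```java
   import java.nio.file.Path;
   import java.nio.file.Paths;
   import org.apache.lucene.store.DirectIODirectory;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.FSDirectory;
   import org.apache.lucene.store.FSLockFactory;

   public class DirectIODirectorySketch {
     public static void main(String[] args) throws Exception {
       Path path = Paths.get("/path/to/index"); // illustrative location
       Directory delegate = FSDirectory.open(path); // regular IO for small files and reads
       Directory dir = new DirectIODirectory(path,
           DirectIODirectory.DEFAULT_MIN_BYTES_DIRECT, // below this size, fall back to the delegate
           FSLockFactory.getDefault(), delegate);
       // pass `dir` to an IndexWriter so that large merge writes bypass the page cache
     }
   }
   ```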






[jira] [Comment Edited] (SOLR-14977) Container plugins need a way to be configured

2020-11-06 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225848#comment-17225848
 ] 

Noble Paul edited comment on SOLR-14977 at 11/7/20, 5:08 AM:
-

Let's have a simple interface as follows (change the name as you wish).

This will help us do a type-safe initialization of any plugin. 
{code:java}
public interface ConfigurablePlugin<T> {
  public void initConfig(T initVals);
}
{code}
For instance, the {{PlacementPluginConfig}} can have only the following 
attributes:
{code:java}
{
  "myfirstString": "a text value",
  "aLong": 50,
  "aDoubleConfig": 3.1415928,
  "shouldIStay": true
}
{code}
It does not have to care about the attributes in {{PluginMeta}} if it does not 
need them.

BTW, I would prefer not to have something ugly like 
[this|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cluster/placement/impl/PlacementPluginConfigImpl.java].

Let's have something as follows:
{code:java}
public class PlacementPluginConfig {

  @JsonProperty
  public String myfirstString;

  @JsonProperty
  public Long aLong;

  @JsonProperty
  public Double aDoubleConfig;

  @JsonProperty
  public Boolean shouldIStay;
}
{code}
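For example, a plugin could then consume this config type-safely (an illustrative sketch; the plugin class name is hypothetical):
{code:java}
public class MyPlacementPlugin implements ConfigurablePlugin<PlacementPluginConfig> {

  private PlacementPluginConfig config;

  @Override
  public void initConfig(PlacementPluginConfig initVals) {
    this.config = initVals; // no Map<String, String> parsing or casting needed
  }
}
{code}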


was (Author: noble.paul):
Let's have a simple interface as follows (change the name as you wish).

This will help us do a type-safe initialization of any plugin. 
{code:java}
public interface SimplePlugin<T> {
  public void initPlugin(T initVals);
}
{code}
for instance the {{PlacementPluginConfig}} can have only the following 
attributes
{code:java}
{
"myfirstString": "a text value",
"aLong": 50,
"aDoubleConfig": 3.1415928,
"shouldIStay": true}
{code}
It does not have to care about the attributes in {{PluginMeta}} if it does not 
need it

BTW I would prefer not to have something ugly like 
[this|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cluster/placement/impl/PlacementPluginConfigImpl.java]

 Lets have something as follows
{code:java}
public class PlacementPluginConfig {

@JsonProperty
 public String myfirstString;
@JsonProperty
 public Long aLong;
@JsonProperty
 public Double aDoubleConfig;
@JsonProperty
 public Boolean shouldIStay;
}
{code}

> Container plugins need a way to be configured
> -
>
> Key: SOLR-14977
> URL: https://issues.apache.org/jira/browse/SOLR-14977
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Plugin system
>Reporter: Andrzej Bialecki
>Priority: Major
> Attachments: SOLR-14977.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Container plugins are defined in {{/clusterprops.json:/plugin}} using a 
> simple {{PluginMeta}} bean. This is sufficient for implementations that don't 
> need any configuration except for the {{pathPrefix}} but insufficient for 
> anything else that needs more configuration parameters.
> An example would be a {{CollectionsRepairEventListener}} plugin proposed in 
> PR-1962, which needs parameters such as the list of collections, {{waitFor}}, 
> maximum operations allowed, etc. to properly function.
> This issue proposes to extend the {{PluginMeta}} bean to allow a {{Map}} of 
> configuration parameters.
> There is an interface that we could potentially use ({{MapInitializedPlugin}}), 
> but it works only with {{String}} values. This is not optimal because it 
> requires additional type-safety validation from the consumers. The existing 
> {{PluginInfo}} / {{PluginInfoInitialized}} interface is too complex for this 
> purpose.






[jira] [Commented] (SOLR-9956) Solr java.lang.ArrayIndexOutOfBoundsException when indexed large amount of documents

2020-11-06 Thread sajit vasudevan (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227727#comment-17227727
 ] 

sajit vasudevan commented on SOLR-9956:
---

We are encountering a similar issue with large documents and metadata on Solr 
6.6.5. Was this issue ever fixed?

> Solr java.lang.ArrayIndexOutOfBoundsException when indexed large amount of 
> documents
> 
>
> Key: SOLR-9956
> URL: https://issues.apache.org/jira/browse/SOLR-9956
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 6.2.1, 6.3
> Environment: Ubuntu 14.04.4 LTS
>Reporter: Zhu JiaJun
>Priority: Critical
>  Labels: query, solr, stats
> Attachments: solr.log
>
>
> I'm using solr 6.3.0. I indexed a large amount of documents into one solr 
> collection with one shard; it's 60G on disk, with around 2506889 
> documents. 
> I frequently get the ArrayIndexOutOfBoundsException when I send a simple 
> stats request, for example:
> http://localhost:8983/solr/staging-update/select?start=0&rows=0&version=2.2&q=*:*&stats=true&timeAllowed=6&wt=json&stats.field=asp_community_facet&stats.field=asp_group_facet
> The solr log captures the following exception, which also appears in the 
> response like below:
> {code}
> {
> "responseHeader": {
> "zkConnected": true,
> "status": 500,
> "QTime": 11,
> "params": {
> "q": "*:*",
> "stats": "true",
> "start": "0",
> "timeAllowed": "6",
> "rows": "0",
> "version": "2.2",
> "wt": "json",
> "stats.field": [
> "asp_community_facet",
> "asp_group_facet"
> ]
> }
> },
> "response": {
> "numFound": 2506211,
> "start": 0,
> "docs": [ ]
> },
> "error": {
> "msg": "28",
> "trace": "java.lang.ArrayIndexOutOfBoundsException: 28\n\tat 
> org.apache.solr.request.DocValuesStats.accumMulti(DocValuesStats.java:213)\n\tat
>  
> org.apache.solr.request.DocValuesStats.getCounts(DocValuesStats.java:135)\n\tat
>  
> org.apache.solr.handler.component.StatsField.computeLocalStatsValues(StatsField.java:424)\n\tat
>  
> org.apache.solr.handler.component.StatsComponent.process(StatsComponent.java:58)\n\tat
>  
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295)\n\tat
>  
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:153)\n\tat
>  org.apache.solr.core.SolrCore.execute(SolrCore.java:2213)\n\tat 
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)\n\tat 
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)\n\tat 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
>  
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
>  org.eclipse.jetty.server.Server.handle(Server.java:518)\n\tat 
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)\n\tat 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)\n\tat
>  
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\tat
>  org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\tat 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat
>  
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)\n\ta

[jira] [Created] (SOLR-14988) CoreContainer.getCores should return transient cores that are loaded

2020-11-06 Thread David Smiley (Jira)
David Smiley created SOLR-14988:
---

 Summary: CoreContainer.getCores should return transient 
cores that are loaded
 Key: SOLR-14988
 URL: https://issues.apache.org/jira/browse/SOLR-14988
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: David Smiley


CoreContainer.getCores() returns loaded SolrCores _except_ the transient ones.  
The callers of this method all assume it's all loaded cores, however.  I think 
getCores should be renamed getLoadedCores for clarity as to _which_ cores are 
being returned.






[jira] [Commented] (SOLR-14975) Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames

2020-11-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227654#comment-17227654
 ] 

David Smiley commented on SOLR-14975:
-

Queries are blocked because a SolrCore needs to be looked up for dispatch 
(ultimately, {{SolrCores.getCoreFromAnyList}}), which grabs modifyLock, possibly 
waiting on someone else using/abusing getAllCoreNames while holding the lock.

Using getCoreDescriptor to detect the presence of an existing core is fine here; 
it's a total superset from what I see.
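
To make the alternative concrete, a hedged sketch of the caller-side change proposed in the issue quoted below ({{coreContainer}} is an existing CoreContainer, and the core name is illustrative):
{code:java}
// Before: materializes (and sorts) every core name under the lock just to test membership.
boolean exists = coreContainer.getAllCoreNames().contains("myCore");

// After: a cheap descriptor lookup is a superset check and avoids the expensive copy.
boolean existsCheap = coreContainer.getCoreDescriptor("myCore") != null;
{code}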

> Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames 
> --
>
> Key: SOLR-14975
> URL: https://issues.apache.org/jira/browse/SOLR-14975
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The methods CoreContainer.getAllCoreNames and getLoadedCoreNames hold a lock 
> while they grab core names to put into a TreeSet.  When there are *many* 
> cores, this delay is noticeable.  Holding this lock effectively blocks 
> queries since queries lookup a core; so it's critically important that these 
> methods are *fast*.  The tragedy here is that some callers merely want to 
> know if a particular name is in the set, or what the aggregated size is.  
> Some callers want to iterate the names but don't really care what the 
> iteration order is.
> I propose that some callers of these two methods find suitable alternatives, 
> like getCoreDescriptor to check for null.  And I propose that these methods 
> return a HashSet -- no order.  If the caller wants it sorted, it can do so 
> itself.






[jira] [Updated] (SOLR-14987) SolrStream ends up creating a new HttpSolrClient for every replica being queried instead of reusing for the same node

2020-11-06 Thread Timothy Potter (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Potter updated SOLR-14987:
--
Status: Patch Available  (was: Open)

> SolrStream ends up creating a new HttpSolrClient for every replica being 
> queried instead of reusing for the same node
> -
>
> Key: SOLR-14987
> URL: https://issues.apache.org/jira/browse/SOLR-14987
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: streaming expressions
>Reporter: Timothy Potter
>Assignee: Timothy Potter
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I was looking into some streaming expression performance issues when many 
> collections with many shards are being queried, and I found that SolrStream's 
> open method creates a new {{HttpSolrClient}} for every replica being queried. 
> For instance, in my test case, I have 10 collections with 100 shards each 
> (rf=1) and I get 1000 HttpSolrClient instances in my SolrClientCache. If I 
> reuse one HttpSolrClient per node hosting a replica (so 10 in my case), the 
> query time for my expression drops by half (not to mention the reduced 
> allocation load on the JVM).






[jira] [Updated] (LUCENE-9378) Configurable compression for BinaryDocValues

2020-11-06 Thread David Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-9378:
-
Priority: Major  (was: Minor)

We normally don't pay much attention to the "Priority" (Major/Minor, etc.) JIRA 
issue field, but IMO this is definitely not a "Minor" matter, so I changed it 
to our default of Major.

> Configurable compression for BinaryDocValues
> 
>
> Key: LUCENE-9378
> URL: https://issues.apache.org/jira/browse/LUCENE-9378
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Viral Gandhi
>Priority: Major
> Attachments: hotspots-v76x.png, hotspots-v76x.png, hotspots-v76x.png, 
> hotspots-v76x.png, hotspots-v76x.png, hotspots-v77x.png, hotspots-v77x.png, 
> hotspots-v77x.png, hotspots-v77x.png, image-2020-06-12-22-17-30-339.png, 
> image-2020-06-12-22-17-53-961.png, image-2020-06-12-22-18-24-527.png, 
> image-2020-06-12-22-18-48-919.png, snapshot-v77x.nps, snapshot-v77x.nps, 
> snapshot-v77x.nps, snapshots-v76x.nps, snapshots-v76x.nps, snapshots-v76x.nps
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Lucene 8.5.1 includes a change to always [compress 
> BinaryDocValues|https://issues.apache.org/jira/browse/LUCENE-9211]. This 
> caused a ~30% reduction in our red-line QPS (throughput). 
> We think users should be given some way to opt in to this compression 
> feature instead of it always being enabled, which can have a substantial 
> query-time cost, as we saw during our upgrade. [~mikemccand] suggested one 
> possible approach: introducing a *mode* in Lucene80DocValuesFormat (COMPRESSED 
> and UNCOMPRESSED) and allowing users to create a custom Codec subclassing the 
> default Codec and pick the format they want.
> The idea is similar to Lucene50StoredFieldsFormat, which has two modes, 
> Mode.BEST_SPEED and Mode.BEST_COMPRESSION.
> Here's a related issue for adding a benchmark covering BINARY doc values 
> query-time performance: [https://github.com/mikemccand/luceneutil/issues/61]
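
As a rough illustration of the opt-in idea described above (a sketch only: the UNCOMPRESSED doc values mode does not exist yet, and the codec/format class names are simply the current 8.x ones):
{code:java}
// Hypothetical custom codec: keep the default codec's behaviour but swap in a
// doc values format configured for the (not yet existing) uncompressed mode.
public class UncompressedBinaryDVCodec extends FilterCodec {

  private final DocValuesFormat dvFormat =
      new Lucene80DocValuesFormat(/* hypothetical Mode.UNCOMPRESSED */);

  public UncompressedBinaryDVCodec() {
    super("UncompressedBinaryDVCodec", new Lucene87Codec());
  }

  @Override
  public DocValuesFormat docValuesFormat() {
    return dvFormat;
  }
}
{code}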






[jira] [Commented] (SOLR-14942) Reduce leader election time on node shutdown

2020-11-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227639#comment-17227639
 ] 

David Smiley commented on SOLR-14942:
-

I am seeing this just now... and after looking at the instanceof check in 
HttpSolrCall to see if the handler is an UpdateRequestHandler... *wow*, that's 
hack-ish alright _(we agree, as seen in comments above)_.  Hoss maybe suggested 
going far -- fully generalizing the notion to any request handler.  That notion 
was shot down.  Okay, but couldn't we move this instanceof check down deeper 
where it's less hack-ish than way up in HttpSolrCall?  Like, couldn't 
UpdateRequestHandler in particular do this check?  Then we don't have a cast at 
all.

> Reduce leader election time on node shutdown
> 
>
> Key: SOLR-14942
> URL: https://issues.apache.org/jira/browse/SOLR-14942
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.7.3, 8.6.3
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>Priority: Major
> Fix For: master (9.0), 8.8
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The credit for this issue and investigation belongs to [~caomanhdat]. I am 
> merely reporting the issue and creating PRs based on his work.
> The shutdown process waits for all replicas/cores to be closed before 
> removing the election node of the leader. This can take some time due to 
> index flush or merge activities on the leader cores and delays new leaders 
> from being elected.
> This process happens at CoreContainer.shutdown():
> # zkController.preClose(): remove current node from live_node and change 
> states of all cores in this node to DOWN state. Assuming that the current 
> node hosting a leader of a shard, the shard becomes leaderless after calling 
> this method, since the state of the leader is DOWN now. The leader election 
> process is not triggered for the shard since the election node is still 
> on-hold by the current node.
> # Waiting for all cores to be loaded (if there are any).
> # SolrCores.close(): close all cores.
> # zkController.close(): this is where all ephemeral nodes are removed from ZK 
> which include election nodes created by this node. Therefore other replicas 
> in the shard can take part in the leader election from now.
> Note that CoreContainer.shutdown() is invoked when Jetty/Solr nodes receive 
> SIGTERM signal. 
> On receiving SIGTERM, Jetty will also stop accepting new connections and new 
> requests. This is a very important factor, since even if the leader replica 
> is ACTIVE and its node in live_nodes, the shard will be considered as 
> leaderless if no-one can index to that shard. Therefore shards become 
> leaderless as soon as the node (which contains shard’s leader) receives 
> SIGTERM.
> Therefore the longer time step 1, 2 and 3 needed to finish, the longer shards 
> remain leaderless. The time needed for step 3 scales with the number of cores 
> so the more cores a node has, the worse. This time is spent in 
> IndexWriter.close() where the system will 
> # Flush all pending updates to disk
> # Waiting for all merge finish (this most likely is the meaty part)
> The shutdown process is proposed to changed to:
> # Wait for all in-flight indexing requests and replication requests to 
> complete
> # Remove election nodes
> # Close all replicas/cores
> This ensures that index flush or merges do not block new leader elections 
> anymore.






[GitHub] [lucene-solr] thelabdude opened a new pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

2020-11-06 Thread GitBox


thelabdude opened a new pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067


   # Description
   
   For collections with many shards (or aliases with many collections and some 
shards), `CloudSolrStream` will end up creating a new `HttpSolrClient` for 
every `SolrStream` it opens because the cache key is the full core URL, such 
as: `http://127.0.0.1:63460/solr/collection4_shard4_replica_n6/`
   
   In addition, `CloudSolrStream#getSlices` was calling 
`clusterState.getCollectionsMap()` which pre-emptively loads all 
`LazyCollectionRef` from ZK unnecessarily. This could cause an issue with 
clusters with many collections and slow down the streaming expression execution.
   
   # Solution
   
   In this PR, I've introduced a new ctor in `SolrStream` to pass the `core`, 
which is then used to determine the URL of the node hosting the core to be 
queried when the stream is opened. This leads to reusing the same 
`HttpSolrClient` for the same node because the cache key is now 
`http://127.0.0.1:63460/solr/`. I chose this new ctor approach because 
`CloudSolrStream` is not the only consumer of SolrStream, and it knows how the 
list of URLs was constructed from cluster state, so it can safely make the 
decision about passing the core and reusing clients. We may also be able to 
check whether `distrib=false` is in the params passed to the ctor, but then 
we'd still have to parse out the core name, which `CloudSolrStream` can do with 
confidence but I'm not so sure about the other paths in the code.
   
   When the request is sent to the remote core, we need to add the core name 
back to the path (since we stripped it from the baseUrl used by the 
HttpSolrClient). This happens in `SolrStream#constructParser`. This method is 
public and takes a SolrClient (even though SolrStream already has an 
HttpSolrClient created in the open method); hence, I've had to be a little more 
paranoid around checking the type of SolrClient `server instanceof 
HttpSolrClient` when determining if the core needs to be added to the path. 
Perhaps we can change this method's signature in Solr 9?
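
   To make the cache-key change concrete, a small self-contained sketch of the idea (the URL, core name, and class name below are illustrative):

   ```java
   public class NodeBaseUrlSketch {
     public static void main(String[] args) {
       // A replica's core URL and its core name, as CloudSolrStream knows them.
       String coreUrl = "http://127.0.0.1:63460/solr/collection4_shard4_replica_n6/";
       String coreName = "collection4_shard4_replica_n6";
       // Strip the core name (and its trailing slash) to recover the node base URL.
       String baseUrl = coreUrl.substring(0, coreUrl.length() - coreName.length() - 1);
       System.out.println(baseUrl); // http://127.0.0.1:63460/solr/
       // Using baseUrl (instead of coreUrl) as the SolrClientCache key means every
       // SolrStream talking to this node reuses one HttpSolrClient; the core name
       // is added back onto the request path when the stream issues its query.
     }
   }
   ```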
   
   # Tests
   
   Tests are still under construction ... I have a manual test that creates 10 
collections with 100 shards; with this fix, a simple query streaming 
expression's execution time is cut by at least half (500 ms vs. >1000 ms) just 
for the `openStreams` operation.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [ ] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `master` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   






[GitHub] [lucene-solr] msokolov commented on a change in pull request #2022: LUCENE-9004: KNN vector search using NSW graphs

2020-11-06 Thread GitBox


msokolov commented on a change in pull request #2022:
URL: https://github.com/apache/lucene-solr/pull/2022#discussion_r519007187



##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/BoundsChecker.java
##
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+abstract class BoundsChecker {
+
+float bound;
+
+/**
+ * Update the bound if sample is better
+ */
+abstract void update(float sample);
+
+/**
+ * Return whether the sample exceeds (is worse than) the bound
+ */
+abstract boolean check(float sample);
+
+static BoundsChecker create(boolean reversed) {
+if (reversed) {
+return new Min();
+} else {
+return new Max();
+}
+}
+
+static class Max extends BoundsChecker {
+Max() {
+bound = -Float.MAX_VALUE;

Review comment:
   Well, `-Float.MAX_VALUE` is reliable since the sign bit is separate from 
all the rest of the bits, but `Float.NEGATIVE_INFINITY` is less than that, so 
if we use it then any finite float will be a valid score, so +1 for that, and 
for +infinity as well








[GitHub] [lucene-solr] msokolov commented on a change in pull request #2022: LUCENE-9004: KNN vector search using NSW graphs

2020-11-06 Thread GitBox


msokolov commented on a change in pull request #2022:
URL: https://github.com/apache/lucene-solr/pull/2022#discussion_r519006572



##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Random;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.lucene.index.KnnGraphValues;
+import org.apache.lucene.index.VectorValues;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+import static org.apache.lucene.util.VectorUtil.dotProduct;
+import static org.apache.lucene.util.VectorUtil.squareDistance;
+
+/**
+ * Navigable Small-world graph. Provides efficient approximate nearest neighbor
+ * search for high dimensional vectors.  See <a href="https://doi.org/10.1016/j.is.2013.10.006">Approximate nearest
+ * neighbor algorithm based on navigable small world graphs</a> [2014] and <a href="https://arxiv.org/abs/1603.09320">this paper</a> [2018] for details.
+ *
+ * This implementation is actually more like the one in the same authors' 
earlier 2014 paper in that
+ * there is no hierarchy (just one layer), and no fanout restriction on the 
graph: nodes are allowed to accumulate
+ * an unbounded number of outbound links, but it does incorporate some of the 
innovations of the later paper, like
+ * using a priority queue to perform a beam search while traversing the graph. 
The nomenclature is a bit different
+ * here from what's used in those papers:
+ *
+ * Hyperparameters
+ * 
+ *   numSeed is the equivalent of m in the 2012 
paper; it controls the number of random entry points to sample.
+ *   beamWidth in {@link HnswGraphBuilder} has the same 
meaning as efConst in the 2016 paper. It is the number of
+ *   nearest neighbor candidates to track while searching the graph for each 
newly inserted node.
+ *   maxConn has the same meaning as M in the 
later paper; it controls how many of the efConst neighbors are
+ *   connected to the new node
+ *   fanout the fanout parameter of {@link 
VectorValues#search(float[], int, int)}
+ *   is used to control the values of numSeed and 
topK that are passed to this API.
+ *   Thus fanout is like a combination of ef (search 
beam width) from the 2016 paper and m from the 2014 paper.
+ *   
+ * 
+ *
+ * Note: The graph may be searched by multiple threads concurrently, but 
updates are not thread-safe. Also note: there is no notion of
+ * deletions. Document searching built on top of this must do its own 
deletion-filtering.
+ */
+public final class HnswGraph {
+
+  // each entry lists the neighbors of a node, in node order
+  private final List> graph;
+
+  HnswGraph() {
+graph = new ArrayList<>();
+graph.add(new ArrayList<>());
+  }
+
+  /**
+   * Searches for the nearest neighbors of a query vector.
+   * @param query search query vector
+   * @param topK the number of nodes to be returned
+   * @param numSeed the number of random entry points to sample
+   * @param vectors vector values
+   * @param graphValues the graph values. May represent the entire graph, or a 
level in a hierarchical graph.
+   * @param random a source of randomness, used for generating entry points to 
the graph
+   * @return a priority queue holding the neighbors found
+   */
+  public static Neighbors search(float[] query, int topK, int numSeed, 
VectorValues.RandomAccess vectors, KnnGraphValues graphValues,
+ Random random) throws IOException {
+VectorValues.SearchStrategy searchStrategy = vectors.searchStrategy();
+boolean scoreReversed = isReversed(searchStrategy);
+TreeSet candidates;
+if (scoreReversed) {
+  candidates = new TreeSet<>(Comparator.reverseOrder());
+} else {
+  candidates = new TreeSet<>();
+}
+int size = vectors.size();
+for (int i = 0; i < numSeed && i < size; i++) {
+  int entryPoint = random.nextInt(size);
+  candidates.add(new Neighbor(entryPoint, compare(query, 
vectors.vectorValue(entryPoint), searchStrategy)));
+   

[GitHub] [lucene-solr] msokolov commented on a change in pull request #2022: LUCENE-9004: KNN vector search using NSW graphs

2020-11-06 Thread GitBox


msokolov commented on a change in pull request #2022:
URL: https://github.com/apache/lucene-solr/pull/2022#discussion_r519004731



##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Random;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.lucene.index.KnnGraphValues;
+import org.apache.lucene.index.VectorValues;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+import static org.apache.lucene.util.VectorUtil.dotProduct;
+import static org.apache.lucene.util.VectorUtil.squareDistance;
+
+/**
+ * Navigable Small-world graph. Provides efficient approximate nearest neighbor
+ * search for high dimensional vectors.  See <a href="https://doi.org/10.1016/j.is.2013.10.006">Approximate nearest
+ * neighbor algorithm based on navigable small world graphs</a> [2014] and <a href="https://arxiv.org/abs/1603.09320">this paper</a> [2018] for details.
+ *
+ * This implementation is actually more like the one in the same authors' 
earlier 2014 paper in that
+ * there is no hierarchy (just one layer), and no fanout restriction on the 
graph: nodes are allowed to accumulate
+ * an unbounded number of outbound links, but it does incorporate some of the 
innovations of the later paper, like
+ * using a priority queue to perform a beam search while traversing the graph. 
The nomenclature is a bit different
+ * here from what's used in those papers:
+ *
+ * Hyperparameters
+ * 
+ *   numSeed is the equivalent of m in the 2012 
paper; it controls the number of random entry points to sample.
+ *   beamWidth in {@link HnswGraphBuilder} has the same 
meaning as efConst in the 2016 paper. It is the number of
+ *   nearest neighbor candidates to track while searching the graph for each 
newly inserted node.
+ *   maxConn has the same meaning as M in the 
later paper; it controls how many of the efConst neighbors are
+ *   connected to the new node
+ *   fanout the fanout parameter of {@link 
VectorValues#search(float[], int, int)}
+ *   is used to control the values of numSeed and 
topK that are passed to this API.
+ *   Thus fanout is like a combination of ef (search 
beam width) from the 2016 paper and m from the 2014 paper.
+ *   
+ * 
+ *
+ * Note: The graph may be searched by multiple threads concurrently, but 
updates are not thread-safe. Also note: there is no notion of
+ * deletions. Document searching built on top of this must do its own 
deletion-filtering.
+ */
+public final class HnswGraph {
+
+  // each entry lists the neighbors of a node, in node order
+  private final List> graph;
+
+  HnswGraph() {

Review comment:
   Right, we do not instantiate this class at search time, although we do 
use the static `search` method, which only relies on the abstract vector and 
graph values. Heap usage is quite high! It will scale with graph fanout * 
number of vectors. For each node, we store a PriorityQueue of Neighbor objects; 
each Neighbor has an int node id and a float score. I think for 1M vectors, 
with fanout of 64, we'll see around 1.5G heap usage, with lots of tiny objects 
on the heap during flush/merge. This can be dramatically reduced by eliminating 
the Neighbor objects in favor of parallel arrays, and the priority queues can 
then store int ordinals rather than object pointers, but I think that can be a 
fast follow?
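
   A hedged back-of-envelope for the ~1.5G figure above (typical HotSpot object-size estimates, not measurements):

   ```java
   public class HnswHeapEstimate {
     public static void main(String[] args) {
       long nodes = 1_000_000L;   // vectors in the graph
       long fanout = 64;          // neighbors kept per node
       // Rough per-Neighbor cost: ~16B object header + int node + float score, padded.
       long perNeighbor = 16 + 4 + 4;
       double gib = nodes * fanout * perNeighbor / (1024.0 * 1024 * 1024);
       System.out.printf("~%.2f GiB of Neighbor objects (queue overhead not included)%n", gib);
     }
   }
   ```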








[GitHub] [lucene-solr] msokolov commented on a change in pull request #2022: LUCENE-9004: KNN vector search using NSW graphs

2020-11-06 Thread GitBox


msokolov commented on a change in pull request #2022:
URL: https://github.com/apache/lucene-solr/pull/2022#discussion_r519001123



##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Random;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.lucene.index.KnnGraphValues;
+import org.apache.lucene.index.VectorValues;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+import static org.apache.lucene.util.VectorUtil.dotProduct;
+import static org.apache.lucene.util.VectorUtil.squareDistance;
+
+/**
+ * Navigable Small-world graph. Provides efficient approximate nearest neighbor
+ * search for high dimensional vectors.  See <a href="https://doi.org/10.1016/j.is.2013.10.006">Approximate nearest
+ * neighbor algorithm based on navigable small world graphs</a> [2014] and <a href="https://arxiv.org/abs/1603.09320">this paper</a> [2018] for details.
+ *
+ * This implementation is actually more like the one in the same authors' 
earlier 2014 paper in that
+ * there is no hierarchy (just one layer), and no fanout restriction on the 
graph: nodes are allowed to accumulate
+ * an unbounded number of outbound links, but it does incorporate some of the 
innovations of the later paper, like
+ * using a priority queue to perform a beam search while traversing the graph. 
The nomenclature is a bit different
+ * here from what's used in those papers:
+ *
+ * Hyperparameters
+ * 
+ *   numSeed is the equivalent of m in the 2012 
paper; it controls the number of random entry points to sample.
+ *   beamWidth in {@link HnswGraphBuilder} has the same 
meaning as efConst in the 2016 paper. It is the number of
+ *   nearest neighbor candidates to track while searching the graph for each 
newly inserted node.
+ *   maxConn has the same meaning as M in the 
later paper; it controls how many of the efConst neighbors are
+ *   connected to the new node
+ *   fanout the fanout parameter of {@link 
VectorValues#search(float[], int, int)}
+ *   is used to control the values of numSeed and 
topK that are passed to this API.
+ *   Thus fanout is like a combination of ef (search 
beam width) from the 2016 paper and m from the 2014 paper.
+ *   
+ * 
+ *
+ * Note: The graph may be searched by multiple threads concurrently, but 
updates are not thread-safe. Also note: there is no notion of
+ * deletions. Document searching built on top of this must do its own 
deletion-filtering.
+ */
+public final class HnswGraph {
+
+  // each entry lists the neighbors of a node, in node order
+  private final List> graph;
+
+  HnswGraph() {
+graph = new ArrayList<>();
+graph.add(new ArrayList<>());
+  }
+
+  /**
+   * Searches for the nearest neighbors of a query vector.
+   * @param query search query vector
+   * @param topK the number of nodes to be returned
+   * @param numSeed the number of random entry points to sample
+   * @param vectors vector values
+   * @param graphValues the graph values. May represent the entire graph, or a 
level in a hierarchical graph.
+   * @param random a source of randomness, used for generating entry points to 
the graph
+   * @return a priority queue holding the neighbors found
+   */
+  public static Neighbors search(float[] query, int topK, int numSeed, 
VectorValues.RandomAccess vectors, KnnGraphValues graphValues,
+ Random random) throws IOException {
+VectorValues.SearchStrategy searchStrategy = vectors.searchStrategy();
+boolean scoreReversed = isReversed(searchStrategy);
+TreeSet candidates;
+if (scoreReversed) {
+  candidates = new TreeSet<>(Comparator.reverseOrder());
+} else {
+  candidates = new TreeSet<>();
+}
+int size = vectors.size();
+for (int i = 0; i < numSeed && i < size; i++) {
+  int entryPoint = random.nextInt(size);
+  candidates.add(new Neighbor(entryPoint, compare(query, 
vectors.vectorValue(entryPoint), searchStrategy)));
+   

[GitHub] [lucene-solr] msokolov commented on a change in pull request #2022: LUCENE-9004: KNN vector search using NSW graphs

2020-11-06 Thread GitBox


msokolov commented on a change in pull request #2022:
URL: https://github.com/apache/lucene-solr/pull/2022#discussion_r519001030



##
File path: 
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java
##
@@ -0,0 +1,186 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Random;
+
+import org.apache.lucene.index.KnnGraphValues;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * Builder for HNSW graph. See {@link HnswGraph} for a gloss on the algorithm 
and the meaning of the hyperparameters.
+ */
+public final class HnswGraphBuilder {
+
+  // default random seed for level generation
+  private static final long DEFAULT_RAND_SEED = System.currentTimeMillis();
+
+  // expose for testing.
+  public static long randSeed = DEFAULT_RAND_SEED;

Review comment:
   yes!








[GitHub] [lucene-solr] msokolov commented on a change in pull request #2022: LUCENE-9004: KNN vector search using NSW graphs

2020-11-06 Thread GitBox


msokolov commented on a change in pull request #2022:
URL: https://github.com/apache/lucene-solr/pull/2022#discussion_r519000323



##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Random;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.lucene.index.KnnGraphValues;
+import org.apache.lucene.index.VectorValues;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+import static org.apache.lucene.util.VectorUtil.dotProduct;
+import static org.apache.lucene.util.VectorUtil.squareDistance;
+
+/**
+ * Navigable Small-world graph. Provides efficient approximate nearest neighbor
+ * search for high dimensional vectors.  See <a href="https://doi.org/10.1016/j.is.2013.10.006">Approximate nearest
+ * neighbor algorithm based on navigable small world graphs [2014]</a> and <a href="https://arxiv.org/abs/1603.09320">this paper [2018]</a> for details.
+ *
+ * This implementation is actually more like the one in the same authors' 
earlier 2014 paper in that
+ * there is no hierarchy (just one layer), and no fanout restriction on the 
graph: nodes are allowed to accumulate
+ * an unbounded number of outbound links, but it does incorporate some of the 
innovations of the later paper, like
+ * using a priority queue to perform a beam search while traversing the graph. 
The nomenclature is a bit different
+ * here from what's used in those papers:
+ *
+ * Hyperparameters
+ * 
+ *   numSeed is the equivalent of m in the 2012 
paper; it controls the number of random entry points to sample.
+ *   beamWidth in {@link HnswGraphBuilder} has the same 
meaning as efConst in the 2016 paper. It is the number of
+ *   nearest neighbor candidates to track while searching the graph for each 
newly inserted node.
+ *   maxConn has the same meaning as M in the 
later paper; it controls how many of the efConst neighbors are
+ *   connected to the new node
+ *   fanout the fanout parameter of {@link 
VectorValues#search(float[], int, int)}
+ *   is used to control the values of numSeed and 
topK that are passed to this API.
+ *   Thus fanout is like a combination of ef (search 
beam width) from the 2016 paper and m from the 2014 paper.
+ *   
+ * 
+ *
+ * Note: The graph may be searched by multiple threads concurrently, but 
updates are not thread-safe. Also note: there is no notion of
+ * deletions. Document searching built on top of this must do its own 
deletion-filtering.
+ */
+public final class HnswGraph {
+
+  // each entry lists the neighbors of a node, in node order
+  private final List> graph;
+
+  HnswGraph() {
+graph = new ArrayList<>();
+graph.add(new ArrayList<>());
+  }
+
+  /**
+   * Searches for the nearest neighbors of a query vector.

Review comment:
   No, this method is intended for use both when indexing and when 
searching. I guess this method could go in some other class (KnnGraphValues??) 
since it doesn't rely on Hnsw in any direct way (it only needs an abstract 
source of vectors and graph values). But it does use Neighbor/Neighbors and is 
heavily used *by* HnswGraphBuilder, so it seems natural to have it here.
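
   For illustration, a minimal sketch (not part of the PR) of the two call sites, using the static search signature from the diff above; the fanout-to-numSeed mapping, the package placement, and the builder's use of beamWidth are assumptions:

```java
package org.apache.lucene.util.hnsw; // assumed, so the package-private Neighbors type is visible

import java.io.IOException;
import java.util.Random;

import org.apache.lucene.index.KnnGraphValues;
import org.apache.lucene.index.VectorValues;

/** Sketch only: the same static search routine serves the query path and the indexing path. */
class HnswSearchUsageSketch {

  // Query time: find the nearest candidates for a user-supplied query vector.
  static Neighbors queryTime(float[] queryVector, int topK, int fanout,
                             VectorValues.RandomAccess vectors, KnnGraphValues graph,
                             Random random) throws IOException {
    // fanout widens the candidate pool (assumption, per the javadoc in the diff above)
    return HnswGraph.search(queryVector, topK, topK + fanout, vectors, graph, random);
  }

  // Index time: HnswGraphBuilder looks up candidate neighbors for a newly added node
  // with the same routine, then links the best maxConn of them.
  static Neighbors indexTime(int newNode, int beamWidth,
                             VectorValues.RandomAccess vectors, KnnGraphValues graph,
                             Random random) throws IOException {
    return HnswGraph.search(vectors.vectorValue(newNode), beamWidth, beamWidth, vectors, graph, random);
  }
}
```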





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] msokolov commented on a change in pull request #2022: LUCENE-9004: KNN vector search using NSW graphs

2020-11-06 Thread GitBox


msokolov commented on a change in pull request #2022:
URL: https://github.com/apache/lucene-solr/pull/2022#discussion_r518999576



##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Random;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.lucene.index.KnnGraphValues;
+import org.apache.lucene.index.VectorValues;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+import static org.apache.lucene.util.VectorUtil.dotProduct;
+import static org.apache.lucene.util.VectorUtil.squareDistance;
+
+/**
+ * Navigable Small-world graph. Provides efficient approximate nearest neighbor
+ * search for high dimensional vectors.  See <a href="https://doi.org/10.1016/j.is.2013.10.006">Approximate nearest
+ * neighbor algorithm based on navigable small world graphs [2014]</a> and <a href="https://arxiv.org/abs/1603.09320">this paper [2018]</a> for details.
+ *
+ * This implementation is actually more like the one in the same authors' 
earlier 2014 paper in that
+ * there is no hierarchy (just one layer), and no fanout restriction on the 
graph: nodes are allowed to accumulate
+ * an unbounded number of outbound links, but it does incorporate some of the 
innovations of the later paper, like
+ * using a priority queue to perform a beam search while traversing the graph. 
The nomenclature is a bit different
+ * here from what's used in those papers:
+ *
+ * Hyperparameters
+ * 
+ *   numSeed is the equivalent of m in the 2012 
paper; it controls the number of random entry points to sample.
+ *   beamWidth in {@link HnswGraphBuilder} has the same 
meaning as efConst in the 2016 paper. It is the number of
+ *   nearest neighbor candidates to track while searching the graph for each 
newly inserted node.
+ *   maxConn has the same meaning as M in the 
later paper; it controls how many of the efConst neighbors are
+ *   connected to the new node
+ *   fanout the fanout parameter of {@link 
VectorValues#search(float[], int, int)}
+ *   is used to control the values of numSeed and 
topK that are passed to this API.
+ *   Thus fanout is like a combination of ef (search 
beam width) from the 2016 paper and m from the 2014 paper.
+ *   
+ * 
+ *
+ * Note: The graph may be searched by multiple threads concurrently, but 
updates are not thread-safe. Also note: there is no notion of
+ * deletions. Document searching built on top of this must do its own 
deletion-filtering.
+ */
+public final class HnswGraph {
+
+  // each entry lists the neighbors of a node, in node order
+  private final List> graph;
+
+  HnswGraph() {
+graph = new ArrayList<>();
+graph.add(new ArrayList<>());
+  }
+
+  /**
+   * Searches for the nearest neighbors of a query vector.
+   * @param query search query vector
+   * @param topK the number of nodes to be returned
+   * @param numSeed the number of random entry points to sample
+   * @param vectors vector values
+   * @param graphValues the graph values. May represent the entire graph, or a 
level in a hierarchical graph.
+   * @param random a source of randomness, used for generating entry points to 
the graph
+   * @return a priority queue holding the neighbors found
+   */
+  public static Neighbors search(float[] query, int topK, int numSeed, 
VectorValues.RandomAccess vectors, KnnGraphValues graphValues,
+ Random random) throws IOException {
+VectorValues.SearchStrategy searchStrategy = vectors.searchStrategy();
+boolean scoreReversed = isReversed(searchStrategy);
+TreeSet candidates;
+if (scoreReversed) {
+  candidates = new TreeSet<>(Comparator.reverseOrder());
+} else {
+  candidates = new TreeSet<>();
+}
+int size = vectors.size();
+for (int i = 0; i < numSeed && i < size; i++) {
+  int entryPoint = random.nextInt(size);
+  candidates.add(new Neighbor(entryPoint, compare(query, 
vectors.vectorValue(entryPoint), searchStrategy)));
+   

[GitHub] [lucene-solr] msokolov commented on a change in pull request #2022: LUCENE-9004: KNN vector search using NSW graphs

2020-11-06 Thread GitBox


msokolov commented on a change in pull request #2022:
URL: https://github.com/apache/lucene-solr/pull/2022#discussion_r518997811



##
File path: lucene/core/src/java/org/apache/lucene/index/VectorValues.java
##
@@ -74,6 +74,18 @@ public BytesRef binaryValue() throws IOException {
 throw new UnsupportedOperationException();
   }
 
+  /**
+   * Return the k nearest neighbor documents as determined by comparison of 
their vector values
+   * for this field, to the given vector, by the field's search strategy. If 
the search strategy is
+   * reversed, lower values indicate nearer vectors, otherwise higher scores 
indicate nearer
+   * vectors. Unlike relevance scores, vector scores may be negative.
+   * @param target the vector-valued query
+   * @param k  the number of docs to return
+   * @param fanout control the accuracy/speed tradeoff - larger values give 
better recall at higher cost

Review comment:
   Yeah this probably needs to be field level. Different strokes for 
different collections of vectors. I'm not sure how to expose this, since the 
parameters will be different for different ANN implementations. Might be a good 
use case for generic IndexedField.attributes?

##
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Random;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.lucene.index.KnnGraphValues;
+import org.apache.lucene.index.VectorValues;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+import static org.apache.lucene.util.VectorUtil.dotProduct;
+import static org.apache.lucene.util.VectorUtil.squareDistance;
+
+/**
+ * Navigable Small-world graph. Provides efficient approximate nearest neighbor
+ * search for high dimensional vectors.  See <a href="https://doi.org/10.1016/j.is.2013.10.006">Approximate nearest
+ * neighbor algorithm based on navigable small world graphs [2014]</a> and <a href="https://arxiv.org/abs/1603.09320">this paper [2018]</a> for details.
+ *
+ * This implementation is actually more like the one in the same authors' 
earlier 2014 paper in that
+ * there is no hierarchy (just one layer), and no fanout restriction on the 
graph: nodes are allowed to accumulate
+ * an unbounded number of outbound links, but it does incorporate some of the 
innovations of the later paper, like
+ * using a priority queue to perform a beam search while traversing the graph. 
The nomenclature is a bit different
+ * here from what's used in those papers:
+ *
+ * Hyperparameters
+ * 
+ *   numSeed is the equivalent of m in the 2012 
paper; it controls the number of random entry points to sample.
+ *   beamWidth in {@link HnswGraphBuilder} has the same 
meaning as efConst in the 2016 paper. It is the number of
+ *   nearest neighbor candidates to track while searching the graph for each 
newly inserted node.
+ *   maxConn has the same meaning as M in the 
later paper; it controls how many of the efConst neighbors are
+ *   connected to the new node
+ *   fanout the fanout parameter of {@link 
VectorValues#search(float[], int, int)}
+ *   is used to control the values of numSeed and 
topK that are passed to this API.
+ *   Thus fanout is like a combination of ef (search 
beam width) from the 2016 paper and m from the 2014 paper.
+ *   
+ * 
+ *
+ * Note: The graph may be searched by multiple threads concurrently, but 
updates are not thread-safe. Also note: there is no notion of
+ * deletions. Document searching built on top of this must do its own 
deletion-filtering.
+ */
+public final class HnswGraph {
+
+  // each entry lists the neighbors of a node, in node order

Review comment:
   Yes, although this class is agnostic as to the interpretation of these 
nodes. They could just as well be docIds, or social security numbers. But yeah, 
the only usage is as you describe, so I'll add a note to the javadoc.






[GitHub] [lucene-solr] dsmiley commented on a change in pull request #2066: SOLR-14975: Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames.

2020-11-06 Thread GitBox


dsmiley commented on a change in pull request #2066:
URL: https://github.com/apache/lucene-solr/pull/2066#discussion_r518937449



##
File path: solr/core/src/java/org/apache/solr/cloud/ReplicateFromLeader.java
##
@@ -69,7 +71,10 @@ public void startReplication(boolean switchTransactionLog) {
 if (cc.isShutDown()) {
   return;
 } else {
-  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, 
"SolrCore not found:" + coreName + " in " + cc.getLoadedCoreNames());
+  Collection loadedCoreNames = cc.getLoadedCoreNames();
+  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, 
"SolrCore not found:" + coreName + " in "

Review comment:
   Nice!

##
File path: solr/core/src/java/org/apache/solr/cloud/ReplicateFromLeader.java
##
@@ -69,7 +71,10 @@ public void startReplication(boolean switchTransactionLog) {
 if (cc.isShutDown()) {
   return;
 } else {
-  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, 
"SolrCore not found:" + coreName + " in " + cc.getLoadedCoreNames());
+  Collection loadedCoreNames = cc.getLoadedCoreNames();
+  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, 
"SolrCore not found:" + coreName + " in "

Review comment:
   If there are fewer than 20, list them sorted?  Since this error/log 
construction logic exists in two places, maybe put this in one spot.
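
   A minimal sketch of what a shared helper might look like (hypothetical names, not part of the patch; the cap of 20 comes from the suggestion above):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

/** Sketch only: one place to build the "SolrCore not found" message, sorted and capped. */
final class CoreNotFoundMessage {

  private static final int MAX_LISTED = 20; // hypothetical cap

  static String build(String coreName, Collection<String> loadedCoreNames) {
    if (loadedCoreNames.size() > MAX_LISTED) {
      // too many to list usefully; just report the count
      return "SolrCore not found: " + coreName + " (" + loadedCoreNames.size() + " cores loaded)";
    }
    List<String> sorted = new ArrayList<>(loadedCoreNames);
    Collections.sort(sorted);
    return "SolrCore not found: " + coreName + " in " + sorted;
  }
}
```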

##
File path: solr/core/src/java/org/apache/solr/core/SolrCores.java
##
@@ -198,35 +188,91 @@ protected SolrCore putCore(CoreDescriptor cd, SolrCore 
core) {
* 
* Put another way, this will not return any names of cores that are lazily 
loaded but have not been called for yet
* or are transient and either not loaded or have been swapped out.
-   * 
-   * @return List of currently loaded cores.
+   *
+   * @return A unsorted collection.
*/
-  Set getLoadedCoreNames() {
-Set set;
-
+  Collection getLoadedCoreNames() {
 synchronized (modifyLock) {
-  set = new TreeSet<>(cores.keySet());
-  if (getTransientCacheHandler() != null) {
-set.addAll(getTransientCacheHandler().getLoadedCoreNames());
-  }
+  TransientSolrCoreCache transientCoreCache = getTransientCacheHandler();
+  Set transientCoreNames = transientCoreCache == null ? 
Collections.emptySet() : transientCoreCache.getLoadedCoreNames();
+  return distinctSetsUnion(cores.keySet(), transientCoreNames);
 }
-return set;
   }
+
   /**
-   * Gets a list of all cores, loaded and unloaded 
+   * Gets a collection of all cores names, loaded and unloaded.
+   * For efficiency, prefer to check {@link #getCoreDescriptor(String)} != 
null instead of {@link #getAllCoreNames()}.contains(String)
*
-   * @return all cores names, whether loaded or unloaded, transient or 
permanent.
+   * @return A unsorted collection.
*/
   public Collection getAllCoreNames() {
-Set set;
 synchronized (modifyLock) {
-  set = new TreeSet<>(cores.keySet());
-  if (getTransientCacheHandler() != null) {
-set.addAll(getTransientCacheHandler().getAllCoreNames());
-  }
-  set.addAll(residentDesciptors.keySet());
+  TransientSolrCoreCache transientCoreCache = getTransientCacheHandler();
+  Set transientCoreNames = transientCoreCache == null ? 
Collections.emptySet() : transientCoreCache.getAllCoreNames();
+  return distinctSetsUnion(residentDescriptors.keySet(), 
transientCoreNames);
+}
+  }
+
+  /**
+   * Makes the union of two distinct sets.
+   */
+  private static  Collection distinctSetsUnion(Set set1, Set set2) 
{
+assert areSetsDistinct(set1, set2);

Review comment:
   A further comment: this guarantees to return a new copy, as it must, because 
the caller needs to return a thread-safe snapshot.  Before I recalled that, I 
was going to suggest returning set1 or set2 when the other one is empty, but we 
can't.
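
   For illustration, a sketch of the helper with that comment folded in (not the actual patch; names assumed from the diff above):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Set;

final class SetUnionSketch {

  /**
   * Union of two sets assumed to be disjoint. Always allocates a new collection,
   * even when one side is empty, because the caller must return a thread-safe
   * snapshot rather than a live view of an internal set.
   */
  static <T> Collection<T> distinctSetsUnion(Set<T> set1, Set<T> set2) {
    assert Collections.disjoint(set1, set2);
    Collection<T> union = new ArrayList<>(set1.size() + set2.size());
    union.addAll(set1);
    union.addAll(set2);
    return union;
  }
}
```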

##
File path: solr/core/src/java/org/apache/solr/core/SolrCores.java
##
@@ -198,35 +188,91 @@ protected SolrCore putCore(CoreDescriptor cd, SolrCore 
core) {
* 
* Put another way, this will not return any names of cores that are lazily 
loaded but have not been called for yet
* or are transient and either not loaded or have been swapped out.
-   * 
-   * @return List of currently loaded cores.
+   *
+   * @return A unsorted collection.
*/
-  Set getLoadedCoreNames() {
-Set set;
-
+  Collection getLoadedCoreNames() {
 synchronized (modifyLock) {
-  set = new TreeSet<>(cores.keySet());
-  if (getTransientCacheHandler() != null) {
-set.addAll(getTransientCacheHandler().getLoadedCoreNames());
-  }
+  TransientSolrCoreCache transientCoreCache = getTransientCacheHandler();

Review comment:
   There is always a TransientSolrCoreCache, even if it's not used.  Right? 
 CC @ErickErickson   
   So why all these null checks if I'm right?  

[jira] [Commented] (LUCENE-9537) Add Indri Search Engine Functionality to Lucene

2020-11-06 Thread Cameron VandenBerg (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227581#comment-17227581
 ] 

Cameron VandenBerg commented on LUCENE-9537:


Hi [~mikemccand] ,

Thank you for your feedback!  I like the idea of putting smoothingScore in the 
abstract Scorer base class, and I will see if I can make that work.  I will 
work on opening a PR and make sure I follow Lucene's coding style next week.

Thanks again!

> Add Indri Search Engine Functionality to Lucene
> ---
>
> Key: LUCENE-9537
> URL: https://issues.apache.org/jira/browse/LUCENE-9537
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Cameron VandenBerg
>Priority: Major
>  Labels: patch
> Attachments: LUCENE-9537.patch, LUCENE-INDRI.patch
>
>
> Indri ([http://lemurproject.org/indri.php]) is an academic search engine 
> developed by The University of Massachusetts and Carnegie Mellon University.  
> The major difference between Lucene and Indri is that Indri will give a 
> "smoothing score" to a document that does not contain the search 
> term, which has improved the search ranking accuracy in our experiments.  I 
> have created an Indri patch, which adds the search code needed to implement 
> the Indri AND logic as well as Indri's implementation of Dirichlet Smoothing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dsmiley commented on a change in pull request #2064: LUCENE-9600: Clean up package name conflicts between misc and core modules

2020-11-06 Thread GitBox


dsmiley commented on a change in pull request #2064:
URL: https://github.com/apache/lucene-solr/pull/2064#discussion_r518931251



##
File path: 
lucene/test-framework/src/java/org/apache/lucene/util/fst/FSTTester.java
##
@@ -95,7 +95,7 @@ private static BytesRef toBytesRef(IntsRef ir) {
 return br;
   }
 
-  static String getRandomString(Random random) {
+  public static String getRandomString(Random random) {

Review comment:
   If you're going to make this method public... I think even a tiny bit of 
documentation would be helpful.  Like maybe recommend callers consider the 
methods Dawid mentioned in RandomizedTesting _instead_.  After all, we're only 
making this public for some test in misc to call it.  We aren't signaling to 
all of Lucene, Solr, and any of the plugin writers, that we've got some awesome 
random string generator that everyone should use :-)
   
   FWIW I prefer that we maintain less code and use whatever's in 
RandomizedTesting when possible, but I won't stand in the way of your judgement 
here.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14975) Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames

2020-11-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227546#comment-17227546
 ] 

Bruno Roustant commented on SOLR-14975:
---

I added a PR.

I believe the descriptors are a superset of the loaded cores, and permanent 
cores are distinct from transient cores.

I simplified the logic to create the lists based on distinct sets, and I added 
assertions to verify the sets are indeed distinct. All tests passed.

> Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames 
> --
>
> Key: SOLR-14975
> URL: https://issues.apache.org/jira/browse/SOLR-14975
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The methods CoreContainer.getAllCoreNames and getLoadedCoreNames hold a lock 
> while they grab core names to put into a TreeSet.  When there are *many* 
> cores, this delay is noticeable.  Holding this lock effectively blocks 
> queries since queries lookup a core; so it's critically important that these 
> methods are *fast*.  The tragedy here is that some callers merely want to 
> know if a particular name is in the set, or what the aggregated size is.  
> Some callers want to iterate the names but don't really care what the 
> iteration order is.
> I propose that some callers of these two methods find suitable alternatives, 
> like getCoreDescriptor to check for null.  And I propose that these methods 
> return a HashSet -- no order.  If the caller wants it sorted, it can do so 
> itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-site] atris commented on pull request #33: Fix change log URL

2020-11-06 Thread GitBox


atris commented on pull request #33:
URL: https://github.com/apache/lucene-site/pull/33#issuecomment-723212628


   I think this is just a realignment -- I did it from the command line and it 
has the same changes. I manually verified that the overall structure is still 
the same.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-site] atris merged pull request #33: Fix change log URL

2020-11-06 Thread GitBox


atris merged pull request #33:
URL: https://github.com/apache/lucene-site/pull/33


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] bruno-roustant opened a new pull request #2066: SOLR-14975: Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames.

2020-11-06 Thread GitBox


bruno-roustant opened a new pull request #2066:
URL: https://github.com/apache/lucene-solr/pull/2066


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (SOLR-14987) SolrStream ends up creating a new HttpSolrClient for every replica being queried instead of reusing for the same node

2020-11-06 Thread Timothy Potter (Jira)
Timothy Potter created SOLR-14987:
-

 Summary: SolrStream ends up creating a new HttpSolrClient for 
every replica being queried instead of reusing for the same node
 Key: SOLR-14987
 URL: https://issues.apache.org/jira/browse/SOLR-14987
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: streaming expressions
Reporter: Timothy Potter
Assignee: Timothy Potter


While looking into some streaming expression performance issues when there are 
many collections with many shards being queried, I found that SolrStream's open 
method creates a new {{HttpSolrClient}} for every replica being queried. For 
instance, in my test case, I have 10 collections with 100 shards each (rf=1) 
and I get 1000 HttpSolrClient instances in my SolrClientCache. If I reuse one 
HttpSolrClient per node hosting a replica (so 10 in my case), the query time 
for my expression drops by half (not to mention the reduced allocation load on 
the JVM).
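
A rough sketch of the per-node reuse idea (illustrative only, not the actual SolrClientCache change; the URL parsing and the class/field names are assumptions):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.solr.client.solrj.impl.HttpSolrClient;

/** Sketch only: one HttpSolrClient per Solr node instead of one per replica. */
final class PerNodeClientCache {

  private final Map<String, HttpSolrClient> clientByNode = new ConcurrentHashMap<>();

  /** replicaCoreUrl is e.g. "http://host:8983/solr/collection1_shard1_replica_n1". */
  HttpSolrClient clientFor(String replicaCoreUrl) {
    // Key the cache on the node base URL; the core name is then passed per request instead.
    String nodeBaseUrl = replicaCoreUrl.substring(0, replicaCoreUrl.indexOf("/solr") + "/solr".length());
    return clientByNode.computeIfAbsent(nodeBaseUrl, url -> new HttpSolrClient.Builder(url).build());
  }

  void close() {
    clientByNode.values().forEach(client -> {
      try {
        client.close();
      } catch (Exception e) {
        // best effort on shutdown (sketch only)
      }
    });
  }
}
{code}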



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] msokolov commented on a change in pull request #2022: LUCENE-9004: KNN vector search using NSW graphs

2020-11-06 Thread GitBox


msokolov commented on a change in pull request #2022:
URL: https://github.com/apache/lucene-solr/pull/2022#discussion_r518902276



##
File path: lucene/core/src/java/org/apache/lucene/index/VectorValues.java
##
@@ -74,6 +74,18 @@ public BytesRef binaryValue() throws IOException {
 throw new UnsupportedOperationException();
   }
 
+  /**
+   * Return the k nearest neighbor documents as determined by comparison of 
their vector values
+   * for this field, to the given vector, by the field's search strategy. If 
the search strategy is
+   * reversed, lower values indicate nearer vectors, otherwise higher scores 
indicate nearer
+   * vectors. Unlike relevance scores, vector scores may be negative.
+   * @param target the vector-valued query
+   * @param k  the number of docs to return
+   * @param fanout control the accuracy/speed tradeoff - larger values give 
better recall at higher cost

Review comment:
   Yeah that's a good point. While experimenting with GloVe I'm learning 
that different settings are appropriate for different vectors, so field-level 
control might be needed. I'm not sure how codec-level controls are exposed. 
Don't Codecs get created automatically using no-args constructors and service 
autodiscovery? Did you mean something like perFieldVectorFormat? Except I doubt 
we need a new format; it's more about some metadata values that we would store 
in the field, so I think yeah it would go in FieldInfo. But I'm reluctant to 
expose hnsw-specific hyperparameters in `VectorField`, since we want it to 
support other algorithms as well. Maybe this is a good use case for 
`IndexableField.getAttributes()`?
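
   If attributes turn out to be the vehicle, the consuming side could stay very small. A hedged sketch (the attribute keys are invented; only FieldInfo#getAttribute is an existing API):

```java
import org.apache.lucene.index.FieldInfo;

/** Sketch only: reading per-field ANN hyperparameters from field attributes. */
final class HnswFieldParams {

  // Hypothetical attribute keys; nothing like this exists in the PR.
  static final String MAX_CONN_KEY = "hnsw.maxConn";
  static final String BEAM_WIDTH_KEY = "hnsw.beamWidth";

  static int maxConn(FieldInfo fieldInfo, int defaultValue) {
    String value = fieldInfo.getAttribute(MAX_CONN_KEY);
    return value == null ? defaultValue : Integer.parseInt(value);
  }

  static int beamWidth(FieldInfo fieldInfo, int defaultValue) {
    String value = fieldInfo.getAttribute(BEAM_WIDTH_KEY);
    return value == null ? defaultValue : Integer.parseInt(value);
  }
}
```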





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2020-11-06 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227534#comment-17227534
 ] 

Michael Sokolov commented on LUCENE-9583:
-

I worked up a version of LUCENE-9004 that uses docids in all public APIs, while 
still exposing random access via docid. This led to a measurable slowdown since 
it requires mapping back-and-forth between docids and ordinals.

These are results on an internal dataset. I'm showing the median latency of 
three runs in each case. The net is about a 10% increase in query latency and 
about 28% increase in indexing time.

h3. using ordinal API

||recall|| latency|| nDoc|| fanout|| maxConn|| beamWidth|| visited|| index ms||
|0.914| 1.73| 100 |0| 64| 500| 2126| 4628273|
|0.921 |1.86| 100 |10| 64| 500| 2260| 0|
|0.924 |2.02 |100 |20 |64 |500 |2389 |0|
|0.941 | 2.27 |100 |40 |64| 500 |2644| 0|

h3. using docId-only API

||recall|| latency|| nDoc|| fanout|| maxConn|| beamWidth|| visited|| index ms||
|0.910| 1.92| 100| 0| 64| 500| 2084| 5929137|
|0.920| 2.05| 100| 10| 64| 500| 2217| 0|
|0.949| 2.21| 100| 20| 64| 500| 2399| 0|
|0.959| 2.51| 100| 40| 64| 500| 2671| 0|

Please note that there is precedent for exposing "internal" ordinals as part 
of our API in {{SortedDocValues}}, so we shouldn't shy away from that if it 
brings value.

I haven't had time to try out forward-only iteration, but I do expect it would 
introduce some marginal performance regression and considerably complicate the 
implementation of hnsw at least. Finally I'll remind everyone that we have a 
perfectly good forward-only iteration API for fetching binary data 
(BinaryDocValues), and that the genesis of this format was indeed the need for 
random access over vectors.

I'd appreciate it if folks with concerns could review the attached PR, which I 
think does a credible job of moving the random-access API into a place where it 
doesn't intrude on the main VectorValues API. That patch has been out for a 
week or so and I plan to push it soon if there are no further comments there 
(thanks for approving, @mccandless!). I recognize this topic is somewhat 
controversial, but I believe we can make rapid progress by iterating on code 
and measuring results.
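
For readers wondering where the mapping overhead comes from, here is a hedged sketch of the kind of translation a docid-only API forces on the graph code (illustrative, not the attached patch): ordinal-to-docid is a cheap array read, but docid-to-ordinal needs a binary search on every graph hop, which is plausibly where the extra latency in the docid-only runs above comes from.

{code:java}
/** Sketch only: translating between vector ordinals and docids. */
final class OrdToDocSketch {

  private final int[] ordToDoc; // docids in increasing order, one entry per vector ordinal

  OrdToDocSketch(int[] ordToDoc) {
    this.ordToDoc = ordToDoc;
  }

  int docId(int ord) {
    return ordToDoc[ord]; // cheap array lookup
  }

  int ordinal(int docId) {
    // binary search, paid on every traversal step in a docid-only API
    int lo = 0, hi = ordToDoc.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (ordToDoc[mid] < docId) {
        lo = mid + 1;
      } else if (ordToDoc[mid] > docId) {
        hi = mid - 1;
      } else {
        return mid;
      }
    }
    return -1; // no vector for this doc
  }
}
{code}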

> How should we expose VectorValues.RandomAccess?
> ---
>
> Key: LUCENE-9583
> URL: https://issues.apache.org/jira/browse/LUCENE-9583
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In the newly-added {{VectorValues}} API, we have a {{RandomAccess}} 
> sub-interface. [~jtibshirani] pointed out this is not needed by some 
> vector-indexing strategies which can operate solely using a forward-iterator 
> (it is needed by HNSW), and so in the interest of simplifying the public API 
> we should not expose this internal detail (which by the way surfaces internal 
> ordinals that are somewhat uninteresting outside the random access API).
> I looked into how to move this inside the HNSW-specific code and remembered 
> that we do also currently make use of the RA API when merging vector fields 
> over sorted indexes. Without it, we would need to load all vectors into RAM  
> while flushing/merging, as we currently do in 
> {{BinaryDocValuesWriter.BinaryDVs}}. I wonder if it's worth paying this cost 
> for the simpler API.
> Another thing I noticed while reviewing this is that I moved the KNN 
> {{search(float[] target, int topK, int fanout)}} method from {{VectorValues}} 
>  to {{VectorValues.RandomAccess}}. This I think we could move back, and 
> handle the HNSW requirements for search elsewhere. I wonder if that would 
> alleviate the major concern here? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #1571: SOLR-14560: Interleaving for Learning To Rank

2020-11-06 Thread GitBox


alessandrobenedetti commented on a change in pull request #1571:
URL: https://github.com/apache/lucene-solr/pull/1571#discussion_r518869895



##
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/interleaving/TeamDraftInterleaving.java
##
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.ltr.interleaving;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.LinkedHashSet;
+import java.util.Random;
+import java.util.Set;
+
+import org.apache.lucene.search.ScoreDoc;
+
+public class TeamDraftInterleaving implements Interleaving{
+  public static Random RANDOM;
+
+  static {
+// We try to make things reproducible in the context of our tests by 
initializing the random instance
+// based on the current seed
+String seed = System.getProperty("tests.seed");

Review comment:
   Now I remember: I worked on this a lot and ended up with the current 
solution, even with the lookup of the test seed system property.
   I did not introduce that approach on my own; I took inspiration from other 
areas of the Solr code, such as:
   
org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java:174
   org/apache/solr/core/BlobRepository.java:73
   org/apache/solr/servlet/HttpSolrCall.java:152
   org/apache/solr/util/tracing/GlobalTracer.java:35
   
   So it is already used quite widely across Solr production code.
   Unless you have stronger concerns we can probably leave it as it is and 
resolve the conversation here :)
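
   For reference, the pattern under discussion boils down to something like this (a sketch; the null handling is assumed, since the diff above is truncated):

```java
import java.util.Random;

/** Sketch only: seed the Random from the randomized-testing seed when it is present. */
final class SeededRandomSketch {

  static final Random RANDOM;

  static {
    // Makes test runs reproducible; in production the property is absent and we fall back.
    String seed = System.getProperty("tests.seed");
    RANDOM = (seed == null) ? new Random() : new Random(seed.hashCode());
  }
}
```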





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9378) Configurable compression for BinaryDocValues

2020-11-06 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227469#comment-17227469
 ] 

Michael McCandless commented on LUCENE-9378:


{quote}I'm wondering if there is anything I should know about if/how it would 
be a bad idea to continue using the `Lucene70DocValuesFormat` until this issue 
is resolved?
{quote}
Just be aware that you don't have the backwards compatibility guarantee you 
would normally have.

E.g. in the future, when you upgrade to a Lucene 9.x release, it will not be able 
to read the indices you are creating with {{Lucene70DocValuesFormat}}.

Also, few people are fixing bugs in that format, though it is/was widely used 
so is likely nearly bug free ;)
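
For anyone following along, the custom-codec route described in the issue might look roughly like this. It is a hedged sketch, not a supported configuration: it assumes the 8.7 default codec exposes getDocValuesFormatForField and that the "Lucene70" doc values format is resolvable from the backward-codecs module.

{code:java}
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.lucene87.Lucene87Codec;

/** Sketch only: opt every field back into the old, uncompressed 7.x doc values format. */
class Uncompressed70DocValuesCodec extends Lucene87Codec {

  private final DocValuesFormat dvFormat = DocValuesFormat.forName("Lucene70");

  @Override
  public DocValuesFormat getDocValuesFormatForField(String field) {
    return dvFormat;
  }
}

// usage (sketch): indexWriterConfig.setCodec(new Uncompressed70DocValuesCodec());
{code}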

> Configurable compression for BinaryDocValues
> 
>
> Key: LUCENE-9378
> URL: https://issues.apache.org/jira/browse/LUCENE-9378
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Viral Gandhi
>Priority: Minor
> Attachments: hotspots-v76x.png, hotspots-v76x.png, hotspots-v76x.png, 
> hotspots-v76x.png, hotspots-v76x.png, hotspots-v77x.png, hotspots-v77x.png, 
> hotspots-v77x.png, hotspots-v77x.png, image-2020-06-12-22-17-30-339.png, 
> image-2020-06-12-22-17-53-961.png, image-2020-06-12-22-18-24-527.png, 
> image-2020-06-12-22-18-48-919.png, snapshot-v77x.nps, snapshot-v77x.nps, 
> snapshot-v77x.nps, snapshots-v76x.nps, snapshots-v76x.nps, snapshots-v76x.nps
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Lucene 8.5.1 includes a change to always [compress 
> BinaryDocValues|https://issues.apache.org/jira/browse/LUCENE-9211]. This 
> caused (~30%) reduction in our red-line QPS (throughput). 
> We think users should be given some way to opt-in for this compression 
> feature instead of always being enabled which can have a substantial query 
> time cost as we saw during our upgrade. [~mikemccand] suggested one possible 
> approach by introducing a *mode* in Lucene80DocValuesFormat (COMPRESSED and 
> UNCOMPRESSED) and allowing users to create a custom Codec subclassing the 
> default Codec and pick the format they want.
> Idea is similar to Lucene50StoredFieldsFormat which has two modes, 
> Mode.BEST_SPEED and Mode.BEST_COMPRESSION.
> Here's related issues for adding benchmark covering BINARY doc values 
> query-time performance - [https://github.com/mikemccand/luceneutil/issues/61]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2052: LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag, and rename to DirectIODirectory

2020-11-06 Thread GitBox


mikemccand commented on a change in pull request #2052:
URL: https://github.com/apache/lucene-solr/pull/2052#discussion_r518846067



##
File path: lucene/misc/src/java/org/apache/lucene/store/DirectIODirectory.java
##
@@ -164,15 +149,16 @@ public IndexOutput createOutput(String name, IOContext 
context) throws IOExcepti
 private long fileLength;
 private boolean isOpen;
 
-public NativeUnixIndexOutput(Path path, String name, int bufferSize) 
throws IOException {
-  super("NativeUnixIndexOutput(path=\"" + path.toString() + "\")", name);
-  //this.path = path;
-  final FileDescriptor fd = NativePosixUtil.open_direct(path.toString(), 
false);
-  fos = new FileOutputStream(fd);
-  //fos = new FileOutputStream(path);
-  channel = fos.getChannel();
-  buffer = ByteBuffer.allocateDirect(bufferSize);
-  this.bufferSize = bufferSize;
+  @SuppressForbidden(reason = "com.sun.nio.file.ExtendedOpenOption: Direct I/O 
with FileChannel requires the use of internal proprietary API 
ExtendedOpenOption.DIRECT")
+  public DirectIOIndexOutput(Path path, String name) throws IOException {
+  super("DirectIOIndexOutput(path=\"" + path.toString() + "\")", name);
+
+  int blockSize = Math.toIntExact(Files.getFileStore(path).getBlockSize());
+  bufferSize = Math.addExact(blockSize, blockSize - 1);
+  buffer = ByteBuffer.allocateDirect(bufferSize).alignedSlice(blockSize);

Review comment:
   Hmm I wonder why we got away without using `.alignedSlice` before?
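
   For context, a sketch of what the alignment amounts to, spelled out with the JDK 9+ helpers (names assumed; the PR's one-liner with alignedSlice is equivalent in spirit). One guess for the old code: pre-JDK-9 allocateDirect happened to page-align its buffers, which also satisfied the 512-byte requirement.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch only: allocate a direct buffer whose start is aligned to the filesystem block size. */
final class AlignedBufferSketch {

  static ByteBuffer allocateAligned(Path path, int capacity) throws IOException {
    int blockSize = Math.toIntExact(Files.getFileStore(path).getBlockSize());
    // Over-allocate by blockSize - 1 so an aligned window of `capacity` bytes must exist.
    ByteBuffer raw = ByteBuffer.allocateDirect(capacity + blockSize - 1);
    // Bytes to skip from the buffer's base address to the next block boundary.
    int skip = (blockSize - raw.alignmentOffset(0, blockSize)) % blockSize;
    raw.position(skip);
    raw.limit(skip + capacity);
    return raw.slice(); // same idea as raw.alignedSlice(blockSize)
  }
}
```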

##
File path: lucene/CHANGES.txt
##
@@ -156,6 +156,9 @@ Improvements
 * LUCENE-9531: Consolidated CharStream and FastCharStream classes: these have 
been moved
   from each query parser package to org.apache.lucene.queryparser.charstream 
(Dawid Weiss).
 
+* LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO 
flag,

Review comment:
   Will this be Lucene 9.0 only?
   
   The `com.sun.nio.file.ExtendedOpenOption.DIRECT` was added as of JDK 10 
right?
   
   Since Lucene 8.x still allows Java 8.x (and Lucene 9.x will require Java 
11.x minimum), I think this improvement must be Lucene 9.x only?

##
File path: lucene/misc/src/java/org/apache/lucene/store/DirectIODirectory.java
##
@@ -66,45 +66,32 @@
  *
  * @lucene.experimental
  */
-public class NativeUnixDirectory extends FSDirectory {
+public class DirectIODirectory extends FSDirectory {
 
   // TODO: this is OS dependent, but likely 512 is the LCD
   private final static long ALIGN = 512;
   private final static long ALIGN_NOT_MASK = ~(ALIGN-1);
-  
-  /** Default buffer size before writing to disk (256 KB);
-   *  larger means less IO load but more RAM and direct
-   *  buffer storage space consumed during merging. */
-
-  public final static int DEFAULT_MERGE_BUFFER_SIZE = 262144;
 
   /** Default min expected merge size before direct IO is
*  used (10 MB): */
   public final static long DEFAULT_MIN_BYTES_DIRECT = 10*1024*1024;
 
-  private final int mergeBufferSize;
   private final long minBytesDirect;
   private final Directory delegate;
 
   /** Create a new NIOFSDirectory for the named location.
* 
* @param path the path of the directory
-   * @param lockFactory to use
-   * @param mergeBufferSize Size of buffer to use for
-   *merging.  See {@link #DEFAULT_MERGE_BUFFER_SIZE}.
* @param minBytesDirect Merges, or files to be opened for
*   reading, smaller than this will
*   not use direct IO.  See {@link
*   #DEFAULT_MIN_BYTES_DIRECT}
+   * @param lockFactory to use
* @param delegate fallback Directory for non-merges
* @throws IOException If there is a low-level I/O error
*/
-  public NativeUnixDirectory(Path path, int mergeBufferSize, long 
minBytesDirect, LockFactory lockFactory, Directory delegate) throws IOException 
{
+  public DirectIODirectory(Path path, long minBytesDirect, LockFactory 
lockFactory, Directory delegate) throws IOException {
 super(path, lockFactory);
-if ((mergeBufferSize & ALIGN) != 0) {
-  throw new IllegalArgumentException("mergeBufferSize must be 0 mod " + 
ALIGN + " (got: " + mergeBufferSize + ")");
-}
-this.mergeBufferSize = mergeBufferSize;

Review comment:
   Hmm does this mean we are losing the byte buffering during merging?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mayya-sharipova commented on pull request #2063: LUCENE-9599 Make comparator aware of index sorting

2020-11-06 Thread GitBox


mayya-sharipova commented on pull request #2063:
URL: https://github.com/apache/lucene-solr/pull/2063#issuecomment-723131924


   @msokolov Thanks a lot for another round of review; I tried to address it 
in c52d9a73



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9537) Add Indri Search Engine Functionality to Lucene

2020-11-06 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227444#comment-17227444
 ] 

Michael McCandless commented on LUCENE-9537:


Thank you for all the hard work here [~cvandenberg]!  A few quick questions:
 * Could you open a GitHub PR for this?  It'd make inline commenting easier.
 * Maybe the new {{smoothingScore}} method could be implemented in the abstract 
{{Scorer}} base class with a default implementation returning 0? (See the 
sketch after this list.)
 * Please try to match Lucene's coding style, e.g. 2-spaced indents, if 
possible.
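
A hedged sketch of the second suggestion, shown on a stand-in base class rather than the real Scorer:

{code:java}
import java.io.IOException;

/** Sketch only: a default smoothingScore keeps existing scorers unchanged. */
abstract class ScorerSketch {

  /** Existing scoring API (stand-in for Scorer.score()). */
  public abstract float score() throws IOException;

  /**
   * Suggested addition: score contribution for a document that does not contain the term.
   * Defaults to 0 so current behavior is preserved; an Indri-style scorer overrides it
   * with its Dirichlet-smoothed estimate.
   */
  public float smoothingScore(int docId) throws IOException {
    return 0f;
  }
}
{code}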

> Add Indri Search Engine Functionality to Lucene
> ---
>
> Key: LUCENE-9537
> URL: https://issues.apache.org/jira/browse/LUCENE-9537
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Cameron VandenBerg
>Priority: Major
>  Labels: patch
> Attachments: LUCENE-9537.patch, LUCENE-INDRI.patch
>
>
> Indri ([http://lemurproject.org/indri.php]) is an academic search engine 
> developed by The University of Massachusetts and Carnegie Mellon University.  
> The major difference between Lucene and Indri is that Indri will give a 
> "smoothing score" to a document that does not contain the search 
> term, which has improved the search ranking accuracy in our experiments.  I 
> have created an Indri patch, which adds the search code needed to implement 
> the Indri AND logic as well as Indri's implementation of Dirichlet Smoothing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mayya-sharipova commented on a change in pull request #2063: LUCENE-9599 Make comparator aware of index sorting

2020-11-06 Thread GitBox


mayya-sharipova commented on a change in pull request #2063:
URL: https://github.com/apache/lucene-solr/pull/2063#discussion_r518813515



##
File path: 
lucene/core/src/test/org/apache/lucene/search/TestFieldSortOptimizationSkipping.java
##
@@ -485,4 +491,249 @@ public void testDocSort() throws IOException {
 dir.close();
   }
 
+  public void testNumericSortOptimizationIndexSort() throws IOException {
+final Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig(new MockAnalyzer(random()));
+boolean reverseSort = randomBoolean();
+final SortField sortField = new SortField("field1", SortField.Type.LONG, 
reverseSort);
+Sort indexSort = new Sort(sortField);
+iwc.setIndexSort(indexSort);
+final IndexWriter writer = new IndexWriter(dir, iwc);
+
+final int numDocs = atLeast(50);
+int[] sortedValues = initializeNumericValues(numDocs, reverseSort, 0);
+int[] randomIdxs = randomIdxs(numDocs);
+
+for (int i = 0; i < numDocs; i++) {
+  final Document doc = new Document();
+  if (sortedValues[randomIdxs[i]] > 0) {
+doc.add(new NumericDocValuesField("field1", 
sortedValues[randomIdxs[i]]));
+  }
+  writer.addDocument(doc);
+  if (i == 30) {
+writer.flush();
+  }
+}
+final IndexReader reader = DirectoryReader.open(writer);
+writer.close();
+
+IndexSearcher searcher = newSearcher(reader);
+final int numHits = randomIntBetween(1, numDocs - 10);
+final int totalHitsThreshold = randomIntBetween(1, numDocs - 10);
+{
+  // test that optimization is run when search sort is equal to the index 
sort
+  TopFieldCollector collector = TopFieldCollector.create(indexSort, 
numHits, null, totalHitsThreshold);
+  searcher.search(new MatchAllDocsQuery(), collector);
+  TopDocs topDocs = collector.topDocs();
+  assertTrue(collector.isEarlyTerminated());

Review comment:
   You are right, they don't distinguish this. The idea is that if we remove 
early termination in TopFieldCollector, these tests should still work.
   
   If all this is confusing, I can also upgrade this PR to remove early 
termination in TopFieldCollector. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mayya-sharipova commented on a change in pull request #2063: LUCENE-9599 Make comparator aware of index sorting

2020-11-06 Thread GitBox


mayya-sharipova commented on a change in pull request #2063:
URL: https://github.com/apache/lucene-solr/pull/2063#discussion_r518811869



##
File path: lucene/core/src/java/org/apache/lucene/search/FieldComparator.java
##
@@ -387,125 +346,225 @@ protected SortedDocValues 
getSortedDocValues(LeafReaderContext context, String f
 
 @Override
 public LeafFieldComparator getLeafComparator(LeafReaderContext context) 
throws IOException {
-  termsIndex = getSortedDocValues(context, field);
-  currentReaderGen++;
+  return new TermOrdValLeafComparator(context);
+}
 
-  if (topValue != null) {
-// Recompute topOrd/SameReader
-int ord = termsIndex.lookupTerm(topValue);
-if (ord >= 0) {
-  topSameReader = true;
-  topOrd = ord;
-} else {
-  topSameReader = false;
-  topOrd = -ord-2;
+@Override
+public BytesRef value(int slot) {
+  return values[slot];
+}
+
+@Override
+public int compareValues(BytesRef val1, BytesRef val2) {
+  if (val1 == null) {
+if (val2 == null) {
+  return 0;
 }
-  } else {
-topOrd = missingOrd;
-topSameReader = true;
+return missingSortCmp;
+  } else if (val2 == null) {
+return -missingSortCmp;
   }
-  //System.out.println("  getLeafComparator topOrd=" + topOrd + " 
topSameReader=" + topSameReader);
+  return val1.compareTo(val2);
+}
 
-  if (bottomSlot != -1) {
-// Recompute bottomOrd/SameReader
-setBottom(bottomSlot);
+/**
+ * Leaf comparator for {@link TermOrdValComparator} that provides skipping 
functionality when index is sorted
+ */
+public class TermOrdValLeafComparator implements LeafFieldComparator {
+  private final SortedDocValues termsIndex;
+  private boolean indexSort = false; // true if a query sort is a part of 
the index sort
+  private DocIdSetIterator competitiveIterator;
+  private boolean collectedAllCompetitiveHits = false;
+  private boolean iteratorUpdated = false;
+
+  public TermOrdValLeafComparator(LeafReaderContext context) throws 
IOException {
+termsIndex = getSortedDocValues(context, field);
+currentReaderGen++;
+if (topValue != null) {
+  // Recompute topOrd/SameReader
+  int ord = termsIndex.lookupTerm(topValue);
+  if (ord >= 0) {
+topSameReader = true;
+topOrd = ord;
+  } else {
+topSameReader = false;
+topOrd = -ord-2;
+  }
+} else {
+  topOrd = missingOrd;
+  topSameReader = true;
+}
+if (bottomSlot != -1) {
+  // Recompute bottomOrd/SameReader
+  setBottom(bottomSlot);
+}
+this.competitiveIterator = 
DocIdSetIterator.all(context.reader().maxDoc());
   }
 
-  return this;
-}
-
-@Override
-public void setBottom(final int bottom) throws IOException {
-  bottomSlot = bottom;
+  protected SortedDocValues getSortedDocValues(LeafReaderContext context, 
String field) throws IOException {
+return DocValues.getSorted(context.reader(), field);
+  }
 
-  bottomValue = values[bottomSlot];
-  if (currentReaderGen == readerGen[bottomSlot]) {
-bottomOrd = ords[bottomSlot];
-bottomSameReader = true;
-  } else {
-if (bottomValue == null) {
-  // missingOrd is null for all segments
-  assert ords[bottomSlot] == missingOrd;
-  bottomOrd = missingOrd;
+  @Override
+  public void setBottom(final int slot) throws IOException {
+bottomSlot = slot;
+bottomValue = values[bottomSlot];
+if (currentReaderGen == readerGen[bottomSlot]) {
+  bottomOrd = ords[bottomSlot];
   bottomSameReader = true;
-  readerGen[bottomSlot] = currentReaderGen;
 } else {
-  final int ord = termsIndex.lookupTerm(bottomValue);
-  if (ord < 0) {
-bottomOrd = -ord - 2;
-bottomSameReader = false;
-  } else {
-bottomOrd = ord;
-// exact value match
+  if (bottomValue == null) {
+// missingOrd is null for all segments
+assert ords[bottomSlot] == missingOrd;
+bottomOrd = missingOrd;
 bottomSameReader = true;
-readerGen[bottomSlot] = currentReaderGen;
-ords[bottomSlot] = bottomOrd;
+readerGen[bottomSlot] = currentReaderGen;
+  } else {
+final int ord = termsIndex.lookupTerm(bottomValue);
+if (ord < 0) {
+  bottomOrd = -ord - 2;
+  bottomSameReader = false;
+} else {
+  bottomOrd = ord;
+  // exact value match
+  bottomSameReader = true;
+  readerGen[bottomSlot] = currentReaderGen;
+  ords[bottom

[GitHub] [lucene-solr] mayya-sharipova commented on a change in pull request #2063: LUCENE-9599 Make comparator aware of index sorting

2020-11-06 Thread GitBox


mayya-sharipova commented on a change in pull request #2063:
URL: https://github.com/apache/lucene-solr/pull/2063#discussion_r518803324



##
File path: lucene/core/src/java/org/apache/lucene/search/FieldComparator.java
##
@@ -387,125 +346,225 @@ protected SortedDocValues 
getSortedDocValues(LeafReaderContext context, String f
 
 @Override
 public LeafFieldComparator getLeafComparator(LeafReaderContext context) 
throws IOException {
-  termsIndex = getSortedDocValues(context, field);
-  currentReaderGen++;
+  return new TermOrdValLeafComparator(context);
+}
 
-  if (topValue != null) {
-// Recompute topOrd/SameReader
-int ord = termsIndex.lookupTerm(topValue);
-if (ord >= 0) {
-  topSameReader = true;
-  topOrd = ord;
-} else {
-  topSameReader = false;
-  topOrd = -ord-2;
+@Override
+public BytesRef value(int slot) {
+  return values[slot];
+}
+
+@Override
+public int compareValues(BytesRef val1, BytesRef val2) {
+  if (val1 == null) {
+if (val2 == null) {
+  return 0;
 }
-  } else {
-topOrd = missingOrd;
-topSameReader = true;
+return missingSortCmp;
+  } else if (val2 == null) {
+return -missingSortCmp;
   }
-  //System.out.println("  getLeafComparator topOrd=" + topOrd + " 
topSameReader=" + topSameReader);
+  return val1.compareTo(val2);
+}
 
-  if (bottomSlot != -1) {
-// Recompute bottomOrd/SameReader
-setBottom(bottomSlot);
+/**
+ * Leaf comparator for {@link TermOrdValComparator} that provides skipping 
functionality when index is sorted
+ */
+public class TermOrdValLeafComparator implements LeafFieldComparator {
+  private final SortedDocValues termsIndex;
+  private boolean indexSort = false; // true if a query sort is a part of 
the index sort
+  private DocIdSetIterator competitiveIterator;
+  private boolean collectedAllCompetitiveHits = false;
+  private boolean iteratorUpdated = false;
+
+  public TermOrdValLeafComparator(LeafReaderContext context) throws 
IOException {
+termsIndex = getSortedDocValues(context, field);
+currentReaderGen++;
+if (topValue != null) {
+  // Recompute topOrd/SameReader
+  int ord = termsIndex.lookupTerm(topValue);
+  if (ord >= 0) {
+topSameReader = true;
+topOrd = ord;
+  } else {
+topSameReader = false;
+topOrd = -ord-2;
+  }
+} else {
+  topOrd = missingOrd;
+  topSameReader = true;
+}
+if (bottomSlot != -1) {
+  // Recompute bottomOrd/SameReader
+  setBottom(bottomSlot);
+}
+this.competitiveIterator = 
DocIdSetIterator.all(context.reader().maxDoc());
   }
 
-  return this;
-}
-
-@Override
-public void setBottom(final int bottom) throws IOException {
-  bottomSlot = bottom;
+  protected SortedDocValues getSortedDocValues(LeafReaderContext context, 
String field) throws IOException {
+return DocValues.getSorted(context.reader(), field);
+  }
 
-  bottomValue = values[bottomSlot];
-  if (currentReaderGen == readerGen[bottomSlot]) {
-bottomOrd = ords[bottomSlot];
-bottomSameReader = true;
-  } else {
-if (bottomValue == null) {
-  // missingOrd is null for all segments
-  assert ords[bottomSlot] == missingOrd;
-  bottomOrd = missingOrd;
+  @Override
+  public void setBottom(final int slot) throws IOException {
+bottomSlot = slot;
+bottomValue = values[bottomSlot];
+if (currentReaderGen == readerGen[bottomSlot]) {
+  bottomOrd = ords[bottomSlot];
   bottomSameReader = true;
-  readerGen[bottomSlot] = currentReaderGen;
 } else {
-  final int ord = termsIndex.lookupTerm(bottomValue);
-  if (ord < 0) {
-bottomOrd = -ord - 2;
-bottomSameReader = false;
-  } else {
-bottomOrd = ord;
-// exact value match
+  if (bottomValue == null) {
+// missingOrd is null for all segments
+assert ords[bottomSlot] == missingOrd;
+bottomOrd = missingOrd;
 bottomSameReader = true;
-readerGen[bottomSlot] = currentReaderGen;
-ords[bottomSlot] = bottomOrd;
+readerGen[bottomSlot] = currentReaderGen;
+  } else {
+final int ord = termsIndex.lookupTerm(bottomValue);
+if (ord < 0) {
+  bottomOrd = -ord - 2;
+  bottomSameReader = false;
+} else {
+  bottomOrd = ord;
+  // exact value match
+  bottomSameReader = true;
+  readerGen[bottomSlot] = currentReaderGen;
+  ords[bottom

[GitHub] [lucene-solr] mayya-sharipova commented on a change in pull request #2063: LUCENE-9599 Make comparator aware of index sorting

2020-11-06 Thread GitBox


mayya-sharipova commented on a change in pull request #2063:
URL: https://github.com/apache/lucene-solr/pull/2063#discussion_r518800810



##
File path: 
lucene/core/src/test/org/apache/lucene/search/TestFieldSortOptimizationSkipping.java
##
@@ -485,4 +491,249 @@ public void testDocSort() throws IOException {
 dir.close();
   }
 
+  public void testNumericSortOptimizationIndexSort() throws IOException {
+final Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig(new MockAnalyzer(random()));
+boolean reverseSort = randomBoolean();
+final SortField sortField = new SortField("field1", SortField.Type.LONG, reverseSort);
+Sort indexSort = new Sort(sortField);
+iwc.setIndexSort(indexSort);
+final IndexWriter writer = new IndexWriter(dir, iwc);
+
+final int numDocs = atLeast(50);
+int[] sortedValues = initializeNumericValues(numDocs, reverseSort, 0);
+int[] randomIdxs = randomIdxs(numDocs);
+
+for (int i = 0; i < numDocs; i++) {
+  final Document doc = new Document();
+  if (sortedValues[randomIdxs[i]] > 0) {
+doc.add(new NumericDocValuesField("field1", sortedValues[randomIdxs[i]]));
+  }
+  writer.addDocument(doc);
+  if (i == 30) {
+writer.flush();
+  }
+}
+final IndexReader reader = DirectoryReader.open(writer);
+writer.close();

Review comment:
   Thank you, good suggestion.  Addressed in c52d9a7

##
File path: 
lucene/core/src/test/org/apache/lucene/search/TestFieldSortOptimizationSkipping.java
##
@@ -485,4 +491,249 @@ public void testDocSort() throws IOException {
 dir.close();
   }
 
+  public void testNumericSortOptimizationIndexSort() throws IOException {
+final Directory dir = newDirectory();
+IndexWriterConfig iwc = new IndexWriterConfig(new MockAnalyzer(random()));
+boolean reverseSort = randomBoolean();
+final SortField sortField = new SortField("field1", SortField.Type.LONG, reverseSort);
+Sort indexSort = new Sort(sortField);
+iwc.setIndexSort(indexSort);
+final IndexWriter writer = new IndexWriter(dir, iwc);
+
+final int numDocs = atLeast(50);
+int[] sortedValues = initializeNumericValues(numDocs, reverseSort, 0);
+int[] randomIdxs = randomIdxs(numDocs);
+
+for (int i = 0; i < numDocs; i++) {
+  final Document doc = new Document();
+  if (sortedValues[randomIdxs[i]] > 0) {
+doc.add(new NumericDocValuesField("field1", sortedValues[randomIdxs[i]]));
+  }
+  writer.addDocument(doc);
+  if (i == 30) {
+writer.flush();
+  }
+}
+final IndexReader reader = DirectoryReader.open(writer);
+writer.close();
+
+IndexSearcher searcher = newSearcher(reader);
+final int numHits = randomIntBetween(1, numDocs - 10);
+final int totalHitsThreshold = randomIntBetween(1, numDocs - 10);
+{

Review comment:
   Addressed in c52d9a7





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mayya-sharipova commented on a change in pull request #2063: LUCENE-9599 Make comparator aware of index sorting

2020-11-06 Thread GitBox


mayya-sharipova commented on a change in pull request #2063:
URL: https://github.com/apache/lucene-solr/pull/2063#discussion_r518799913



##
File path: 
lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java
##
@@ -218,7 +237,7 @@ public void visit(int docID, byte[] packedValue) {
 
 @Override
 public DocIdSetIterator competitiveIterator() {
-if (enableSkipping == false) return null;
+if (enableSkipping == false && indexSort == false) return null;

Review comment:
   Addressed in c52d9a7. I also have a plan to do a separate PR to just fix formatting.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mayya-sharipova commented on a change in pull request #2063: LUCENE-9599 Make comparator aware of index sorting

2020-11-06 Thread GitBox


mayya-sharipova commented on a change in pull request #2063:
URL: https://github.com/apache/lucene-solr/pull/2063#discussion_r518799074



##
File path: lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java
##
@@ -139,6 +140,7 @@ void collectAnyHit(int doc, int hitsCollected) throws IOException {
 
 @Override
 public void setScorer(Scorable scorer) throws IOException {
+  if (canEarlyTerminate) comparator.usesIndexSort();

Review comment:
   Addressed in c52d9a7





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mayya-sharipova commented on a change in pull request #2063: LUCENE-9599 Make comparator aware of index sorting

2020-11-06 Thread GitBox


mayya-sharipova commented on a change in pull request #2063:
URL: https://github.com/apache/lucene-solr/pull/2063#discussion_r518798756



##
File path: lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java
##
@@ -100,6 +100,7 @@ boolean thresholdCheck(int doc) throws IOException {
 // since docs are visited in doc Id order, if compare is 0, it means
 // this document is largest than anything else in the queue, and
 // therefore not competitive.
+// TODO: remove early termination in TopFieldCollector, as this should be managed by comparators

Review comment:
   Indeed, this is a no-op operation, and will be removed in a following PR.
   I left it for now, as we also need to ensure that all necessary BulkScorers 
use collectors/comparators iterators.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2022: LUCENE-9004: KNN vector search using NSW graphs

2020-11-06 Thread GitBox


mikemccand commented on a change in pull request #2022:
URL: https://github.com/apache/lucene-solr/pull/2022#discussion_r518797861



##
File path: lucene/core/src/java/org/apache/lucene/index/VectorValues.java
##
@@ -74,6 +74,18 @@ public BytesRef binaryValue() throws IOException {
 throw new UnsupportedOperationException();
   }
 
+  /**
+   * Return the k nearest neighbor documents as determined by comparison of their vector values
+   * for this field, to the given vector, by the field's search strategy. If the search strategy is
+   * reversed, lower values indicate nearer vectors, otherwise higher scores indicate nearer
+   * vectors. Unlike relevance scores, vector scores may be negative.
+   * @param target the vector-valued query
+   * @param k  the number of docs to return
+   * @param fanout control the accuracy/speed tradeoff - larger values give better recall at higher cost

Review comment:
   > Yeah, I think there needs to be a follow-on exposing the index-time 
controls, which indeed are much more potent than this search-time fanout, which 
has only a small impact on recall and latency. In this patch they are globals 
in HnswGraphBuilder, but there is no API for setting them.
   
   OK, makes sense.
   
   > I am thinking the index-time hyperparameters would be specified in 
IndexWriterConfig?
   
   Hmm, maybe these could be codec level controls?  Or maybe `FieldInfo`?  They 
would be per-vector-field configuration right?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-8673) o.a.s.search.facet classes not public/extendable

2020-11-06 Thread Tim Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227421#comment-17227421
 ] 

Tim Owen commented on SOLR-8673:


Good idea, I've updated my patch with a new test - it simply confirms that a 
subclass in another package can extend AggValueSource and create SlotAcc and 
Merger objects.

> o.a.s.search.facet classes not public/extendable
> 
>
> Key: SOLR-8673
> URL: https://issues.apache.org/jira/browse/SOLR-8673
> Project: Solr
>  Issue Type: Improvement
>  Components: Facet Module
>Affects Versions: 5.4.1
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 6.2, 7.0
>
> Attachments: SOLR-8673.patch, SOLR-8673.patch
>
>
> It is not easy to create a custom JSON facet function. A simple function 
> based on AvgAgg quickly results in the following compilation failures:
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.3:compile (default-compile) 
> on project openindex-solr: Compilation failure: Compilation failure:
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[22,36]
>  org.apache.solr.search.facet.FacetContext is not public in 
> org.apache.solr.search.facet; cannot be accessed from outside package
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[23,36]
>  org.apache.solr.search.facet.FacetDoubleMerger is not public in 
> org.apache.solr.search.facet; cannot be accessed from outside package
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[40,32]
>  cannot find symbol
> [ERROR] symbol:   class FacetContext
> [ERROR] location: class i.o.s.search.facet.CustomAvgAgg
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[49,39]
>  cannot find symbol
> [ERROR] symbol:   class FacetDoubleMerger
> [ERROR] location: class i.o.s.search.facet.CustomAvgAgg
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[54,43]
>  cannot find symbol
> [ERROR] symbol:   class Context
> [ERROR] location: class i.o.s.search.facet.CustomAvgAgg.Merger
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[41,16]
>  cannot find symbol
> [ERROR] symbol:   class AvgSlotAcc
> [ERROR] location: class i.o.s.search.facet.CustomAvgAgg
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[46,12]
>  incompatible types: i.o.s.search.facet.CustomAvgAgg.Merger cannot be 
> converted to org.apache.solr.search.facet.FacetMerger
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[53,5]
>  method does not override or implement a method from a supertype
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[60,5]
>  method does not override or implement a method from a supertype
> {code}
> It seems lots of classes are tucked away in FacetModule, which we can't reach 
> from outside.
> Originates from this thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201602.mbox/%3ccab_8yd9ldbg_0zxm_h1igkfm6bqeypd5ilyy7tty8cztscv...@mail.gmail.com%3E
>  ( also available at 
> https://lists.apache.org/thread.html/9fddcad3136ec908ce1c57881f8d3069e5d153f08b71f80f3e18d995%401455019826%40%3Csolr-user.lucene.apache.org%3E
>  )



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-8673) o.a.s.search.facet classes not public/extendable

2020-11-06 Thread Tim Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-8673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Owen updated SOLR-8673:
---
Attachment: SOLR-8673.patch

> o.a.s.search.facet classes not public/extendable
> 
>
> Key: SOLR-8673
> URL: https://issues.apache.org/jira/browse/SOLR-8673
> Project: Solr
>  Issue Type: Improvement
>  Components: Facet Module
>Affects Versions: 5.4.1
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 6.2, 7.0
>
> Attachments: SOLR-8673.patch, SOLR-8673.patch
>
>
> It is not easy to create a custom JSON facet function. A simple function 
> based on AvgAgg quickly results in the following compilation failures:
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.3:compile (default-compile) 
> on project openindex-solr: Compilation failure: Compilation failure:
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[22,36]
>  org.apache.solr.search.facet.FacetContext is not public in 
> org.apache.solr.search.facet; cannot be accessed from outside package
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[23,36]
>  org.apache.solr.search.facet.FacetDoubleMerger is not public in 
> org.apache.solr.search.facet; cannot be accessed from outside package
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[40,32]
>  cannot find symbol
> [ERROR] symbol:   class FacetContext
> [ERROR] location: class i.o.s.search.facet.CustomAvgAgg
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[49,39]
>  cannot find symbol
> [ERROR] symbol:   class FacetDoubleMerger
> [ERROR] location: class i.o.s.search.facet.CustomAvgAgg
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[54,43]
>  cannot find symbol
> [ERROR] symbol:   class Context
> [ERROR] location: class i.o.s.search.facet.CustomAvgAgg.Merger
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[41,16]
>  cannot find symbol
> [ERROR] symbol:   class AvgSlotAcc
> [ERROR] location: class i.o.s.search.facet.CustomAvgAgg
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[46,12]
>  incompatible types: i.o.s.search.facet.CustomAvgAgg.Merger cannot be 
> converted to org.apache.solr.search.facet.FacetMerger
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[53,5]
>  method does not override or implement a method from a supertype
> [ERROR] 
> /home/markus/projects/openindex/solr/trunk/src/main/java/i.o.s.search/facet/CustomAvgAgg.java:[60,5]
>  method does not override or implement a method from a supertype
> {code}
> It seems lots of classes are tucked away in FacetModule, which we can't reach 
> from outside.
> Originates from this thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201602.mbox/%3ccab_8yd9ldbg_0zxm_h1igkfm6bqeypd5ilyy7tty8cztscv...@mail.gmail.com%3E
>  ( also available at 
> https://lists.apache.org/thread.html/9fddcad3136ec908ce1c57881f8d3069e5d153f08b71f80f3e18d995%401455019826%40%3Csolr-user.lucene.apache.org%3E
>  )



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14986) Restrict the properties possible to define with "property.name=value" when creating a collection

2020-11-06 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227418#comment-17227418
 ] 

Erick Erickson commented on SOLR-14986:
---

I want some opinions on what to do here. Thinking about this, there are two 
sets of values that could potentially conflict with what Solr expects that can 
be specified with *property.name=value*.

One set is all the stuff you can put on the URL. For the CREATE command, things like *numShards*, *shards*, *replicationFactor*. For ADDREPLICA, things like *dataDir* and *type*.

Should the following call fail on the theory that "the user is demonstrating a 
fundamental lack of understanding so let's make them stop and read the manual?"
{code:java}
 admin/collection?action=CREATE&numShards=3&property.numShards=5...
{code}
Even though *numShards* doesn't go into core.properties, the above is weird.

The other set is those entries in core.properties that we put there 
automagically, things like *coreNodeName*, *shard* and the like. A command like
{code:java}
admin/collection?action=ADDREPLICA&property.coreNodeName=someting...
{code}
is dangerous, 'cause if it succeeds and is done more than once, the results would be "interesting". I believe it fails BTW, but not by an explicit check. I can say with certainty that other mixed-up calls like this fail with opaque errors. There's an example in the original description.

I propose to fail both situations with an informative error message very early 
in processing.
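
For illustration only, here is a rough sketch of what such an early check could look like. The method name, reserved-property list, and message wording are assumptions for discussion, not the actual patch:
{code:java}
import java.util.Set;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.SolrParams;

class UserPropertyValidation {
  // Parameters Solr manages itself; user-supplied property.* overrides for these would be rejected.
  private static final Set<String> RESERVED = Set.of(
      "numShards", "shards", "replicationFactor", "coreNodeName",
      "collection.configName", "shard", "collection", "replicaType");

  static void rejectReservedUserProperties(SolrParams params) {
    params.getParameterNamesIterator().forEachRemaining(name -> {
      if (name.startsWith("property.")
          && RESERVED.contains(name.substring("property.".length()))) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            name + " should not be specified explicitly; Solr manages this property automatically");
      }
    });
  }
}
{code}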

> Restrict the properties possible to define with "property.name=value" when 
> creating a collection
> 
>
> Key: SOLR-14986
> URL: https://issues.apache.org/jira/browse/SOLR-14986
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
>
> This came to light when I was looking at two user-list questions where people 
> try to manually define core.properties to define _replicas_ in SolrCloud. 
> There are two related issues:
> 1> You can do things like "action=CREATE&name=eoe&property.collection=blivet" 
> which results in an opaque error about "could not create replica." I 
> propose we return a better error here like "property.collection should not be 
> specified when creating a collection". What do people think about the rest of 
> the auto-created properties on collection creation? 
> coreNodeName
> collection.configName
> name
> numShards
> shard
> collection
> replicaType
> "name" seems to be OK to change, although i don't see anyplace anyone can 
> actually see it afterwards
> 2> Change the ref guide to steer people away from attempting to manually 
> create a core.properties file to define cores/replicas in SolrCloud. There's 
> no warning on the "defining-core-properties.adoc" for instance. Additionally 
> there should be some kind of message on the collections API documentation 
> about not trying to set the properties in <1> on the CREATE command.
> <2> used to actually work (apparently) with legacyCloud...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #1571: SOLR-14560: Interleaving for Learning To Rank

2020-11-06 Thread GitBox


alessandrobenedetti commented on a change in pull request #1571:
URL: https://github.com/apache/lucene-solr/pull/1571#discussion_r518736297



##
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/search/LTRQParserPlugin.java
##
@@ -146,93 +149,114 @@ public LTRQParser(String qstr, SolrParams localParams, SolrParams params,
 @Override
 public Query parse() throws SyntaxError {
   // ReRanking Model
-  final String modelName = localParams.get(LTRQParserPlugin.MODEL);
-  if ((modelName == null) || modelName.isEmpty()) {
+  final String[] modelNames = localParams.getParams(LTRQParserPlugin.MODEL);
+  if ((modelNames == null) || modelNames.length==0 || modelNames[0].isEmpty()) {
 throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
 "Must provide model in the request");
   }
-
-  final LTRScoringModel ltrScoringModel = mr.getModel(modelName);
-  if (ltrScoringModel == null) {
-throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
-"cannot find " + LTRQParserPlugin.MODEL + " " + modelName);
-  }
-
-  final String modelFeatureStoreName = ltrScoringModel.getFeatureStoreName();
-  final boolean extractFeatures = SolrQueryRequestContextUtils.isExtractingFeatures(req);
-  final String fvStoreName = SolrQueryRequestContextUtils.getFvStoreName(req);
-  // Check if features are requested and if the model feature store and feature-transform feature store are the same
-  final boolean featuresRequestedFromSameStore = (modelFeatureStoreName.equals(fvStoreName) || fvStoreName == null) ? extractFeatures:false;
-  if (threadManager != null) {
-threadManager.setExecutor(req.getCore().getCoreContainer().getUpdateShardHandler().getUpdateExecutor());
-  }
-  final LTRScoringQuery scoringQuery = new LTRScoringQuery(ltrScoringModel,
-  extractEFIParams(localParams),
-  featuresRequestedFromSameStore, threadManager);
-
-  // Enable the feature vector caching if we are extracting features, and the features
-  // we requested are the same ones we are reranking with
-  if (featuresRequestedFromSameStore) {
-scoringQuery.setFeatureLogger( SolrQueryRequestContextUtils.getFeatureLogger(req) );
+ 
+  LTRScoringQuery[] rerankingQueries = new LTRScoringQuery[modelNames.length];
+  for (int i = 0; i < modelNames.length; i++) {
+final LTRScoringQuery rerankingQuery;
+if (!ORIGINAL_RANKING.equals(modelNames[i])) {
+  final LTRScoringModel ltrScoringModel = mr.getModel(modelNames[i]);
+  if (ltrScoringModel == null) {
+throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
+"cannot find " + LTRQParserPlugin.MODEL + " " + modelNames[i]);
+  }
+  final String modelFeatureStoreName = ltrScoringModel.getFeatureStoreName();
+  final boolean extractFeatures = SolrQueryRequestContextUtils.isExtractingFeatures(req);
+  final String fvStoreName = SolrQueryRequestContextUtils.getFvStoreName(req);// Check if features are requested and if the model feature store and feature-transform feature store are the same
+  final boolean featuresRequestedFromSameStore = (modelFeatureStoreName.equals(fvStoreName) || fvStoreName == null) ? extractFeatures : false;
+  if (threadManager != null) {
+threadManager.setExecutor(req.getCore().getCoreContainer().getUpdateShardHandler().getUpdateExecutor());
+  }
+  rerankingQuery = new LTRScoringQuery(ltrScoringModel,
+  extractEFIParams(localParams),
+  featuresRequestedFromSameStore, threadManager);
+
+  // Enable the feature vector caching if we are extracting features, and the features
+  // we requested are the same ones we are reranking with
+  if (featuresRequestedFromSameStore) {
+rerankingQuery.setFeatureLogger( SolrQueryRequestContextUtils.getFeatureLogger(req) );
+  }
+}else{
+  rerankingQuery = new LTRScoringQuery(null);
+}
+
+// External features
+rerankingQuery.setRequest(req);
+rerankingQueries[i] = rerankingQuery;
   }
-  SolrQueryRequestContextUtils.setScoringQuery(req, scoringQuery);
 
+  SolrQueryRequestContextUtils.setScoringQuery(req, rerankingQueries);
   int reRankDocs = localParams.getInt(RERANK_DOCS, DEFAULT_RERANK_DOCS);
   if (reRankDocs <= 0) {
 throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
-  "Must rerank at least 1 document");
+"Must rerank at least 1 document");
+  }
+  if (rerankingQueries.length == 1) {
+return new LTRQuery(rerankingQueries[0], reRankDocs);
+  } else {
+return new LTRQuery(rerankingQueries, reRankDocs);
   }
-
-  // External features
-  scoringQuery.setRequest(req);
-
-  return new

[GitHub] [lucene-solr] cpoerschke commented on a change in pull request #1571: SOLR-14560: Interleaving for Learning To Rank

2020-11-06 Thread GitBox


cpoerschke commented on a change in pull request #1571:
URL: https://github.com/apache/lucene-solr/pull/1571#discussion_r518725166



##
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/interleaving/TeamDraftInterleaving.java
##
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.ltr.interleaving;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.LinkedHashSet;
+import java.util.Random;
+import java.util.Set;
+
+import org.apache.lucene.search.ScoreDoc;
+
+public class TeamDraftInterleaving implements Interleaving{
+  public static Random RANDOM;
+
+  static {
+// We try to make things reproducible in the context of our tests by initializing the random instance
+// based on the current seed
+String seed = System.getProperty("tests.seed");

Review comment:
   
   https://github.com/cpoerschke/lucene-solr/commit/63a190b57fdc892b98c546f005edbc096e4e4095 is part 2 of 2: it removes the tests.seed System property use in TeamDraftInterleaving by passing a Random object into TeamDraftInterleaving (see the sketch after the list below). Instead, the tests.seed System property use now lives in a test class (LTRQParserTestPlugin), and tests continue to make their `setRANDOM` calls, but against `LTRQParserTestPlugin.setRANDOM` rather than `TeamDraftInterleaving.setRANDOM`.
   
   Two problems with this though:
   * a forbidden apis check fails (saying that RandomizedRunner's random() 
should be used instead but i'm unclear still on how that might be done)
   * the tests no longer pass, which baffles me though haven't looked at 
details yet; speculations:
 * was the randomness in the tests predictable before and now it's 
predictable also but the number sequence has changed and test expectations need 
to be adjusted to match?
 * something to do with order of system property setting and non-test/test 
class loading perhaps and perhaps use of the RandomizedRunner's random would 
solve it somehow?
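
   For reference, a minimal sketch of the constructor-injection shape (the class name and constructor below are assumptions for illustration, not the actual commit):

```java
import java.util.Random;

// Hypothetical sketch: the interleaving implementation receives its Random
// instead of reading the tests.seed System property itself.
public class TeamDraftInterleavingSketch {

  private final Random random;

  public TeamDraftInterleavingSketch(Random random) {
    this.random = random;
  }

  // Production callers:  new TeamDraftInterleavingSketch(new Random());
  // Test callers:        new TeamDraftInterleavingSketch(new Random(seedFromTestFramework));
}
```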
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #1571: SOLR-14560: Interleaving for Learning To Rank

2020-11-06 Thread GitBox


alessandrobenedetti commented on a change in pull request #1571:
URL: https://github.com/apache/lucene-solr/pull/1571#discussion_r518716331



##
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/LTRInterleavingRescorer.java
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.ltr;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.search.Explanation;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.ScoreDoc;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.TopDocs;
+import org.apache.solr.ltr.interleaving.Interleaving;
+import org.apache.solr.ltr.interleaving.InterleavingResult;
+import org.apache.solr.ltr.interleaving.TeamDraftInterleaving;
+
+/**
+ * Implements the rescoring logic. The top documents returned by solr with their
+ * original scores, will be processed by a {@link LTRScoringQuery} that will assign a
+ * new score to each document. The top documents will be resorted based on the
+ * new score.
+ * */
+public class LTRInterleavingRescorer extends LTRRescorer {
+  
+  LTRScoringQuery[] rerankingQueries;
+  Interleaving interleavingAlgorithm = new TeamDraftInterleaving();

Review comment:
   I added this bit; I think it is easy enough and not too invasive to prepare the structure now.
   
   Adding a new algorithm would be as easy as (see the sketch below):
   - fetching the algorithm name param from the request
   - adding the algorithm name to the switch clause
   - implementing the new algorithm
   
   Take a look if you like it
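
   For illustration, a hedged sketch of that wiring - the factory method name and algorithm label are assumptions, not necessarily what the commit does:

```java
// Hypothetical factory resolving the interleaving algorithm requested on the query.
public static Interleaving fromAlgorithmName(String algorithmName) {
  switch (algorithmName) {
    case "TeamDraft":                            // existing implementation
      return new TeamDraftInterleaving();
    // case "MyNewAlgorithm":                    // step 2: add the name to the switch clause
    //   return new MyNewAlgorithmInterleaving(); // step 3: implement the new algorithm
    default:
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "unknown interleaving algorithm: " + algorithmName);
  }
}
```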





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] muse-dev[bot] commented on a change in pull request #1571: SOLR-14560: Interleaving for Learning To Rank

2020-11-06 Thread GitBox


muse-dev[bot] commented on a change in pull request #1571:
URL: https://github.com/apache/lucene-solr/pull/1571#discussion_r518714285



##
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/response/transform/LTRInterleavingTransformerFactory.java
##
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.ltr.response.transform;
+
+import java.io.IOException;
+import org.apache.solr.common.SolrDocument;
+import org.apache.solr.common.params.SolrParams;
+import org.apache.solr.common.util.NamedList;
+import org.apache.solr.ltr.LTRInterleavingScoringQuery;
+import org.apache.solr.ltr.LTRScoringQuery;
+import org.apache.solr.ltr.SolrQueryRequestContextUtils;
+import org.apache.solr.request.SolrQueryRequest;
+import org.apache.solr.response.ResultContext;
+import org.apache.solr.response.transform.DocTransformer;
+import org.apache.solr.response.transform.TransformerFactory;
+import org.apache.solr.util.SolrPluginUtils;
+
+public class LTRInterleavingTransformerFactory extends TransformerFactory {
+  
+  @Override
+  @SuppressWarnings({"unchecked"})
+  public void init(@SuppressWarnings("rawtypes") NamedList args) {
+super.init(args);
+SolrPluginUtils.invokeSetters(this, args);
+  }
+
+  @Override
+  public DocTransformer create(String name, SolrParams localparams,
+  SolrQueryRequest req) {
+return new InterleavingTransformer(name, req);
+  }
+  
+  class InterleavingTransformer extends DocTransformer {
+
+final private String name;
+final private SolrQueryRequest req;
+
+private LTRScoringQuery[] rerankingQueries;
+
+/**
+ * @param name
+ *  Name of the field to be added in a document representing the
+ *  model picked by the interleaving process
+ */
+public InterleavingTransformer(String name,
+SolrQueryRequest req) {
+  this.name = name;
+  this.req = req;
+}
+
+@Override
+public String getName() {
+  return name;
+}
+
+@Override
+public void setContext(ResultContext context) {
+  super.setContext(context);
+  if (context == null) {
+return;
+  }
+  if (context.getRequest() == null) {
+return;
+  }
+  rerankingQueries = SolrQueryRequestContextUtils.getScoringQueries(req);
+  for (int i = 0; i < rerankingQueries.length; i++) {
+LTRScoringQuery scoringQuery = rerankingQueries[i];
+
+if (scoringQuery.getOriginalQuery() == null) {
+  scoringQuery.setOriginalQuery(context.getQuery());
+}
+if (scoringQuery.getFeatureLogger() == null) {
+  scoringQuery.setFeatureLogger( SolrQueryRequestContextUtils.getFeatureLogger(req) );
+}
+scoringQuery.setRequest(req);
+  }
+}
+
+@Override
+public void transform(SolrDocument doc, int docid, float score)
+throws IOException {
+  implTransform(doc, docid);
+}
+
+@Override
+public void transform(SolrDocument doc, int docid)
+throws IOException {
+  implTransform(doc, docid);
+}
+
+private void implTransform(SolrDocument doc, int docid) {
+  LTRScoringQuery rerankingQuery = null;
+  if (rerankingQueries.length == 1) {
+rerankingQuery = rerankingQueries[0];
+  } else {
+for (LTRScoringQuery query : rerankingQueries) {
+  if (((LTRInterleavingScoringQuery)query).getPickedInterleavingDocIds().contains(docid)) {
+rerankingQuery = query;
+break;
+  }
+}
+  }
+  doc.addField(name, rerankingQuery.getScoringModelName());

Review comment:
   *NULL_DEREFERENCE:*  object `rerankingQuery` last assigned on line 106 could be null and is dereferenced at line 117.
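
   One possible way to address this, sketched directly against the code above (whether to skip or throw here is a design decision for the PR, not something this sketch decides):

```java
private void implTransform(SolrDocument doc, int docid) {
  LTRScoringQuery rerankingQuery = null;
  if (rerankingQueries.length == 1) {
    rerankingQuery = rerankingQueries[0];
  } else {
    for (LTRScoringQuery query : rerankingQueries) {
      if (((LTRInterleavingScoringQuery) query).getPickedInterleavingDocIds().contains(docid)) {
        rerankingQuery = query;
        break;
      }
    }
  }
  // Guard against the case where no interleaved query picked this document.
  if (rerankingQuery == null) {
    return; // or throw, if reaching this point indicates a bug
  }
  doc.addField(name, rerankingQuery.getScoringModelName());
}
```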





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

[jira] [Assigned] (SOLR-6399) Implement unloadCollection in the Collections API

2020-11-06 Thread Shalin Shekhar Mangar (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar reassigned SOLR-6399:
---

Assignee: (was: Shalin Shekhar Mangar)

> Implement unloadCollection in the Collections API
> -
>
> Key: SOLR-6399
> URL: https://issues.apache.org/jira/browse/SOLR-6399
> Project: Solr
>  Issue Type: New Feature
>Reporter: dfdeshom
>Priority: Major
> Fix For: 6.0
>
>
> There is currently no way to unload a collection without deleting its 
> contents. There should be a way in the collections API to unload a collection 
> and reload it later, as needed.
> A use case for this is the following: you store logs by day, with each day 
> having its own collection. You are required to store up to 2 years of data, 
> which adds up to 730 collections.  Most of the time, you'll want to have 3 
> days of data loaded for search. Having just 3 collections loaded into memory, 
> instead of 730 will make managing Solr easier.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #1571: SOLR-14560: Interleaving for Learning To Rank

2020-11-06 Thread GitBox


alessandrobenedetti commented on a change in pull request #1571:
URL: https://github.com/apache/lucene-solr/pull/1571#discussion_r518696688



##
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/search/LTRQParserPlugin.java
##
@@ -146,93 +149,114 @@ public LTRQParser(String qstr, SolrParams localParams, SolrParams params,
 @Override
 public Query parse() throws SyntaxError {
   // ReRanking Model
-  final String modelName = localParams.get(LTRQParserPlugin.MODEL);
-  if ((modelName == null) || modelName.isEmpty()) {
+  final String[] modelNames = localParams.getParams(LTRQParserPlugin.MODEL);
+  if ((modelNames == null) || modelNames.length==0 || modelNames[0].isEmpty()) {
 throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
 "Must provide model in the request");
   }
-
-  final LTRScoringModel ltrScoringModel = mr.getModel(modelName);
-  if (ltrScoringModel == null) {
-throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
-"cannot find " + LTRQParserPlugin.MODEL + " " + modelName);
-  }
-
-  final String modelFeatureStoreName = ltrScoringModel.getFeatureStoreName();
-  final boolean extractFeatures = SolrQueryRequestContextUtils.isExtractingFeatures(req);
-  final String fvStoreName = SolrQueryRequestContextUtils.getFvStoreName(req);
-  // Check if features are requested and if the model feature store and feature-transform feature store are the same
-  final boolean featuresRequestedFromSameStore = (modelFeatureStoreName.equals(fvStoreName) || fvStoreName == null) ? extractFeatures:false;
-  if (threadManager != null) {
-threadManager.setExecutor(req.getCore().getCoreContainer().getUpdateShardHandler().getUpdateExecutor());
-  }
-  final LTRScoringQuery scoringQuery = new LTRScoringQuery(ltrScoringModel,
-  extractEFIParams(localParams),
-  featuresRequestedFromSameStore, threadManager);
-
-  // Enable the feature vector caching if we are extracting features, and the features
-  // we requested are the same ones we are reranking with
-  if (featuresRequestedFromSameStore) {
-scoringQuery.setFeatureLogger( SolrQueryRequestContextUtils.getFeatureLogger(req) );
+ 
+  LTRScoringQuery[] rerankingQueries = new LTRScoringQuery[modelNames.length];
+  for (int i = 0; i < modelNames.length; i++) {
+final LTRScoringQuery rerankingQuery;
+if (!ORIGINAL_RANKING.equals(modelNames[i])) {
+  final LTRScoringModel ltrScoringModel = mr.getModel(modelNames[i]);
+  if (ltrScoringModel == null) {
+throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
+"cannot find " + LTRQParserPlugin.MODEL + " " + modelNames[i]);
+  }
+  final String modelFeatureStoreName = ltrScoringModel.getFeatureStoreName();
+  final boolean extractFeatures = SolrQueryRequestContextUtils.isExtractingFeatures(req);
+  final String fvStoreName = SolrQueryRequestContextUtils.getFvStoreName(req);// Check if features are requested and if the model feature store and feature-transform feature store are the same
+  final boolean featuresRequestedFromSameStore = (modelFeatureStoreName.equals(fvStoreName) || fvStoreName == null) ? extractFeatures : false;
+  if (threadManager != null) {
+threadManager.setExecutor(req.getCore().getCoreContainer().getUpdateShardHandler().getUpdateExecutor());
+  }
+  rerankingQuery = new LTRScoringQuery(ltrScoringModel,
+  extractEFIParams(localParams),
+  featuresRequestedFromSameStore, threadManager);
+
+  // Enable the feature vector caching if we are extracting features, and the features
+  // we requested are the same ones we are reranking with
+  if (featuresRequestedFromSameStore) {
+rerankingQuery.setFeatureLogger( SolrQueryRequestContextUtils.getFeatureLogger(req) );
+  }
+}else{
+  rerankingQuery = new LTRScoringQuery(null);
+}
+
+// External features
+rerankingQuery.setRequest(req);
+rerankingQueries[i] = rerankingQuery;
   }
-  SolrQueryRequestContextUtils.setScoringQuery(req, scoringQuery);
 
+  SolrQueryRequestContextUtils.setScoringQuery(req, rerankingQueries);
   int reRankDocs = localParams.getInt(RERANK_DOCS, DEFAULT_RERANK_DOCS);
   if (reRankDocs <= 0) {
 throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
-  "Must rerank at least 1 document");
+"Must rerank at least 1 document");
+  }
+  if (rerankingQueries.length == 1) {
+return new LTRQuery(rerankingQueries[0], reRankDocs);
+  } else {
+return new LTRQuery(rerankingQueries, reRankDocs);
   }
-
-  // External features
-  scoringQuery.setRequest(req);
-
-  return new

[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #1571: SOLR-14560: Interleaving for Learning To Rank

2020-11-06 Thread GitBox


alessandrobenedetti commented on a change in pull request #1571:
URL: https://github.com/apache/lucene-solr/pull/1571#discussion_r518665886



##
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/interleaving/TeamDraftInterleaving.java
##
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.ltr.interleaving;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.LinkedHashSet;
+import java.util.Random;
+import java.util.Set;
+
+import org.apache.lucene.search.ScoreDoc;
+
+public class TeamDraftInterleaving implements Interleaving{

Review comment:
   Updated the javadocs. In regards to the index out of bounds, I think we should raise the exception instead of adding the condition: if the assumptions are not met, there is a bug to be fixed, so we need interleaving to fail fast instead of silently operating wrong - am I missing anything? (A minimal sketch of the fail-fast option is below.)
   Please take a look at my new commit, especially the comments.
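
   A minimal illustration of the fail-fast option (the variable names are made up for the example, not from the patch):

```java
// Fail loudly when the interleaving invariant is violated, rather than silently
// producing a shorter (and wrong) interleaved list.
if (interleavedResults.size() != expectedTopN) {
  throw new IllegalStateException("Team draft interleaving produced "
      + interleavedResults.size() + " documents, expected " + expectedTopN);
}
```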





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] cpoerschke commented on a change in pull request #1571: SOLR-14560: Interleaving for Learning To Rank

2020-11-06 Thread GitBox


cpoerschke commented on a change in pull request #1571:
URL: https://github.com/apache/lucene-solr/pull/1571#discussion_r518659433



##
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/interleaving/TeamDraftInterleaving.java
##
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.ltr.interleaving;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.LinkedHashSet;
+import java.util.Random;
+import java.util.Set;
+
+import org.apache.lucene.search.ScoreDoc;
+
+public class TeamDraftInterleaving implements Interleaving{
+  public static Random RANDOM;
+
+  static {
+// We try to make things reproducible in the context of our tests by initializing the random instance
+// based on the current seed
+String seed = System.getProperty("tests.seed");

Review comment:
   Yes, predictable randomness in tests can be tricky! I had an (unplanned) 
idea around this at breakfast this morning, trying it out now: 
https://github.com/cpoerschke/lucene-solr/commit/5fc99898f9bc82f4dabcfe188e132ba2bc727704
 is part 1 of 2 (and is independent technically speaking).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] alessandrobenedetti commented on a change in pull request #1571: SOLR-14560: Interleaving for Learning To Rank

2020-11-06 Thread GitBox


alessandrobenedetti commented on a change in pull request #1571:
URL: https://github.com/apache/lucene-solr/pull/1571#discussion_r518653490



##
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/search/LTRQParserPlugin.java
##
@@ -146,93 +149,114 @@ public LTRQParser(String qstr, SolrParams localParams, SolrParams params,
 @Override
 public Query parse() throws SyntaxError {

Review comment:
   As you will see in my next commit, I added the empty check in the loop and a couple of tests for it. I leave this open; feel free to resolve this once you take a look.
   
   If we don't check for empty, we will get an exception anyway saying:
   "cannot find model"
   It's up to us whether we want to help the user more with an explicit empty-model message.
   I think it is fine: stating explicitly in the exception that the model name is empty could save time for developers/integrators who would otherwise not understand why a model cannot be found (see the sketch below).
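
   Roughly, something along these lines inside the loop over modelNames (the exact message wording is whatever the commit settles on):

```java
if (modelNames[i] == null || modelNames[i].isEmpty()) {
  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
      "the " + LTRQParserPlugin.MODEL + " at position " + i + " is empty");
}
```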





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org