[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-03 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050967#comment-17050967
 ] 

Adrien Grand commented on LUCENE-8962:
--

FYI this test seems to fail too because of this change:

{noformat}
NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriterExceptions2 
-Dtests.method=testBasics -Dtests.seed=1F5EDA9C353C018F -Dtests.slow=true 
-Dtests.badapples=true -Dtests.locale=kkj-CM -Dtests.timezone=Etc/UCT 
-Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
{noformat}

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will write and accumulate many small segments during {{refresh}}, and this 
> then adds search-time cost, as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off the merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!
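[Editor's note] The refresh-time selection step the issue describes can be illustrated with a small standalone sketch. This is hypothetical: it uses plain collections rather than Lucene's {{MergePolicy}} API, and the class name and size threshold are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Standalone sketch of the selection step a refresh-time merge policy might
// perform: segments below `thresholdBytes` are grouped into one merge
// candidate; everything else is left to the regular merge policy.
// Illustrative only -- this does not use real Lucene classes.
public class SmallSegmentSelector {

  /** Returns the subset of segments (name -> size in bytes) that a
   *  refresh-time merge should coalesce. */
  public static List<String> selectSmallSegments(Map<String, Long> segmentSizes,
                                                 long thresholdBytes) {
    List<String> candidates = new ArrayList<>();
    for (Map.Entry<String, Long> e : segmentSizes.entrySet()) {
      if (e.getValue() < thresholdBytes) {
        candidates.add(e.getKey());
      }
    }
    // Only merge when there is more than one small segment; merging a single
    // segment would just rewrite it for no search-time benefit.
    return candidates.size() > 1 ? candidates : List.of();
  }
}
```

The tricky part the issue raises -- excluding segments flushed *while* this merge runs from the point-in-time refresh -- is not captured here; this only shows the "below some threshold" grouping.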



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050958#comment-17050958
 ] 

Noble Paul commented on SOLR-13942:
---

{quote}The other question is how we can secure it –
{quote}
I have chosen to give it the same permission as {{collection-admin-read}}. 
Users can set appropriate ACLs to control who gets access.
{code:java}
/** Exposes the contents of ZooKeeper. */
@EndPoint(path = "/cluster/zk/*",
    method = SolrRequest.METHOD.GET,
    permission = COLL_READ_PERM)
public class ZkRead {
{code}

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> example
> download the {{state.json}} of
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> get a list of all children under {{/live_nodes}}
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children show the list of child nodes 
> and their meta data






[GitHub] [lucene-solr] atris edited a comment on issue #1303: LUCENE-9114: Improve ValueSourceScorer's Default Cost Implementation

2020-03-03 Thread GitBox
atris edited a comment on issue #1303: LUCENE-9114: Improve ValueSourceScorer's 
Default Cost Implementation
URL: https://github.com/apache/lucene-solr/pull/1303#issuecomment-594352993
 
 
   > OH; an idea occurred to me. We don't actually need the cost to be mutable 
(which wasn't so pretty), we just need a matchCost that a ValueSourceScorer can 
choose to provide if it wants (which you just did). This way if someone (like 
Solr) wants to force the cost, it could wrap the original ValueSource to supply 
to FunctionRangeQuery -- one with a FunctionValues that overrides 
getRangeScorer to wrap a VSC subclass that specifies the cost. I know this is more 
code than the mutable cost, but it isn't as bad as what I was originally fearing: 
having to totally black-box wrap a query, weight, scorer, etc. Perhaps this 
wrapping might even be a static convenience method on FunctionRangeQuery that 
takes in a VS & cost and returns a VS that is only to be used by FRQ.
   
   @dsmiley , Agreed.
   
   I believe that the semantics here are completely driven from the use cases 
-- and yours is the use case we are chasing right now :) If the tradeoff of 
defining a new VS which overrides the cost method is fine, I am more than happy 
to remove the mutable cost. I have raised an iteration to that effect; please 
take a look and let me know your thoughts.
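[Editor's note] The wrapping idea quoted above -- a ValueSource wrapper whose FunctionValues reports a caller-supplied cost -- can be sketched with simplified stand-in interfaces. None of these names are the real Lucene API; a real wrapper would also delegate every other FunctionValues method.

```java
// Sketch of the quoted idea: wrap a ValueSource-like object so that its
// FunctionValues reports a forced cost, which a downstream scorer that adds
// values.cost() into matchCost() would then observe. The interfaces here are
// simplified stand-ins, not Lucene's ValueSource/FunctionValues.
public class CostWrappingSketch {

  public interface FunctionValuesLike { float cost(); }
  public interface ValueSourceLike { FunctionValuesLike getValues(); }

  /** Returns a ValueSource-like object identical to `in` except that its
   *  values report the forced cost. (A real wrapper would delegate the rest
   *  of the FunctionValues surface to in.getValues().) */
  public static ValueSourceLike withCost(ValueSourceLike in, float cost) {
    return () -> () -> cost;
  }

  public static void main(String[] args) {
    ValueSourceLike original = () -> () -> 10f;
    System.out.println(original.getValues().cost());               // 10.0
    System.out.println(withCost(original, 3f).getValues().cost()); // 3.0
  }
}
```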


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] atris commented on issue #1303: LUCENE-9114: Improve ValueSourceScorer's Default Cost Implementation

2020-03-03 Thread GitBox
atris commented on issue #1303: LUCENE-9114: Improve ValueSourceScorer's 
Default Cost Implementation
URL: https://github.com/apache/lucene-solr/pull/1303#issuecomment-594352993
 
 
   > OH; an idea occurred to me. We don't actually need the cost to be mutable 
(which wasn't so pretty), we just need a matchCost that a ValueSourceScorer can 
choose to provide if it wants (which you just did). This way if someone (like 
Solr) wants to force the cost, it could wrap the original ValueSource to supply 
to FunctionRangeQuery -- one with a FunctionValues that overrides 
getRangeScorer to wrap a VSC subclass that specifies the cost. I know this is more 
code than the mutable cost, but it isn't as bad as what I was originally fearing: 
having to totally black-box wrap a query, weight, scorer, etc. Perhaps this 
wrapping might even be a static convenience method on FunctionRangeQuery that 
takes in a VS & cost and returns a VS that is only to be used by FRQ.
   
   Agreed.
   
   I believe that the semantics here are completely driven from the use cases 
-- and yours is the use case we are chasing right now :) If the tradeoff of 
defining a new VS which overrides the cost method is fine, I am more than happy 
to remove the mutable cost. I have raised an iteration to that effect; please 
take a look and let me know your thoughts.





[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-03 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050924#comment-17050924
 ] 

Tomoko Uchida commented on LUCENE-9136:
---

[~jtibshirani] [~irvingzhang] thanks for your hard work here!
{quote}I was thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations.
{quote}
Actually, the first implementation (by Michael Sokolov) for HNSW wrapped 
DocValuesFormat to avoid code duplication. However, this approach - reusing 
existing code - could raise another concern from the maintenance perspective. 
(From the beginning, Adrien Grand suggested a dedicated format instead of 
hacking doc values.) This is the main reason why I introduced a new format for 
knn search in LUCENE-9004.

I'm not strongly against the "reusing existing format" strategy if it's the 
best way here; I just want to share my feeling that it could be a bit 
controversial, and you might need to convince maintainers that this (pretty 
new) feature does not cause any problems/concerns for future maintenance of 
Lucene core if you implement it on existing formats/readers.

I have not closely looked at your PR yet - sorry if my comments are completely 
beside the point (you might already have talked with other committers about the 
implementation in another channel, e.g. private chats?).

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, 
> image-2020-02-16-15-05-02-451.png
>
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention lately 
> with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects or to use for those who are not familiar with 
> C/C++ [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as the KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue draws the attention 
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. 
> As an alternative for solving ANN similarity search problems, IVFFlat is also 
> very popular with many users and supporters. Compared with HNSW, IVFFlat has 
> smaller index size but requires k-means clustering, while HNSW is faster in 
> query (no training required) but requires extra storage for saving graphs 
> [indexing 1M 
> vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors].
>  Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW and 
> IVFFlat) for potential users who face very different scenarios and want more 
> choices.
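[Editor's note] The IVFFlat query path described above -- rank centroids by distance to the query, probe the {{nprobe}} nearest clusters, then scan only those clusters exhaustively -- can be sketched in a few lines. This is illustrative only (plain Java, squared Euclidean distance, k-means training assumed already done), not the proposed Lucene implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of the IVFFlat query path: find the `nprobe` nearest
// centroids, then do an exhaustive ("flat") scan over only the vectors
// assigned to those clusters. Training (k-means) is assumed done elsewhere.
public class IvfFlatSketch {

  static double squaredDistance(float[] a, float[] b) {
    double d = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      d += diff * diff;
    }
    return d;
  }

  /** clusters.get(i) holds the vectors assigned to centroids.get(i). */
  public static float[] search(float[] query, List<float[]> centroids,
                               List<List<float[]>> clusters, int nprobe) {
    // Rank centroids by distance to the query; probe only the nprobe closest.
    List<Integer> order = new ArrayList<>();
    for (int i = 0; i < centroids.size(); i++) order.add(i);
    order.sort(Comparator.comparingDouble(i -> squaredDistance(query, centroids.get(i))));

    // Exhaustive scan restricted to the probed clusters. Raising nprobe
    // trades query latency for recall, as noted in the description.
    float[] best = null;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int p = 0; p < Math.min(nprobe, order.size()); p++) {
      for (float[] v : clusters.get(order.get(p))) {
        double d = squaredDistance(query, v);
        if (d < bestDist) { bestDist = d; best = v; }
      }
    }
    return best;
  }
}
```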
> The latest branch is 
> [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-an

[GitHub] [lucene-solr] atris commented on a change in pull request #1303: LUCENE-9114: Improve ValueSourceScorer's Default Cost Implementation

2020-03-03 Thread GitBox
atris commented on a change in pull request #1303: LUCENE-9114: Improve 
ValueSourceScorer's Default Cost Implementation
URL: https://github.com/apache/lucene-solr/pull/1303#discussion_r387451155
 
 

 ##
 File path: 
lucene/queries/src/java/org/apache/lucene/queries/function/FunctionValues.java
 ##
 @@ -90,6 +93,9 @@ public int ordVal(int doc) throws IOException {
* @return the number of unique sort ordinals this instance has
*/
   public int numOrd() { throw new UnsupportedOperationException(); }
+
 
 Review comment:
   Used your recommendation, thanks





[GitHub] [lucene-solr] atris commented on a change in pull request #1303: LUCENE-9114: Improve ValueSourceScorer's Default Cost Implementation

2020-03-03 Thread GitBox
atris commented on a change in pull request #1303: LUCENE-9114: Improve 
ValueSourceScorer's Default Cost Implementation
URL: https://github.com/apache/lucene-solr/pull/1303#discussion_r387449993
 
 

 ##
 File path: 
lucene/queries/src/java/org/apache/lucene/queries/function/ValueSourceScorer.java
 ##
 @@ -55,7 +59,13 @@ public boolean matches() throws IOException {
 
   @Override
   public float matchCost() {
-return 100; // TODO: use cost of ValueSourceScorer.this.matches()
+// If an external cost is set, use that
+if (externallyMutableCost != 0.0) {
+  return externallyMutableCost;
+}
+
+// Cost of iteration is fixed cost + cost exposed by delegated 
FunctionValues instance
+return DEF_COST + values.cost();
 
 Review comment:
   I felt that DEF_COST defines the cost of the VSC.matches call itself, and 
that if the user prefers a more complex cost, they can override matchCost() :) 
I agree with your approach; I have added the cost evaluation method as a 
separate class method.





[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1303: LUCENE-9114: Improve ValueSourceScorer's Default Cost Implementation

2020-03-03 Thread GitBox
dsmiley commented on a change in pull request #1303: LUCENE-9114: Improve 
ValueSourceScorer's Default Cost Implementation
URL: https://github.com/apache/lucene-solr/pull/1303#discussion_r387438580
 
 

 ##
 File path: 
lucene/queries/src/java/org/apache/lucene/queries/function/ValueSourceScorer.java
 ##
 @@ -55,7 +59,13 @@ public boolean matches() throws IOException {
 
   @Override
   public float matchCost() {
-return 100; // TODO: use cost of ValueSourceScorer.this.matches()
+// If an external cost is set, use that
+if (externallyMutableCost != 0.0) {
+  return externallyMutableCost;
+}
+
+// Cost of iteration is fixed cost + cost exposed by delegated 
FunctionValues instance
+return DEF_COST + values.cost();
 
 Review comment:
   I suppose the purpose of DEF_COST here is to add on the cost of the 
ValueSourceScorer.matches code that is separate from fetching the value from 
the FunctionValues.  That cost varies by the VSC subclass.  If we want to be 
thorough then maybe _this_ should be settable somehow _as well_?   One easy way 
to do this is to simply refactor out this one line (def cost + FV cost) into a 
protected method on the VSC that may be overridden if desired.   Or we could 
just say that is too pedantic, and prefer simplicity.  I'm fine either way.
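[Editor's note] The refactor suggested here -- pulling the default cost computation (fixed cost + FunctionValues cost) into a protected, overridable method -- might look like the following self-contained sketch. The names and the DEF_COST value mirror the diff, but these classes are simplified stand-ins, not the real Lucene ValueSourceScorer.

```java
// Simplified stand-in showing the suggested refactor: the default matchCost
// computation lives in a protected method that a subclass may override.
public class MatchCostSketch {

  public interface FunctionValuesLike { float cost(); }

  public static class ScorerLike {
    private static final int DEF_COST = 5; // fixed cost of matches() itself
    private final FunctionValuesLike values;
    public ScorerLike(FunctionValuesLike values) { this.values = values; }

    // Overridable default: fixed per-iteration cost plus the cost exposed by
    // the delegated FunctionValues, as in the diff under review.
    protected float costEvaluationFunction() { return DEF_COST + values.cost(); }

    public float matchCost() { return costEvaluationFunction(); }
  }

  public static void main(String[] args) {
    ScorerLike plain = new ScorerLike(() -> 10f);
    System.out.println(plain.matchCost()); // 15.0

    // A subclass (the "settable somehow as well" case) overrides the
    // evaluation to force a different cost.
    ScorerLike forced = new ScorerLike(() -> 10f) {
      @Override protected float costEvaluationFunction() { return 42f; }
    };
    System.out.println(forced.matchCost()); // 42.0
  }
}
```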





[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1303: LUCENE-9114: Improve ValueSourceScorer's Default Cost Implementation

2020-03-03 Thread GitBox
dsmiley commented on a change in pull request #1303: LUCENE-9114: Improve 
ValueSourceScorer's Default Cost Implementation
URL: https://github.com/apache/lucene-solr/pull/1303#discussion_r387435286
 
 

 ##
 File path: 
lucene/queries/src/java/org/apache/lucene/queries/function/FunctionValues.java
 ##
 @@ -90,6 +93,9 @@ public int ordVal(int doc) throws IOException {
* @return the number of unique sort ordinals this instance has
*/
   public int numOrd() { throw new UnsupportedOperationException(); }
+
 
 Review comment:
   Really needs javadoc explaining what this is.  See TPI.matchCost for 
inspiration.  I suggest:
   
   > An estimate of the expected cost to return a value for a document.
   > It's intended to be used by TwoPhaseIterator.matchCost implementations.
   > Returns an expected cost in number of simple operations like addition, 
multiplication,
   > comparing two numbers and indexing an array.
   > The returned value must be positive.





[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1303: LUCENE-9114: Improve ValueSourceScorer's Default Cost Implementation

2020-03-03 Thread GitBox
dsmiley commented on a change in pull request #1303: LUCENE-9114: Improve 
ValueSourceScorer's Default Cost Implementation
URL: https://github.com/apache/lucene-solr/pull/1303#discussion_r387438763
 
 

 ##
 File path: 
lucene/queries/src/java/org/apache/lucene/queries/function/ValueSourceScorer.java
 ##
 @@ -94,4 +104,12 @@ public float getMaxScore(int upTo) throws IOException {
 return Float.POSITIVE_INFINITY;
   }
 
+  /**
+   * Used to externally set a mutable cost for this instance. If set, this 
cost gets preference over the FunctionValues's cost
+   *
+   * @lucene.experimental
+   */
+  public void setExternallyMutableCost(float cost) {
 
 Review comment:
   Why not simply setMatchCost ?  It's apparent it's "mutable", and it's public 
and thus "externally" so :-)
   I think the javadocs should reference TPI.matchCost.





[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1303: LUCENE-9114: Improve ValueSourceScorer's Default Cost Implementation

2020-03-03 Thread GitBox
dsmiley commented on a change in pull request #1303: LUCENE-9114: Improve 
ValueSourceScorer's Default Cost Implementation
URL: https://github.com/apache/lucene-solr/pull/1303#discussion_r387436290
 
 

 ##
 File path: 
lucene/queries/src/java/org/apache/lucene/queries/function/ValueSourceScorer.java
 ##
 @@ -39,9 +39,13 @@
  * @lucene.experimental
  */
 public abstract class ValueSourceScorer extends Scorer {
+  // Fixed cost for a single iteration of the TwoPhaseIterator instance
+  private static final int DEF_COST = 5;
+
   protected final FunctionValues values;
   private final TwoPhaseIterator twoPhaseIterator;
   private final DocIdSetIterator disi;
+  private float externallyMutableCost;
 
 Review comment:
   Or we could use a Float object to more clearly show as user-settable via 
non-null?
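[Editor's note] The two sentinel conventions under discussion can be contrasted in a small sketch (field and method names here are illustrative, not the PR's code): a primitive float field must reserve 0.0f to mean "unset", so a legitimate caller-supplied cost of 0 is indistinguishable from no caller at all, whereas a nullable Float makes "unset" explicit.

```java
// Contrast of the sentinel conventions discussed in the review. With a
// nullable Float, "not set" is represented by null rather than by the
// magic value 0.0f, so an explicit zero cost is honored. Illustrative only.
public class CostFieldSketch {
  private Float externalCost; // null == not set by a caller

  public void setMatchCost(float cost) { this.externalCost = cost; }

  public float matchCost(float defaultCost) {
    return externalCost != null ? externalCost : defaultCost;
  }

  public static void main(String[] args) {
    CostFieldSketch s = new CostFieldSketch();
    System.out.println(s.matchCost(100f)); // 100.0 -- falls back to default

    s.setMatchCost(0f); // an explicit zero survives, unlike a 0.0f sentinel
    System.out.println(s.matchCost(100f)); // 0.0
  }
}
```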





[GitHub] [lucene-solr] atris commented on issue #1303: LUCENE-9114: Improve ValueSourceScorer's Default Cost Implementation

2020-03-03 Thread GitBox
atris commented on issue #1303: LUCENE-9114: Improve ValueSourceScorer's 
Default Cost Implementation
URL: https://github.com/apache/lucene-solr/pull/1303#issuecomment-594318472
 
 
   @dsmiley Any thoughts on this one?





[jira] [Comment Edited] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050681#comment-17050681
 ] 

Noble Paul edited comment on SOLR-13942 at 3/4/20 3:59 AM:
---

Thanks [~shalin]

I like the "Here's a non-exhaustive list of things you need to troubleshoot 
Solr:"

As a person who is asked to troubleshoot cluster issues at odd hours, I 
appreciate these things you mentioned. I realize how miserable we make the life 
of ops guys. 

I totally endorse #3 as the way forward. This should be an 
unsupported/undocumented API. We should deprecate the {{/admin/zookeeper}} 
endpoint.


was (Author: noble.paul):
Thanks [~shalin]

I like the "Here's a non-exhaustive list of things you need to troubleshoot 
Solr:"

As a person who is asked to troubleshoot cluster issues at odd hours, I realize 
these things you mentioned. I realize how miserable we make the life of ops 
guys. 

I totally endorse #3 as the way going forward. This should be an 
unsupported/undocumented API. We should deprecate the {{/admin/zookeeper}} 
endpoint




[jira] [Comment Edited] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050681#comment-17050681
 ] 

Noble Paul edited comment on SOLR-13942 at 3/4/20 1:35 AM:
---

Thanks [~shalin]

I like the "Here's a non-exhaustive list of things you need to troubleshoot 
Solr:"

As a person who is asked to troubleshoot cluster issues at odd hours, I realize 
these things you mentioned. I realize how miserable we make the life of ops 
guys. 

I totally endorse #3 as the way going forward. This should be an 
unsupported/undocumented API. We should deprecate the {{/admin/zookeeper}} 
endpoint


was (Author: noble.paul):
Thanks [~shalin] 

I like the "Here's a non-exhaustive list of things you need to troubleshoot 
Solr:"

As a person who is asked to troubleshoot cluster issues at odd hours, I realize 
these things you mentioned. I realize how miserable we make the life of ops 
guys. 




[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050681#comment-17050681
 ] 

Noble Paul commented on SOLR-13942:
---

Thanks [~shalin] 

I like the "Here's a non-exhaustive list of things you need to troubleshoot 
Solr:"

As a person who is asked to troubleshoot cluster issues at odd hours, I realize 
these things you mentioned. I realize how miserable we make the life of ops 
guys. 




[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050677#comment-17050677
 ] 

Shalin Shekhar Mangar commented on SOLR-13942:
--

As someone who runs a managed search service and has to troubleshoot Solr 
issues, I want to add my 2 cents.

There's plenty of information that is required for troubleshooting but is not 
available in clusterstatus or any other documented/public API. Sure there's the 
undocumented /admin/zookeeper which has a weird output format meant for I don't 
know who. But even that is missing a few things that I've found necessary to 
troubleshoot Solr.

Here's a non-exhaustive list of things you need to troubleshoot Solr:
# Length of overseer queues (available in overseerstatus API)
# Contents of overseer queue (mildly useful, available in /admin/zookeeper)
# Overseer election queue and current leader (former is available in 
/admin/zookeeper and latter in overseer status)
# Cluster state (cluster status API)
# Solr.xml (no API regardless of whether it is in ZK or filesystem)
# Leader election queue and current leader for each shard (available in 
/admin/zookeeper)
# Shard terms for each shard/replica (not available in any API)
# Metrics/stats (metrics API)
# Solr Logs (log API? unless it is rolled over)
# GC logs (no API)

The overseerstatus API cannot be hit if there is no overseer so there's that 
too.

We run ZK and Solr inside kubernetes and we do not expose zookeeper publicly. 
So, to use a tool like zkcli means we have to port forward directly to the zk 
node which needs explicit privileges. Ideally we want to hit everything over 
http and never allow port forward privileges to anyone.

So I see the following options:
# Add missing information that is inside ZK (shard terms) to /admin/zookeeper 
and continue to live with its horrible output
# Immediately change /admin/zookeeper to a better output format and change the 
UI to consume this new format
# Deprecate /admin/zookeeper, introduce a clean API, migrate UI to this new 
endpoint or a better alternative and remove /admin/zookeeper in 9.0
# Not do anything and force people to use zkcli and existing solr apis for 
troubleshooting as we've been doing till now

My vote is to go with #3, and we can debate what we want to call the API and 
whether it should be a public, documented, supported API or an undocumented API 
like /admin/zookeeper. My preference is to keep this undocumented and 
unsupported just like /admin/zookeeper. The other question is how we can secure 
it -- is it enough to be the same as /admin/zookeeper from a security 
perspective?




[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-03 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050663#comment-17050663
 ] 

Julie Tibshirani edited comment on LUCENE-9136 at 3/4/20 12:39 AM:
---

Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was 
thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the approach: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is a major 
concern, and it will be important to think about how it can be improved.


was (Author: jtibshirani):
Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was 
thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the approach: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is quite 
high, and it will be important to think about how it can be improved.

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, 
> image-2020-02-16-15-05-02-451.png
>
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plans to support a Java interface, making them 
> hard to integrate into Java projects or to use for those who are not familiar 
> with C/C++ [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it

[jira] [Comment Edited] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050656#comment-17050656
 ] 

Noble Paul edited comment on SOLR-13942 at 3/4/20 12:38 AM:


{quote}The PR was created and merged in exactly 15 minutes.
{quote}
The original branch has been there for 3.5 months. I had to create a new one 
because git merge was not succeeding for some strange reason. 
{quote}The "data" inside ZooKeeper is cluster metadata that's subject to 
change without compatibility guarantees or any warning. By providing an API 
that people start building on top of, you need to start supporting 
compatibility.
{quote}
Data in Zookeeper is public data. It is not supposed to be fixed. Everyone 
knows this. Do people write tools by reading data from Zk today? I'm sure they 
do. Are they aware it can change? Yes. Why do they do it? They have no other 
choice.

Tell me one good reason why I should not remove the {{/admin/zookeeper}} 
endpoint; if there is none, we already have consensus. Every single point that 
you raised against this is valid for the existing API as well. 
{quote}Can we start here? What's the information that you are looking to find 
that belongs to a Solr API and it's not part of one today?
{quote}
Yes, let's start with the overseer queue and the leader election queue. 
{quote}But you see that the name of the endpoint is "zk", right? Or am I 
missing something?
{quote}
The name does not matter. You can just use "/sharedclusterdata" as the prefix 
and it will be just fine


was (Author: noble.paul):
{quote}The PR was created and merged in exactly 15 minutes.
{quote}
The original branch has been there for 3.5 months. I had to create a new one 
because git merge was not succeeding for some strange reason. 
{quote}The "data" inside ZooKeeper is cluster metadata that's subject to 
change without compatibility guarantees or any warning. By providing an API 
that people start building on top of, you need to start supporting 
compatibility.
{quote}
Data in Zookeeper is public data. It is not supposed to be fixed. Everyone 
knows this. Do people write tools by reading data from Zk today? I'm sure they 
do. Are they aware it can change? Yes. Why do they do it? They have no other 
choice.
{quote}Can we start here? What's the information that you are looking to find 
that belongs to a Solr API and it's not part of one today?
{quote}
Yes, let's start with the overseer queue and the leader election queue. 
{quote}But you see that the name of the endpoint is "zk", right? Or am I 
missing something?
{quote}
The name does not matter. You can just use "/sharedclusterdata" as the prefix 
and it will be just fine

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Example:
> Download the {{state.json}} of a collection:
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> Get a list of all children under {{/live_nodes}}:
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children show the list of child nodes 
> and their meta data






[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-03 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050663#comment-17050663
 ] 

Julie Tibshirani edited comment on LUCENE-9136 at 3/4/20 12:38 AM:
---

Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was 
thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the idea: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is quite 
high, and it will be important to think about how it can be improved.


was (Author: jtibshirani):
Hello @Xin-Chun Zhang, thanks for the updates and for trying out the idea. I 
was thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the idea: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is quite 
high, and it will be important to think about how it can be improved.

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, 
> image-2020-02-16-15-05-02-451.png
>
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plans to support a Java interface, making them 
> hard to integrate into Java projects or to use for those who are not familiar 
> with C/C++ [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for 

[jira] [Comment Edited] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-03 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050663#comment-17050663
 ] 

Julie Tibshirani edited comment on LUCENE-9136 at 3/4/20 12:38 AM:
---

Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was 
thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the approach: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is quite 
high, and it will be important to think about how it can be improved.


was (Author: jtibshirani):
Hello [~irvingzhang], thanks for the updates and for trying out the idea. I was 
thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the idea: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is quite 
high, and it will be important to think about how it can be improved.

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, 
> image-2020-02-16-15-05-02-451.png
>
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plans to support a Java interface, making them 
> hard to integrate into Java projects or to use for those who are not familiar 
> with C/C++ [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard f

[jira] [Commented] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search

2020-03-03 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050663#comment-17050663
 ] 

Julie Tibshirani commented on LUCENE-9136:
--

Hello @Xin-Chun Zhang, thanks for the updates and for trying out the idea. I 
was thinking we could actually reuse the existing `PostingsFormat` and 
`DocValuesFormat` implementations. This would have a few advantages:
* It simplifies the implementation, since we avoid duplicating a good chunk of 
codec writing + reading logic.
* I agree with your point that we don’t expect the cluster information to take 
up too much memory, since it just contains a map from centroid to the IDs of 
documents that belong to the cluster. I think there are still benefits to 
keeping the main data structures off-heap -- we’d be better able to scale to 
large numbers of documents, especially when multiple vector fields are defined 
at once. There is also no ‘loading’ step, where the data must be read into an 
on-heap data structure before it can be used in queries.

I created this draft PR to sketch out the idea: 
https://github.com/apache/lucene-solr/pull/1314. I’m looking forward to hearing 
your thoughts! I included some benchmarks, and the QPS looks okay. The indexing 
time is ~1 hour for 1 million points -- as we discussed earlier this is quite 
high, and it will be important to think about how it can be improved.

> Introduce IVFFlat to Lucene for ANN similarity search
> -
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Xin-Chun Zhang
>Priority: Major
> Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, 
> image-2020-02-16-15-05-02-451.png
>
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. By embedding the data 
> into a high dimensional vector, the vector retrieval (VR) method is then 
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plans to support a Java interface, making them 
> hard to integrate into Java projects or to use for those who are not familiar 
> with C/C++ [[https://github.com/facebookresearch/faiss/issues/105]]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories,
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Product quantization based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat could be gradually 
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW 
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. The issue has drawn the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. As an alternative for solving ANN similarity search problems, 
> IVFFlat is also very popular with many users and supporters. Compared with 
> HNSW, IVFFlat has a smaller index size but requires k-means clustering, while 
> HNSW is faster at query time (no training required) but requires extra 
> storage for saving graphs [indexing 1M 
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
>  Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW and 
> IVFFlat) for potential users who face very different scenarios and want more 
> choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*]([https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat)|https://github.com/irvingzhang/lu

[GitHub] [lucene-solr] jtibshirani commented on issue #1314: Coarse quantization

2020-03-03 Thread GitBox
jtibshirani commented on issue #1314: Coarse quantization
URL: https://github.com/apache/lucene-solr/pull/1314#issuecomment-594242054
 
 
   **Benchmarks**
   
   sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.
   ```
   APPROACH                      RECALL        QPS
   LuceneExact()                  1.000      6.425
   LuceneCluster(n_probes=2)      0.536   1138.926
   LuceneCluster(n_probes=5)      0.749    574.186
   LuceneCluster(n_probes=10)     0.874    308.455
   LuceneCluster(n_probes=20)     0.951   1161.871
   LuceneCluster(n_probes=50)     0.993     67.354
   LuceneCluster(n_probes=100)    0.999     34.651
   ```
   
   glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.
   ```
   APPROACH                      RECALL        QPS
   LuceneExact()                  1.000      6.722
   LuceneCluster(n_probes=5)      0.680    618.438
   LuceneCluster(n_probes=10)     0.766    335.956
   LuceneCluster(n_probes=20)     0.835    173.782
   LuceneCluster(n_probes=50)     0.905     72.747
   LuceneCluster(n_probes=100)    0.948     37.339
   ```
   
   These benchmarks were performed using the [ann-benchmarks 
repo](https://github.com/erikbern/ann-benchmarks). I hooked up the prototype to 
the benchmarking framework using py4j 
(e10d34c73dc391e4a105253f6181dfc0e9cb6705). Unfortunately py4j adds quite a bit 
of overhead (~3ms per search), so I had to measure that overhead and subtract 
it from the results. This is really not ideal; I will work on more robust 
benchmarks.
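The overhead subtraction mentioned above is simple arithmetic: convert the measured QPS to per-query latency, subtract the fixed harness overhead, and invert again. A sketch (the ~3 ms default comes from the py4j figure quoted above; the function name is mine):

```python
def corrected_qps(measured_qps, overhead_s=0.003):
    """Remove a fixed per-query harness overhead (in seconds) from a measured QPS."""
    measured_latency = 1.0 / measured_qps        # observed seconds per query
    true_latency = measured_latency - overhead_s # subtract the harness cost
    return 1.0 / true_latency
```

For example, a measured 100 QPS with 3 ms of overhead corresponds to about 143 QPS of underlying search throughput.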
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services




[GitHub] [lucene-solr] jtibshirani opened a new pull request #1314: Coarse quantization

2020-03-03 Thread GitBox
jtibshirani opened a new pull request #1314: Coarse quantization
URL: https://github.com/apache/lucene-solr/pull/1314
 
 
   **Note:** this PR is just meant to sketch out an idea and is not meant for 
detailed review.
   
   This PR shows a kNN approach based on coarse quantization (IVFFlat). It adds 
a new format `VectorsFormat`, which simply delegates to `DocValuesFormat` and 
`PostingsFormat` under the hood:
   * The original vectors are stored as `BinaryDocValues`.
   * The vectors are also clustered, and the cluster information is stored in 
postings format. In particular, each cluster centroid is encoded to a 
`BytesRef` to represent a term. Each document belonging to the centroid is 
added to the postings list for that term.
   
   There are currently some pretty big hacks:
   * We re-use the existing doc values and postings formats for simplicity. 
This is fairly fragile since we write to the same files as normal doc values 
and postings -- I think there would be a conflict if there were both a vector 
field and a doc values field with the same name.
* To write the postings list, we compute the map from centroid to documents 
in memory. We then expose it through a hacky `Fields` implementation called 
`ClusterBackedFields` and pass it to the postings writer. It would be better to 
avoid this hack and not to compute cluster information using a map.
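The clustering-plus-postings design described above can be sketched independently of the Lucene codec APIs. Below is a minimal, illustrative IVFFlat in plain Python (my own sketch, not the PR's code): k-means produces centroids, each document joins its centroid's "postings list", and a query probes only the `n_probe` nearest clusters before ranking those candidates exactly:

```python
import math
import random

def l2(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=10, seed=0):
    # Plain Lloyd's algorithm; returns centroids and per-cluster doc-id lists
    rnd = random.Random(seed)
    centroids = [list(v) for v in rnd.sample(vectors, k)]
    lists = [[] for _ in range(k)]
    for _ in range(iters):
        lists = [[] for _ in range(k)]
        for doc_id, v in enumerate(vectors):
            nearest = min(range(k), key=lambda c: l2(v, centroids[c]))
            lists[nearest].append(doc_id)  # the "postings list" of that centroid
        for c, members in enumerate(lists):
            if members:  # recompute each centroid as the mean of its members
                dim = len(vectors[0])
                centroids[c] = [sum(vectors[i][d] for i in members) / len(members)
                                for d in range(dim)]
    return centroids, lists

def search(query, vectors, centroids, lists, n_probe, top_k=1):
    # Visit only the n_probe closest clusters, then rank their members exactly
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in order[:n_probe] for i in lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:top_k]
```

With `n_probe` equal to the number of clusters the search degenerates to an exact scan, which is why recall approaches 1.0 as `n_probes` grows in the benchmarks quoted earlier in this thread.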





[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050656#comment-17050656
 ] 

Noble Paul commented on SOLR-13942:
---

{quote}The PR was created and merged in exactly 15 minutes.
{quote}
The original branch has been there for 3.5 months. I had to create a new one 
because git merge was not succeeding for some strange reason. 
{quote}The "data" inside ZooKeeper is cluster metadata that's subject to 
change without compatibility guarantees or any warning. By providing an API 
that people start building on top of, you need to start supporting 
compatibility.
{quote}
Data in Zookeeper is public data. It is not supposed to be fixed. Everyone 
knows this. Do people write tools by reading data from Zk today? I'm sure they 
do. Are they aware it can change? Yes. Why do they do it? They have no other 
choice.
{quote}Can we start here? What's the information that you are looking to find 
that belongs to a Solr API and it's not part of one today?
{quote}
Yes, let's start with the overseer queue and the leader election queue. 
{quote}But you see that the name of the endpoint is "zk", right? Or am I 
missing something?
{quote}
The name does not matter. You can just use "/sharedclusterdata" as the prefix 
and it will be just fine

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Example:
> Download the {{state.json}} of a collection:
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> Get a list of all children under {{/live_nodes}}:
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children show the list of child nodes 
> and their meta data






[jira] [Comment Edited] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Tomas Eduardo Fernandez Lobbe (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050627#comment-17050627
 ] 

Tomas Eduardo Fernandez Lobbe edited comment on SOLR-13942 at 3/3/20 11:19 PM:
---

bq. whether we use ZK or something else to store our shared data
But you see that the name of the endpoint is "zk", right? Or am I missing 
something?
bq. You understand the difference between exposing zookeeper and exposing data 
in zookeeper, right?
The "data" inside ZooKeeper is cluster metadata that's subject to change 
without compatibility guarantees or any warning. By providing an API that 
people start building on top of, you need to start supporting compatibility.

bq. This is a diagnostic tool.
Can we start here? What's the information that you are looking to find that 
belongs to a Solr API and it's not part of one today? You give "live_nodes" as 
an example, but that comes back in the {{clusterstatus}} call, so that's a bad 
example.

bq. 3.5 months is not "hurry". This issue has been under discussion since 
November.
The PR was created and merged in exactly 15 minutes. 
https://github.com/apache/lucene-solr/pull/1309. With absolutely no review. 
Even if the discussion in this Jira issue was not settled. [~ichattopadhyaya], 
I won't believe you if you tell me you looked at the code, it has a {{main}} 
and the test is a joke.

bq. Mostly, the fears were unfounded and unsubstantiated
well, that's your opinion. I do think you should revert until consensus is 
achieved.


was (Author: tomasflobbe):
bq. whether we use ZK or something else to store our shared data
But you see that the name of the endpoint is "zk", right? Or am I missing 
something?
bq. You understand the difference between exposing zookeeper and exposing data 
in zookeeper, right?
The "data" inside ZooKeeper is cluster metadata that's subject to change 
without compatibility guarantees or any warning. By providing an API that 
people start building on top of, you need to start supporting compatibility.

bq. This is a diagnostic tool.
Can we start here? What's the information that you are looking to find that 
belongs to a Solr API and it's not part of one today?

bq. 3.5 months is not "hurry". This issue has been under discussion since 
November.
The PR was created and merged in exactly 15 minutes. 
https://github.com/apache/lucene-solr/pull/1309. With absolutely no review. 
Even if the discussion in this Jira issue was not settled. [~ichattopadhyaya], 
I won't believe you if you tell me you looked at the code, it has a {{main}} 
and the test is a joke.

bq. Mostly, the fears were unfounded and unsubstantiated
well, that's your opinion. I do think you should revert until consensus is 
achieved.

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> example
> download the {{state.json}} of
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> get a list of all children under {{/live_nodes}}
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children, show the list of child nodes 
> and their metadata



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Tomas Eduardo Fernandez Lobbe (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050627#comment-17050627
 ] 

Tomas Eduardo Fernandez Lobbe edited comment on SOLR-13942 at 3/3/20 11:16 PM:
---

bq. whether we use ZK or something else to store our shared data
But you see that the name of the endpoint is "zk", right? or am I missing 
something?
bq. You understand the difference between exposing zookeeper and exposing data 
in zookeeper, right?
The "data" inside ZooKeeper is cluster metadata, that's subject to change 
without compatibility or any warnings. By providing an API, that people start 
building on top of, you need to start supporting compatibility.

bq. This is a diagnostic tool.
Can we start here? what's the information that you are looking to find that 
belongs to a Solr API and it's not part of one today?

bq. 3.5 months is not "hurry". This issue has been under discussion since 
November.
The PR was created and merged in exactly 15 minutes. 
https://github.com/apache/lucene-solr/pull/1309. With absolutely no review. 
Even though the discussion in this Jira issue was not settled. [~ichattopadhyaya], 
I won't believe you if you tell me you looked at the code, it has a {{main}} 
and the test is a joke.

bq. Mostly, the fears were unfounded and unsubstantiated
well, that's your opinion. I do think you should revert until consensus is 
achieved.


was (Author: tomasflobbe):
bq. whether we use ZK or something else to store our shared data
But you see that the name of the endpoint is "zk", right? or am I missing 
something?
bq. You understand the difference between exposing zookeeper and exposing data 
in zookeeper, right?
The "data" inside ZooKeeper is cluster metadata, that's subject to change 
without compatibility or any warnings. By providing an API, that people start 
building on top of, you need to start supporting compatibility.

bq. This is a diagnostic tool.
Can we start here? what's the information that you are looking to find that 
belongs to a Solr API and it's not part of one today?

bq. 3.5 months is not "hurry". This issue has been under discussion since 
November.
The PR was created and merged in exactly 15 minutes. 
https://github.com/apache/lucene-solr/pull/1309. With absolutely no review. 
Even though the discussion in this Jira issue was not settled. [~ichattopadhyaya], 
I won't believe you if you tell me you looked at the code, it has a `main` and 
the test is a joke.

bq. Mostly, the fears were unfounded and unsubstantiated
well, that's your opinion. I do think you should revert until consensus is 
achieved.







[jira] [Reopened] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Tomas Eduardo Fernandez Lobbe (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas Eduardo Fernandez Lobbe reopened SOLR-13942:
--







[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Tomas Eduardo Fernandez Lobbe (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050627#comment-17050627
 ] 

Tomas Eduardo Fernandez Lobbe commented on SOLR-13942:
--

bq. whether we use ZK or something else to store our shared data
But you see that the name of the endpoint is "zk", right? or am I missing 
something?
bq. You understand the difference between exposing zookeeper and exposing data 
in zookeeper, right?
The "data" inside ZooKeeper is cluster metadata, that's subject to change 
without compatibility or any warnings. By providing an API, that people start 
building on top of, you need to start supporting compatibility.

bq. This is a diagnostic tool.
Can we start here? what's the information that you are looking to find that 
belongs to a Solr API and it's not part of one today?

bq. 3.5 months is not "hurry". This issue has been under discussion since 
November.
The PR was created and merged in exactly 15 minutes. 
https://github.com/apache/lucene-solr/pull/1309. With absolutely no review. 
Even though the discussion in this Jira issue was not settled. [~ichattopadhyaya], 
I won't believe you if you tell me you looked at the code, it has a `main` and 
the test is a joke.

bq. Mostly, the fears were unfounded and unsubstantiated
well, that's your opinion. I do think you should revert until consensus is 
achieved.







[jira] [Created] (SOLR-14302) Solr stops printing stacktraces in log and output due to OmitStackTraceInFastThrow - regression of SOLR-7436

2020-03-03 Thread Chris M. Hostetter (Jira)
Chris M. Hostetter created SOLR-14302:
-

 Summary: Solr stops printing stacktraces in log and output due to 
OmitStackTraceInFastThrow - regression of SOLR-7436 
 Key: SOLR-14302
 URL: https://issues.apache.org/jira/browse/SOLR-14302
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Affects Versions: 8.1
Reporter: Chris M. Hostetter


I recently saw a person ask a question about an Exception in their logs that 
didn't have a stack trace even though it certainly seemed like it should.

I was immediately suspicious that they may have tweaked their solr options to 
override the {{-XX:-OmitStackTraceInFastThrow}} that was added to bin/solr by 
SOLR-7436, but then i discovered it's gone now - removed in SOLR-13394 w/o any 
discussion/consideration (and possibly unintentionally w/o understanding it's 
purpose?)

We should almost certainly restore this by default.
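If the flag was indeed dropped from bin/solr, a likely interim workaround is to re-add it via the standard solr.in.sh include file. This is a hedged sketch, assuming the usual SOLR_OPTS mechanism; the exact file location and variable handling may differ by Solr version:

```shell
# In solr.in.sh (assumed mechanism): keep HotSpot's JIT from
# replacing frequently-thrown exceptions with preallocated,
# stack-trace-less instances, so logs retain full stack traces.
SOLR_OPTS="$SOLR_OPTS -XX:-OmitStackTraceInFastThrow"
```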






[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-03 Thread Michael Froh (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050607#comment-17050607
 ] 

Michael Froh commented on LUCENE-8962:
--

I ended up splitting testMergeOnCommit into two test cases.

One runs through the basic invariants on a single thread and confirms that 
everything behaves as expected.

The other tries indexing and committing from multiple threads, but doesn't 
really make any assumptions about the segment topology in the end (since 
randomness and concurrency can lead to all kinds of possible valid segment 
counts). Instead it just verifies that it doesn't fail and doesn't lose any 
documents.

https://github.com/apache/lucene-solr/pull/1313

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate many small segments during {{refresh}}, and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off the merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!






[GitHub] [lucene-solr] msfroh opened a new pull request #1313: LUCENE-8962: Split test case

2020-03-03 Thread GitBox
msfroh opened a new pull request #1313: LUCENE-8962: Split test case
URL: https://github.com/apache/lucene-solr/pull/1313
 
 
   The testMergeOnCommit test case was trying to verify too many things
   at once: basic semantics of merge on commit and proper behavior when
   a bunch of indexing threads are writing and committing all at once.
   
   Splitting the test into two should make the tests more robust - one
   will verify basic behavior, with strict assertions on invariants, while
   the other just verifies that everything gets indexed and we don't throw
   an exception when multiple threads are indexing and merging on commit.
   
   Also, the part of the test that is now testMultithreadedMergeOnCommit
   can take several seconds to run, so it has been moved to the @Nightly set.
   
   
   
   
   # Description
   
   Fixing an intermittent test failure on testMergeOnCommit.
   
   # Solution
   
   Split the logic from testMergeOnCommit into two test cases. The basic test 
has consistently passed, and actually verifies the merge on commit invariants. 
The more complicated, more potentially-brittle multithreaded test doesn't 
necessarily satisfy clear invariants (as we may be merging on commit from 
multiple threads, which could result in multiple segments in the end), but it 
should never throw an exception or lose any updates.
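The testing strategy described here - assert only that nothing is lost or duplicated, not a specific segment topology - looks roughly like the following standalone sketch. Plain Python threads stand in for Lucene indexing threads, and all names and the flush point are illustrative, not Lucene APIs:

```python
import threading

def run_concurrent_indexing(num_threads=4, docs_per_thread=250):
    """Each thread appends documents; full batches are sealed into new
    'segments'. The final segment layout depends on thread timing, so
    the only safe assertions are on the total document set."""
    lock = threading.Lock()
    segments = []  # sealed batches of doc ids
    current = []   # open, unsealed batch

    def index(thread_id):
        nonlocal current
        for i in range(docs_per_thread):
            doc_id = (thread_id, i)
            with lock:
                current.append(doc_id)
                if len(current) >= 10:  # arbitrary flush point
                    segments.append(current)
                    current = []

    threads = [threading.Thread(target=index, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    with lock:
        if current:  # seal any leftover open batch
            segments.append(current)
    return segments

segments = run_concurrent_indexing()
total_docs = sum(len(s) for s in segments)
```

Asserting on `total_docs` and uniqueness of doc ids stays deterministic across runs, while any assertion on `len(segments)` would be timing-dependent - the same trade-off the split test makes.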
   
   # Tests
   
   Split existing test case into two test cases. Ran tests multiple times.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [X] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [X] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [X] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [X] I have developed this patch against the `master` branch.
   - [X] I have run `ant precommit` and the appropriate test suite.
   - [X] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services




[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050580#comment-17050580
 ] 

Noble Paul commented on SOLR-13942:
---

bq. please revert this until some of the outstanding questions have been 
answered and/or some of us others have been convinced.

After so many months I cannot see a single argument against it. Mostly, the 
fears were unfounded and unsubstantiated.







[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Jason Gerlowski (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050578#comment-17050578
 ] 

Jason Gerlowski commented on SOLR-13942:


bq. Consensus was sought.  I fully support this.

Maybe it was sought, but it certainly wasn't reached.  There's technically 2 
people in favor, sure.  But there's 4 or 5 people on this JIRA who were raising 
questions and concerns before, while, and after you committed this.  That's not 
consensus by any stretch.

I'll second Jan's request - please revert this until some of the outstanding 
questions have been answered and/or some of us others have been convinced.

(Side note: Guess there's some issue with gitbot that it didn't comment here?)







[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050562#comment-17050562
 ] 

Ishan Chattopadhyaya commented on SOLR-13942:
-

bq. Why the hurry with this? 
3.5 months is not "hurry". This issue has been under discussion since November.

bq. Why not seek consensus? 
Consensus was sought. I fully support this. This issue makes a production Solr 
cluster easier to monitor by devops or developers, without giving users direct 
access to ZK. Arguably, the same can be done using the /admin/zookeeper 
endpoint as well, but its output format is garbage, and hence this is the right 
way to go. But all of this was discussed. 

bq. Please revert!
Why?







[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050558#comment-17050558
 ] 

Noble Paul commented on SOLR-13942:
---

bq.Ishan, you say that zk is insecure, but if you look a bit closer, you can 
configure ssl, auth and acl with zk now after 3.5.


Why even expose ZK in the first place? The easiest way to secure something is 
to not give access to it at all. 







[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050557#comment-17050557
 ] 

Noble Paul commented on SOLR-13942:
---

bq. Are you kidding me? it's a "zookeeper" endpoint

You understand the difference between exposing zookeeper and exposing data in 
zookeeper, right? Exposing Zookeeper is a security loophole. Exposing data is 
not.

This is a diagnostic tool. If you wish to diagnose problems in your cluster, 
the only way to do so is to expose more data. We cannot pretend that this data 
does not exist or that users do not need it. The users NEED this data, and 
security considerations prevent them from exposing ZooKeeper to a wider 
audience.







[jira] [Commented] (SOLR-14232) Add shareSchema leak protections

2020-03-03 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050553#comment-17050553
 ] 

David Smiley commented on SOLR-14232:
-

If the SRL is created per-IndexSchema (when shareSchema=true), as it 
unquestionably should be changed to, then perhaps "waitingForCore" and 
"infoMBeans" will simply be unsupported -- components that need those things 
can't be used, and attempting to should cause an error as being incompatible.

> Add shareSchema leak protections
> 
>
> Key: SOLR-14232
> URL: https://issues.apache.org/jira/browse/SOLR-14232
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: David Smiley
>Priority: Major
>
> The shareSchema option in solr.xml allows cores to share a common 
> IndexSchema, assuming the underlying schema is literally the same (from the 
> same configSet). However this sharing has no protections to prevent an 
> IndexSchema from accidentally referencing the SolrCore and its settings. The 
> effect might be nondeterministic behavior depending on which core loaded the 
> schema first, or the effect might be a memory leak preventing a closed 
> SolrCore from GC'ing, or maybe an error. Example:
>  * IndexSchema could theoretically do property expansion using the core's 
> props, such as solr.core.name, silly as that may be.
>  * IndexSchema uses the same SolrResourceLoader for the core, which in turn 
> tracks infoMBeans and other things that can refer to the core. It should 
> probably have its own SolrResourceLoader, but it's not trivial; there are 
> complications with life-cycle of ResourceLoaderAware tracking etc.
>  * If anything in IndexSchema is SolrCoreAware, this isn't going to work!
>  ** SchemaSimilarityFactory is SolrCoreAware, though I think it could be 
> reduced to being SchemaAware and work.
>  ** ExternalFileField is currently SchemaAware; it grabs the 
> SolrResourceLoader to call getDataDir, which is bad.  FYI In a separate PR I'm 
> removing getDataDir from SRL.
>  ** Should probably fail if anything is detected.






[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050550#comment-17050550
 ] 

Jan Høydahl commented on SOLR-13942:


Why the hurry with this? Why not seek consensus? Please revert!

Ishan, you say that zk is insecure, but if you look a bit closer, you can 
configure ssl, auth and acl with zk now after 3.5. We still have several open 
JIRAs to make this easier but it is possible. We can also lock down current 
/admin/zookeeper api.







[GitHub] [lucene-solr] noblepaul commented on issue #1254: SOLR-14259: Back porting SOLR-14013 to Solr 7.7

2020-03-03 Thread GitBox
noblepaul commented on issue #1254: SOLR-14259: Back porting SOLR-14013 to Solr 
7.7
URL: https://github.com/apache/lucene-solr/pull/1254#issuecomment-594161743
 
 
   Yes, it is ready to merge.
   
   I was thinking of doing a release post 8.5 as I was tied up in other things
   
   On Wed, Mar 4, 2020 at 4:55 AM Houston Putman 
   wrote:
   
   > Hey Noble, is this ready to merge?
   >
   > I can do some performance testing if that'd help.
   >
   > —
   > You are receiving this because you authored the thread.
   > Reply to this email directly, view it on GitHub
   > 
,
   > or unsubscribe
   > 

   > .
   >
   
   
   -- 
   -
   Noble Paul
   





[jira] [Commented] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050542#comment-17050542
 ] 

Noble Paul commented on SOLR-14040:
---

bq.This shortcoming is not new; it was here before I SolrCloud-enabled 
shareSchema. 

You are right. This problem existed. But it affected a small subset of users. 
Now that we have implemented it for cloud, it can potentially affect a vast 
majority of users (if they use it).

bq.  shareSchema is undocumented, and undocumented is basically the same as 
experimental,

I'm not sure that {{undocumented = experimental}}. We should have a proper way 
to communicate that an existing feature is potentially buggy / to be used with care.

> solr.xml shareSchema does not work in SolrCloud
> ---
>
> Key: SOLR-14040
> URL: https://issues.apache.org/jira/browse/SOLR-14040
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> solr.xml has a shareSchema boolean option that can be toggled from the 
> default of false to true in order to share IndexSchema objects within the 
> Solr node.  This is silently ignored in SolrCloud mode.  The pertinent code 
> is {{org.apache.solr.core.ConfigSetService#createConfigSetService}} which 
> creates a CloudConfigSetService that is not related to the SchemaCaching 
> class.  This may not be a big deal in SolrCloud which tends not to deal well 
> with many cores per node but I'm working on changing that.






[jira] [Commented] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-03-03 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050535#comment-17050535
 ] 

David Smiley commented on SOLR-14040:
-

Hi Noble; thanks for your analysis!  I think the concern you raised is one I am 
aware of -- see the linked issue: SOLR-14232.  This shortcoming is not new; it 
was here before I SolrCloud-enabled shareSchema.  shareSchema is undocumented, 
and undocumented is basically the same as experimental, so I'm not sure there 
is any short-term change to be done here on 8x?







[jira] [Updated] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-03-03 Thread Noble Paul (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-14040:
--
Priority: Blocker  (was: Major)

> solr.xml shareSchema does not work in SolrCloud
> ---
>
> Key: SOLR-14040
> URL: https://issues.apache.org/jira/browse/SOLR-14040
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Blocker
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> solr.xml has a shareSchema boolean option that can be toggled from the 
> default of false to true in order to share IndexSchema objects within the 
> Solr node.  This is silently ignored in SolrCloud mode.  The pertinent code 
> is {{org.apache.solr.core.ConfigSetService#createConfigSetService}} which 
> creates a CloudConfigSetService that is not related to the SchemaCaching 
> class.  This may not be a big deal in SolrCloud which tends not to deal well 
> with many cores per node but I'm working on changing that.






[jira] [Reopened] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-03-03 Thread Noble Paul (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul reopened SOLR-14040:
---

> solr.xml shareSchema does not work in SolrCloud
> ---
>
> Key: SOLR-14040
> URL: https://issues.apache.org/jira/browse/SOLR-14040
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> solr.xml has a shareSchema boolean option that can be toggled from the 
> default of false to true in order to share IndexSchema objects within the 
> Solr node.  This is silently ignored in SolrCloud mode.  The pertinent code 
> is {{org.apache.solr.core.ConfigSetService#createConfigSetService}} which 
> creates a CloudConfigSetService that is not related to the SchemaCaching 
> class.  This may not be a big deal in SolrCloud which tends not to deal well 
> with many cores per node but I'm working on changing that.






[jira] [Commented] (SOLR-14040) solr.xml shareSchema does not work in SolrCloud

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050531#comment-17050531
 ] 

Noble Paul commented on SOLR-14040:
---

There are a few fundamental problems with the way this is implemented:

# The first time the {{IndexSchema}} is created, it uses the 
{{SolrResourceLoader}} of the first core, let's call it {{SRL1}}. When another 
core comes up, it gets another {{SolrResourceLoader}}, {{SRL2}}. The components 
loaded for core2 use {{SRL2}}, but the {{IndexSchema}} it uses was loaded with 
{{SRL1}} (because it is shared). This can cause a {{ClassCastException}} if a 
per-core component shares an object with the schema.
# As core2 is still holding on to the shared schema, it in turn holds a 
reference to {{SRL1}}. So even if core1 is unloaded, it is not fully garbage 
collected and its classes and objects can linger on.

What can we do?

# Revert this, come up with a proper design, and implement everything 
correctly in the next release
# Mark this as experimental and warn users that it is faulty
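The classloader mismatch in problem 1 can be reproduced outside Solr. The sketch below is plain JDK, with a hypothetical {{Payload}} class and a custom loader standing in for a second core's {{SolrResourceLoader}}; it defines the same class bytes through a second loader and shows the resulting type is incompatible with the first:

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a schema-level class shared across cores.
class Payload {}

public class Main {
    // Simulates a second core's resource loader ("SRL2"): it defines
    // Payload itself instead of delegating to the parent loader.
    static class IsolatingLoader extends ClassLoader {
        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (name.equals("Payload")) {
                try (InputStream in = getResourceAsStream("Payload.class")) {
                    if (in == null) {
                        throw new ClassNotFoundException(name);
                    }
                    byte[] bytes = in.readAllBytes();
                    // Same bytes, but a brand-new runtime type owned by this loader.
                    return defineClass(name, bytes, 0, bytes.length);
                } catch (IOException e) {
                    throw new ClassNotFoundException(name, e);
                }
            }
            return super.loadClass(name, resolve);
        }
    }

    public static void main(String[] args) throws Exception {
        Class<?> viaSrl1 = Payload.class;                          // loaded by the app loader ("SRL1")
        Class<?> viaSrl2 = new IsolatingLoader().loadClass("Payload");
        System.out.println(viaSrl1 == viaSrl2);                    // false: two distinct runtime types
        Object fromSrl2 = viaSrl2.getDeclaredConstructor().newInstance();
        System.out.println(fromSrl2 instanceof Payload);           // false: a cast here would throw ClassCastException
    }
}
```

Because the two {{Class}} objects differ, a cast between the shared schema's objects and a per-core component's type fails at runtime, which is the {{ClassCastException}} described in problem 1; and each such {{Class}} pins its defining loader, which is the retention issue in problem 2.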






> solr.xml shareSchema does not work in SolrCloud
> ---
>
> Key: SOLR-14040
> URL: https://issues.apache.org/jira/browse/SOLR-14040
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> solr.xml has a shareSchema boolean option that can be toggled from the 
> default of false to true in order to share IndexSchema objects within the 
> Solr node.  This is silently ignored in SolrCloud mode.  The pertinent code 
> is {{org.apache.solr.core.ConfigSetService#createConfigSetService}} which 
> creates a CloudConfigSetService that is not related to the SchemaCaching 
> class.  This may not be a big deal in SolrCloud which tends not to deal well 
> with many cores per node but I'm working on changing that.






[jira] [Issue Comment Deleted] (SOLR-14286) Upgrade Jaegar to 1.1.0

2020-03-03 Thread Cassandra Targett (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cassandra Targett updated SOLR-14286:
-
Comment: was deleted

(was: Commit d7f80c743feca76464b119ff14eaad2703fa5594 in lucene-solr's branch 
refs/heads/branch_8x from Cassandra Targett
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d7f80c7 ]

SOLR-14286: jvm-settings.adoc: minor typos; add links to external resources
)

> Upgrade Jaegar to 1.1.0
> ---
>
> Key: SOLR-14286
> URL: https://issues.apache.org/jira/browse/SOLR-14286
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Fix For: master (9.0), 8.5
>
>
> Rohit Singh pointed out to me that we are using thrift 0.12.0 (in the 
> JaegarTracer-Configurator module), which has several security issues. We 
> should upgrade to Jaegar 1.1.0, which is compatible with the current version 
> we are using. 






[jira] [Issue Comment Deleted] (SOLR-14286) Upgrade Jaegar to 1.1.0

2020-03-03 Thread Cassandra Targett (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cassandra Targett updated SOLR-14286:
-
Comment: was deleted

(was: Commit fa6166f2611a965a3f0761bdb7c9e3c7b0aa1d1b in lucene-solr's branch 
refs/heads/master from Cassandra Targett
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fa6166f ]

SOLR-14286: jvm-settings.adoc: minor typos; add links to external resources
)

> Upgrade Jaegar to 1.1.0
> ---
>
> Key: SOLR-14286
> URL: https://issues.apache.org/jira/browse/SOLR-14286
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Fix For: master (9.0), 8.5
>
>
> Rohit Singh pointed out to me that we are using thrift 0.12.0 (in the 
> JaegarTracer-Configurator module), which has several security issues. We 
> should upgrade to Jaegar 1.1.0, which is compatible with the current version 
> we are using. 






[jira] [Commented] (SOLR-14263) Update jvm-settings.adoc

2020-03-03 Thread Cassandra Targett (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050528#comment-17050528
 ] 

Cassandra Targett commented on SOLR-14263:
--

Oops, accidentally used SOLR-14286 as the JIRA for a couple of commits to fix 
typos on master and branch_8x. Here are the SHAs for those:

master: fa6166f2611a965a3f0761bdb7c9e3c7b0aa1d1b 
(https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fa6166f)
branch_8x: d7f80c743feca76464b119ff14eaad2703fa5594 
(https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d7f80c7)

> Update jvm-settings.adoc
> 
>
> Key: SOLR-14263
> URL: https://issues.apache.org/jira/browse/SOLR-14263
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
> Fix For: 8.5
>
> Attachments: SOLR-14263.patch
>
>
> First of all it talks about a two gigabyte heap. Second, I thought we were 
> usually recommending -Xmx and -Xms be the same. I'll have a revision up 
> shortly, I'm thinking of some major surgery on it.
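As a reference for the -Xmx/-Xms point above, heap size is typically pinned in {{solr.in.sh}}; a minimal fragment (variable names as in the stock 8.x start scripts, the 2g value purely illustrative):

```shell
# solr.in.sh -- pin min and max heap to the same value so the JVM
# never resizes the heap at runtime
SOLR_HEAP="2g"                      # bin/solr expands this to -Xms2g -Xmx2g

# equivalent explicit form, if you prefer to spell out the flags:
# SOLR_JAVA_MEM="-Xms2g -Xmx2g"
```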






[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Tomas Eduardo Fernandez Lobbe (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050526#comment-17050526
 ] 

Tomas Eduardo Fernandez Lobbe commented on SOLR-13942:
--

bq. In an ideal world where Solr does a wonderful job of maintaining stability, 
this would be redundant. 
Well, we are aiming for that. That's exactly the point some of us are making: 
this goes in the opposite direction.

bq. Unfortunately, our users still have to do direct read/write on ZK to make 
things work. 
You certainly have to monitor ZooKeeper these days, but that doesn't mean we 
need a Solr API to do it. ZooKeeper has its own APIs that you can use. 

bq. This is not supposed to be a "public API" that people should use everyday. 
This is like one of those APIs in your toolbox that you have under 
`/admin/info` which normal users do not/should not need.
Once you expose it as an API, it's a public API, no matter how it looks to 
you. Users will see it and, if they want, start using and depending on it.

bq. Having stealth APIs like `/admin/zookeeper` is the real problem. People do 
not know that ZK is exposed for read, but they are. We should have one and only 
way to achieve it and we should lock it down with proper permissions.
I guess this API was added back in the day as part of the same "tool box" you 
are referring to. That was in the days when there was no intention of hiding 
ZooKeeper; I suspect that's why it's there. This issue is not dealing with 
that in particular, right? It's just adding a new API.

bq. Our goal is to make ZK exposed as little as possible. Everyone should 
access everything through HTTP and there should be no need to expose ZK ever. 
Irrespective of whether we use ZK or something else to store our shared data, 
this endpoint can still work. 
Are you kidding me? It's a "zookeeper" endpoint.

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Example: download the {{state.json}} of a collection:
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> Get a list of all children under {{/live_nodes}}:
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children, show the list of child nodes 
> and their metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] tflobbe merged pull request #1312: Fix resource leak in TestPolicyCloud

2020-03-03 Thread GitBox
tflobbe merged pull request #1312: Fix resource leak in TestPolicyCloud
URL: https://github.com/apache/lucene-solr/pull/1312
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services




[jira] [Comment Edited] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050489#comment-17050489
 ] 

Noble Paul edited comment on SOLR-13942 at 3/3/20 7:31 PM:
---

[~tflobbe]

In an ideal world where Solr does a wonderful job of maintaining stability, 
this would be redundant. Unfortunately, our users still have to do direct 
read/write on ZK to make things work. This is not supposed to be a "public 
API" that people should use every day. This is like one of those APIs in your 
toolbox that you have under `/admin/info` which normal users do not/should not 
need. Having stealth APIs like `/admin/zookeeper` is the real problem. People 
do not know that ZK is exposed for read, but it is. We should have one and 
only one way to achieve it, and we should lock it down with proper permissions.


Our goal is to expose ZK as little as possible. Everyone should access 
everything through HTTP and there should be no need to expose ZK ever. 
Irrespective of whether we use ZK or something else to store our shared data, 
this endpoint can still work. 

Our SolrJ clients always used to require direct access to ZK. We implemented 
an HTTP-only SolrJ client so that direct access to ZK is not required. That's 
how we make ZK an implementation detail. 


was (Author: noble.paul):
[~tflobbe]

In an ideal world where Solr does a wonderful job of maintaining stability, 
this would be redundant. Unfortunately, our users still have to do direct 
read/write on ZK to make things work. This is not supposed to  be a "public 
API" that people should use everyday. This is like one of those APIs in your 
toolbox that you have under `/admin/info` which normal users do not/should not 
need. Having stealth APIs like `/admin/zookeeper` is the real problem. People 
do not know that ZK is exposed for read, but they are. We should have one and 
only way to achieve it and we should lock it down with proper permissions.


Our goal is to make ZK exposed as little as possible. Everyone should access 
everything through HTTP and there should be no need to expose ZK ever. 
Irrespective of whether we use ZK or something else to store our shared data, 
this endpoint can still work. 

Our SolrJ clients always used to require direct access to ZK. We implemented an 
HTTP only  SolrJ client so that direct access of ZK is required. That's how we 
make ZK an implementation detail. 

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Example: download the {{state.json}} of a collection:
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> Get a list of all children under {{/live_nodes}}:
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children, show the list of child nodes 
> and their metadata.






[jira] [Comment Edited] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050489#comment-17050489
 ] 

Noble Paul edited comment on SOLR-13942 at 3/3/20 7:22 PM:
---

[~tflobbe]

In an ideal world where Solr does a wonderful job of maintaining stability, 
this would be redundant. Unfortunately, our users still have to do direct 
read/write on ZK to make things work. This is not supposed to  be a "public 
API" that people should use everyday. This is like one of those APIs in your 
toolbox that you have under `/admin/info` which normal users do not/should not 
need. Having stealth APIs like `/admin/zookeeper` is the real problem. People 
do not know that ZK is exposed for read, but they are. We should have one and 
only way to achieve it and we should lock it down with proper permissions.


Our goal is to make ZK exposed as little as possible. Everyone should access 
everything through HTTP and there should be no need to expose ZK ever. 
Irrespective of whether we use ZK or something else to store our shared data, 
this endpoint can still work. 

Our SolrJ clients always used to require direct access to ZK. We implemented an 
HTTP only  SolrJ client so that direct access of ZK is required. That's how we 
make ZK an implementation detail. 


was (Author: noble.paul):
[~tflobbe]

In an ideal world where Solr does a wonderful job of maintaining stability, 
this would be redundant. Unfortunately, our users still have to to direct 
read/write on ZK to make things work. This is not supposed to  be a "public 
API" that people should use everyday. This is like one of those APIs in your 
toolbox that you have under `/admin/info` which normal users do not/should not 
need. Having stealth APIs like `/admin/zookeeper` is the real problem. People 
do not know that ZK is exposed for read, but they are. We should have one and 
only way to achieve it and we should lock it down with proper permissions.


Our goal is to make ZK exposed as little as possible. Everyone should access 
everything through HTTP and there should be no need to expose ZK ever. 
Irrespective of whether we use ZK or something else to store our shared data, 
this endpoint can still work. 

Our SolrJ clients always used to require direct access to ZK. We implemented an 
HTTP only  SolrJ client so that direct access of ZK is required. That's how we 
make ZK an implementation detail. 

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Example: download the {{state.json}} of a collection:
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> Get a list of all children under {{/live_nodes}}:
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children, show the list of child nodes 
> and their metadata.






[jira] [Comment Edited] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050489#comment-17050489
 ] 

Noble Paul edited comment on SOLR-13942 at 3/3/20 7:21 PM:
---

[~tflobbe]

In an ideal world where Solr does a wonderful job of maintaining stability, 
this would be redundant. Unfortunately, our users still have to to direct 
read/write on ZK to make things work. This is not supposed to  be a "public 
API" that people should use everyday. This is like one of those APIs in your 
toolbox that you have under `/admin/info` which normal users do not/should not 
need. Having stealth APIs like `/admin/zookeeper` is the real problem. People 
do not know that ZK is exposed for read, but they are. We should have one and 
only way to achieve it and we should lock it down with proper permissions.


Our goal is to make ZK exposed as little as possible. Everyone should access 
everything through HTTP and there should be no need to expose ZK ever. 
Irrespective of whether we use ZK or something else to store our shared data, 
this endpoint can still work. 

Our SolrJ clients always used to require direct access to ZK. We implemented an 
HTTP only  SolrJ client so that direct access of ZK is required. That's how we 
make ZK an implementation detail. 


was (Author: noble.paul):
[~tflobbe]

In an ideal world where Solr does a wonderful job of maintaining stability, 
this would be redundant. Unfortunately, our users still have to to direct 
read/write on ZK to make things work. This is not supposed to  be a "public 
API" that people should use everyday. This is like one of those APIs in your 
toolbox that you have under `/admin/info` which normal users do not/should not 
need. Having stealth APIs like `/admin/zookeeper` is the real problem. People 
do not know that ZK is exposed for read, but they are. We should have one and 
only way to achieve it and we should lock it down with proper permissions.

Our ZK exposed as little as possible. Everyone should access everything through 
HTTP and there should be no need to expose ZK ever. Irrespective of whether we 
use ZK or something else to store our shared data, this endpoint can still 
work. 

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Example: download the {{state.json}} of a collection:
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> Get a list of all children under {{/live_nodes}}:
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children, show the list of child nodes 
> and their metadata.






[jira] [Comment Edited] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050489#comment-17050489
 ] 

Noble Paul edited comment on SOLR-13942 at 3/3/20 7:18 PM:
---

[~tflobbe]

In an ideal world where Solr does a wonderful job of maintaining stability, 
this would be redundant. Unfortunately, our users still have to to direct 
read/write on ZK to make things work. This is not supposed to  be a "public 
API" that people should use everyday. This is like one of those APIs in your 
toolbox that you have under `/admin/info` which normal users do not/should not 
need. Having stealth APIs like `/admin/zookeeper` is the real problem. People 
do not know that ZK is exposed for read, but they are. We should have one and 
only way to achieve it and we should lock it down with proper permissions.

Our ZK exposed as little as possible. Everyone should access everything through 
HTTP and there should be no need to expose ZK ever. Irrespective of whether we 
use ZK or something else to store our shared data, this endpoint can still 
work. 


was (Author: noble.paul):
[~tflobbe]

In an ideal world where Solr does a wonderful job of maintaining stability, 
this would be redundant. Unfortunately, our users still have to to direct 
read/write on ZK to make things work. This is not supposed to  be a "public 
API" that people should use everyday. This is like one of those APIs in your 
toolbox that you have under `/admin/info` which normal users do not/should not 
need. Having stealth APIs like `/admin/zookeeper` is the real problem. People 
do not know that ZK is exposed for read, but they are. We should have one and 
only way to achieve it and we should lock it down with proper permissions.

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Example: download the {{state.json}} of a collection:
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> Get a list of all children under {{/live_nodes}}:
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children, show the list of child nodes 
> and their metadata.






[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Noble Paul (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050489#comment-17050489
 ] 

Noble Paul commented on SOLR-13942:
---

[~tflobbe]

In an ideal world where Solr does a wonderful job of maintaining stability, 
this would be redundant. Unfortunately, our users still have to do direct 
read/write on ZK to make things work. This is not supposed to be a "public 
API" that people should use every day. This is like one of those APIs in your 
toolbox that you have under `/admin/info` which normal users do not/should not 
need. Having stealth APIs like `/admin/zookeeper` is the real problem. People 
do not know that ZK is exposed for read, but it is. We should have one and 
only one way to achieve it, and we should lock it down with proper permissions.

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Example: download the {{state.json}} of a collection:
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> Get a list of all children under {{/live_nodes}}:
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children, show the list of child nodes 
> and their metadata.






[GitHub] [lucene-solr] atris commented on issue #1303: LUCENE-9114: Improve ValueSourceScorer's Default Cost Implementation

2020-03-03 Thread GitBox
atris commented on issue #1303: LUCENE-9114: Improve ValueSourceScorer's 
Default Cost Implementation
URL: https://github.com/apache/lucene-solr/pull/1303#issuecomment-594105306
 
 
   > Ehh; nevermind my ill-thought-out idea of a cost on the Map context. There 
are many ValueSource.getValues impls that'd need to parse it, and then there's 
a concern that we wouldn't want it to propagate to sub-FunctionValues.
   > 
   > Alternative proposal: When FunctionRangeQuery calls 
functionValues.getRangeScorer, it gets back a ValueSourceScorer. We could just 
add a mutable cost on VSC that if set will be returned by VSC and if not VSC 
will delegate to the proposed `FV.cost`. While the mutability of it isn't 
pretty, it's also quite minor. It saves FRQ from having to wrap the scorer only 
to specify a matchCost.
   
   Yeah, I had a similar idea, but the per-VS implementations' need for 
parsing was putting me off.
   
   I am not a fan of the intrusive mutability, but I agree that it is the 
cleanest way to support this functionality without needing to define a cost 
model for every FV implementation in this PR. As a follow-up to this PR, I 
plan to define cost models for DoubleValuesSource, IntFieldSource, etc.
   
   Raised another iteration with said approach. I am not sure how to write a 
comprehensive test for this functionality, but the Lucene and Solr test suites 
pass with this commit. Please take a look and let me know your thoughts and 
comments.






[jira] [Commented] (SOLR-14286) Upgrade Jaegar to 1.1.0

2020-03-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050467#comment-17050467
 ] 

ASF subversion and git services commented on SOLR-14286:


Commit d7f80c743feca76464b119ff14eaad2703fa5594 in lucene-solr's branch 
refs/heads/branch_8x from Cassandra Targett
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d7f80c7 ]

SOLR-14286: jvm-settings.adoc: minor typos; add links to external resources


> Upgrade Jaegar to 1.1.0
> ---
>
> Key: SOLR-14286
> URL: https://issues.apache.org/jira/browse/SOLR-14286
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Fix For: master (9.0), 8.5
>
>
> Rohit Singh pointed out to me that we are using thrift 0.12.0 (in the 
> JaegarTracer-Configurator module), which has several security issues. We 
> should upgrade to Jaegar 1.1.0, which is compatible with the current version 
> we are using. 






[jira] [Commented] (SOLR-14286) Upgrade Jaegar to 1.1.0

2020-03-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050466#comment-17050466
 ] 

ASF subversion and git services commented on SOLR-14286:


Commit fa6166f2611a965a3f0761bdb7c9e3c7b0aa1d1b in lucene-solr's branch 
refs/heads/master from Cassandra Targett
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fa6166f ]

SOLR-14286: jvm-settings.adoc: minor typos; add links to external resources


> Upgrade Jaegar to 1.1.0
> ---
>
> Key: SOLR-14286
> URL: https://issues.apache.org/jira/browse/SOLR-14286
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Fix For: master (9.0), 8.5
>
>
> Rohit Singh pointed out to me that we are using thrift 0.12.0 (in the 
> JaegarTracer-Configurator module), which has several security issues. We 
> should upgrade to Jaegar 1.1.0, which is compatible with the current version 
> we are using. 






[jira] [Commented] (SOLR-12238) Synonym Query Style Boost By Payload

2020-03-03 Thread Cassandra Targett (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050462#comment-17050462
 ] 

Cassandra Targett commented on SOLR-12238:
--

[~romseygeek] or [~alessandro.benedetti], This issue seems to be missing from 
CHANGES.txt for 8.5 - was there a reason why it was left out, or is it just an 
oversight?

> Synonym Query Style Boost By Payload
> 
>
> Key: SOLR-12238
> URL: https://issues.apache.org/jira/browse/SOLR-12238
> Project: Solr
>  Issue Type: Improvement
>  Components: query parsers
>Affects Versions: 7.2
>Reporter: Alessandro Benedetti
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 8.5
>
> Attachments: SOLR-12238.patch, SOLR-12238.patch, SOLR-12238.patch, 
> SOLR-12238.patch
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> This improvement is built on top of the Synonym Query Style feature and 
> brings the possibility of boosting synonym queries using the payload 
> associated.
> It introduces two new modalities for the Synonym Query Style:
> PICK_BEST_BOOST_BY_PAYLOAD -> build a Disjunction query with the clauses 
> boosted by payload
> AS_DISTINCT_TERMS_BOOST_BY_PAYLOAD -> build a Boolean query with the clauses 
> boosted by payload
> These new synonym query styles assume payloads are available, so they must 
> be used in conjunction with a token filter able to produce payloads.
> A synonym.txt example could be:
> # Synonyms used by Payload Boost
> tiger => tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9
> leopard => leopard, Big_Cat|0.8, Bagheera|0.9
> lion => lion|1.0, panthera leo|0.99, Simba|0.8
> snow_leopard => panthera uncia|0.99, snow leopard|1.0
> A simple token filter to populate the payloads from such a synonym.txt is:
> <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|"/>
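For illustration, the term|payload entries in the synonym rules above could be parsed with a small helper like the following. This is a hypothetical sketch, not Solr's actual parsing code; `PayloadSynonyms` and `parse` are invented names, and terms without an explicit payload default to 1.0 here by assumption:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PayloadSynonyms {
  // Parses one right-hand side of a payload-carrying synonym rule,
  // e.g. "tiger|1.0, Big_Cat|0.8, Shere_Khan|0.9", into term -> boost.
  // A term with no "|payload" suffix gets a default boost of 1.0.
  static Map<String, Float> parse(String rhs) {
    Map<String, Float> boosts = new LinkedHashMap<>();
    for (String entry : rhs.split(",")) {
      String[] parts = entry.trim().split("\\|");
      boosts.put(parts[0], parts.length > 1 ? Float.parseFloat(parts[1]) : 1.0f);
    }
    return boosts;
  }
}
```

Each parsed boost would then be applied to the corresponding synonym clause when building the disjunction or boolean query.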






[jira] [Commented] (LUCENE-8849) DocValuesRewriteMethod.visit should visit the MTQ

2020-03-03 Thread Michele Palmia (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050461#comment-17050461
 ] 

Michele Palmia commented on LUCENE-8849:


No use case, I'm studying Lucene, found the issue and used it to learn how the 
query visiting system works. Failed at learning how they're normally tested 
though! :)

> DocValuesRewriteMethod.visit should visit the MTQ
> -
>
> Key: LUCENE-8849
> URL: https://issues.apache.org/jira/browse/LUCENE-8849
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-8849.patch
>
>
> The DocValuesRewriteMethod implements the QueryVisitor API (visit method) in 
> a way that surprises me.  It does not visit the wrapped MTQ query.  Shouldn't 
> it?  Here is what I think it should do, similar to other query wrappers:
> {code:java}
> @Override
> public void visit(QueryVisitor visitor) {
>   query.visit(visitor.getSubVisitor(BooleanClause.Occur.MUST, this));
> }
> {code}
> CC [~romseygeek]






[jira] [Commented] (LUCENE-8849) DocValuesRewriteMethod.visit should visit the MTQ

2020-03-03 Thread Michele Palmia (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050460#comment-17050460
 ] 

Michele Palmia commented on LUCENE-8849:


I added a silly one-line patch to fix this. I chose FILTER instead of MUST as 
the rewritten query is really a filter.

I tried to add a test but failed badly - _TestBooleanQuery_  for instance has a 
_testQueryVisitor()_ method that only tests the correctness of the visit, not 
whether the visit actually takes place at all. I would have tested this by 
mocking the _QueryVisitor_, but from what I could gather _Mockito_ is not 
available in core. Any suggestion on how to test this would be a great help!
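One Mockito-free way to test that the visit takes place is a hand-rolled "recording" visitor: subclass (or lambda-implement) the visitor, record everything visited, then assert on the recording. The sketch below uses stand-in `Query`/`QueryVisitor` interfaces to illustrate the pattern only; they are not Lucene's actual classes:

```java
import java.util.ArrayList;
import java.util.List;

public class RecordingVisitorDemo {
  // Minimal stand-ins for the visitor API (invented, not Lucene's).
  interface Query { void visit(QueryVisitor v); }
  interface QueryVisitor { void visitLeaf(Query q); }

  static class TermQuery implements Query {
    final String term;
    TermQuery(String term) { this.term = term; }
    public void visit(QueryVisitor v) { v.visitLeaf(this); }
  }

  static class WrapperQuery implements Query {
    final Query inner;
    WrapperQuery(Query inner) { this.inner = inner; }
    // The fix under discussion: the wrapper must delegate to the wrapped query.
    public void visit(QueryVisitor v) { inner.visit(v); }
  }

  public static void main(String[] args) {
    List<Query> seen = new ArrayList<>();
    // The recorder just collects every visited leaf; the test then asserts
    // the wrapped query actually showed up -- no mocking framework needed.
    new WrapperQuery(new TermQuery("field:x")).visit(seen::add);
    System.out.println("visited leaves: " + seen.size());
  }
}
```

With Lucene's real `QueryVisitor`, the same idea applies: override the relevant callbacks to record, run `query.visit(recorder)`, and assert the wrapped MTQ was reached.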

> DocValuesRewriteMethod.visit should visit the MTQ
> -
>
> Key: LUCENE-8849
> URL: https://issues.apache.org/jira/browse/LUCENE-8849
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-8849.patch
>
>
> The DocValuesRewriteMethod implements the QueryVisitor API (visit method) in 
> a way that surprises me.  It does not visit the wrapped MTQ query.  Shouldn't 
> it?  Here is what I think it should do, similar to other query wrappers:
> {code:java}
> @Override
> public void visit(QueryVisitor visitor) {
>   query.visit(visitor.getSubVisitor(BooleanClause.Occur.MUST, this));
> }
> {code}
> CC [~romseygeek]






[jira] [Resolved] (LUCENE-8681) Prorated early termination

2020-03-03 Thread Michael Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Sokolov resolved LUCENE-8681.
-
Resolution: Won't Fix

There are too many adversarial cases to make this approach useful in any broad 
setting. 

> Prorated early termination
> --
>
> Key: LUCENE-8681
> URL: https://issues.apache.org/jira/browse/LUCENE-8681
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCOllector.create(Sort, int, FieldDoc, int 
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, eg a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level 
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviation to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.
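The per-segment bound described above can be sketched numerically. This is my own illustration of the idea, not the patch's code; the class and method names are invented, and `z` plays the role of the "number of standard deviations" knob (the patch uses 3):

```java
public class ProratedK {
  // For a segment holding fraction p of all documents, collect roughly
  // K*p hits plus z standard deviations of a Binomial(K, p) count,
  // capped at K. Larger z means fewer ranking errors but less speedup;
  // z = Integer.MAX_VALUE would effectively disable leaf-level termination.
  static int prorate(int k, double p, double z) {
    double expected = k * p;
    double stdDev = Math.sqrt(k * p * (1 - p));
    return (int) Math.min(k, Math.ceil(expected + z * stdDev));
  }
}
```

For example, with K = 10 and z = 3, a segment holding half the documents still needs all 10 hits collected, while a segment holding 20% of them needs only 6.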






[jira] [Commented] (LUCENE-8849) DocValuesRewriteMethod.visit should visit the MTQ

2020-03-03 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050458#comment-17050458
 ] 

David Smiley commented on LUCENE-8849:
--

Thanks for the patch Michele Palmia !  Just curious; was there a use-case that 
prompted your interest?

> DocValuesRewriteMethod.visit should visit the MTQ
> -
>
> Key: LUCENE-8849
> URL: https://issues.apache.org/jira/browse/LUCENE-8849
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-8849.patch
>
>
> The DocValuesRewriteMethod implements the QueryVisitor API (visit method) in 
> a way that surprises me.  It does not visit the wrapped MTQ query.  Shouldn't 
> it?  Here is what I think it should do, similar to other query wrappers:
> {code:java}
> @Override
> public void visit(QueryVisitor visitor) {
>   query.visit(visitor.getSubVisitor(BooleanClause.Occur.MUST, this));
> }
> {code}
> CC [~romseygeek]






[jira] [Updated] (LUCENE-8849) DocValuesRewriteMethod.visit should visit the MTQ

2020-03-03 Thread Michele Palmia (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michele Palmia updated LUCENE-8849:
---
Attachment: LUCENE-8849.patch

> DocValuesRewriteMethod.visit should visit the MTQ
> -
>
> Key: LUCENE-8849
> URL: https://issues.apache.org/jira/browse/LUCENE-8849
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: David Smiley
>Priority: Minor
> Attachments: LUCENE-8849.patch
>
>
> The DocValuesRewriteMethod implements the QueryVisitor API (visit method) in 
> a way that surprises me.  It does not visit the wrapped MTQ query.  Shouldn't 
> it?  Here is what I think it should do, similar to other query wrappers:
> {code:java}
> @Override
> public void visit(QueryVisitor visitor) {
>   query.visit(visitor.getSubVisitor(BooleanClause.Occur.MUST, this));
> }
> {code}
> CC [~romseygeek]






[GitHub] [lucene-solr] tflobbe opened a new pull request #1312: Fix resource leak in TestPolicyCloud

2020-03-03 Thread GitBox
tflobbe opened a new pull request #1312: Fix resource leak in TestPolicyCloud
URL: https://github.com/apache/lucene-solr/pull/1312
 
 
   Trivial fix of a test resource leak


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services




[GitHub] [lucene-solr] HoustonPutman commented on issue #1254: SOLR-14259: Back porting SOLR-14013 to Solr 7.7

2020-03-03 Thread GitBox
HoustonPutman commented on issue #1254: SOLR-14259: Back porting SOLR-14013 to 
Solr 7.7
URL: https://github.com/apache/lucene-solr/pull/1254#issuecomment-594083095
 
 
   Hey Noble, is this ready to merge?
   
   I can do some performance testing if that'd help.





[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Tomas Eduardo Fernandez Lobbe (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050416#comment-17050416
 ] 

Tomas Eduardo Fernandez Lobbe commented on SOLR-13942:
--

I understand how you may want this today for monitoring, but I agree with 
Anshum's point about exposing too much of ZooKeeper; it's going in the opposite 
direction from where everyone agreed we should go. I think we shouldn't add new 
APIs to expose ZooKeeper internals here. 

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> example
> download the {{state.json}} of
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> get a list of all children under {{/live_nodes}}
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children show the list of child nodes 
> and their meta data






[jira] [Commented] (LUCENE-9191) Fix linefiledocs compression or replace in tests

2020-03-03 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050413#comment-17050413
 ] 

Michael McCandless commented on LUCENE-9191:


I plan to commit this soon ... it should improve the efficiency of tests using 
{{LineFileDocs}} since the up-front random seeking is much more efficient now.

These docs are derived from the [Europarl parallel corpus 
v7|https://www.statmt.org/europarl/], and then randomly split into smallish 
documents, one per line, and then broken into 20 MB, 200 MB and 2000 MB source 
files (before compression).  I'll commit the 20 MB file here, along with the 
Python script that creates the random files.

I also copied all the files up to {{home.apache.org}}: [200 
MB|http://home.apache.org/~mikemccand/200mb.txt.gz] (and its [.seek 
file|http://home.apache.org/~mikemccand/200mb.txt.seek]), and [2000 
MB|http://home.apache.org/~mikemccand/2000mb.txt.gz] (and its [.seek 
file|http://home.apache.org/~mikemccand/2000mb.txt.seek]), in case developers 
want to test on a wider set of random docs :)

> Fix linefiledocs compression or replace in tests
> 
>
> Key: LUCENE-9191
> URL: https://issues.apache.org/jira/browse/LUCENE-9191
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Major
> Attachments: LUCENE-9191.patch, LUCENE-9191.patch
>
>
> LineFileDocs(random) is very slow, even to open. It does a very slow "random 
> skip" through a gzip compressed file.
> For the analyzers tests, in LUCENE-9186 I simply removed its usage, since 
> TestUtil.randomAnalysisString is superior, and fast. But we should address 
> other tests using it, since LineFileDocs(random) is slow!
> I think it is also the case that every lucene test has probably tested every 
> LineFileDocs line many times now, whereas randomAnalysisString will invent 
> new ones.
> Alternatively, we could "fix" LineFileDocs(random), e.g. special compression 
> options (in blocks)... deflate supports such stuff. But it would make it even 
> hairier than it is now.
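The "special compression options (in blocks)" idea can be sketched with plain JDK gzip: compress each block as an independent gzip member and keep an offset index, so a reader can seek straight to block i without decompressing everything before it. This is an illustrative sketch of the concept only, not the project's code (`BlockGzip` is an invented name):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BlockGzip {
  // Compress each block as its own gzip member, recording the byte offset
  // where each member starts. The offsets list is the ".seek" side file.
  static byte[] compress(List<byte[]> blocks, List<Integer> offsets) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] block : blocks) {
      offsets.add(out.size());
      try (GZIPOutputStream gz = new GZIPOutputStream(out)) { // closing finishes the member
        gz.write(block);
      }
    }
    return out.toByteArray();
  }

  // Decompress only block i by slicing out its member's byte range
  // (GZIPInputStream would otherwise read on through later members).
  static byte[] readBlock(byte[] data, List<Integer> offsets, int i) throws IOException {
    int start = offsets.get(i);
    int end = i + 1 < offsets.size() ? offsets.get(i + 1) : data.length;
    try (GZIPInputStream gz =
        new GZIPInputStream(new ByteArrayInputStream(data, start, end - start))) {
      return gz.readAllBytes();
    }
  }
}
```

A random "skip" then costs one seek plus one block's decompression instead of a scan through the whole compressed file.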






[jira] [Commented] (SOLR-14301) Remove external commons-codec usage in gradle validateJarChecksums

2020-03-03 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050395#comment-17050395
 ] 

Lucene/Solr QA commented on SOLR-14301:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
|| || || || {color:brown} master Compile Tests {color} ||
|| || || || {color:brown} Patch Compile Tests {color} ||
|| || || || {color:brown} Other Tests {color} ||
| {color:black}{color} | {color:black} {color} | {color:black}  0m  8s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-14301 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12995478/SOLR-14301-01.patch |
| Optional Tests |  |
| uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 
10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / bc6fa3b6506 |
| ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 |
| modules | C: . U: . |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/694/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> Remove external commons-codec usage in gradle validateJarChecksums
> --
>
> Key: SOLR-14301
> URL: https://issues.apache.org/jira/browse/SOLR-14301
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Andras Salamon
>Priority: Minor
> Attachments: SOLR-14301-01.patch
>
>
> Right now gradle calculates SHA-1 checksums using an external 
> {{commons-codec}} library. We can calculate SHA-1 using Java 8 classes, no 
> need for commons-codec here.
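The JDK-only replacement is a few lines with `java.security.MessageDigest`. This is an illustrative sketch of the approach, not the attached patch itself:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Hex {
  // SHA-1 hex digest using only the JDK -- the stand-in for
  // commons-codec's DigestUtils.sha1Hex that the issue proposes.
  static String sha1Hex(byte[] input) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-1").digest(input);
      StringBuilder sb = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) { // SHA-1 is mandated in every JRE
      throw new AssertionError(e);
    }
  }
}
```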






[jira] [Resolved] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Ishan Chattopadhyaya (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya resolved SOLR-13942.
-
Fix Version/s: 8.5
   Resolution: Fixed

bq. I disagree with your statement above that the community's position is "No 
data in ZK is private". To the contrary, in the ref-guide itself we suggest to 
users that there are valid reasons to control ZK read access.

Ref guide says, "You might even want to limit read-access, if you think there 
is stuff in ZooKeeper that not everyone should know about. Or you might just in 
general work on a need-to-know basis". The endpoint that we have today can't be 
locked down using authentication; this new endpoint can be.

bq. Redundant Functionality - Our Solr distribution already offers a handful of 
ways to access ZK data. bin/solr, zkcli.sh, the /admin/zookeeper endpoint, 
various Collections API commands, etc. And where these aren't 
available/sufficient, there are plenty of external ZK clients to use. 
bin/solr and zkcli.sh require direct access to ZK, and are hence insecure. 
/admin/zookeeper is ugly, and this one is not; I'm +1 to deprecating that one 
in favour of this one. External ZK clients again need access to ZK, which is 
insecure.

bq. Slimness Concerns - I'm a little surprised honestly that you and Ishan have 
taken this up, when you've been so instrumental in getting people focused on 
slimming Solr down and cutting everything out of core that's not related to 
Solr's raison d'etre.
This issue is totally unrelated to that goal (which we'll continue to follow).

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> example
> download the {{state.json}} of
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> get a list of all children under {{/live_nodes}}
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children show the list of child nodes 
> and their meta data






[jira] [Commented] (LUCENE-9258) DocTermsIndexDocValues should not assume it's operating on a SortedDocValues field

2020-03-03 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050381#comment-17050381
 ] 

Lucene/Solr QA commented on LUCENE-9258:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
18s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  0m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  0m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  0m 18s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
30s{color} | {color:green} queries in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}  3m 11s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-9258 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12995374/LUCENE-9258.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  |
| uname | Linux lucene2-us-west.apache.org 4.4.0-170-generic #199-Ubuntu SMP 
Thu Nov 14 01:45:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / bc6fa3b |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | LTS |
|  Test Results | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/255/testReport/ |
| modules | C: lucene/queries U: lucene/queries |
| Console output | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/255/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> DocTermsIndexDocValues should not assume it's operating on a SortedDocValues 
> field
> --
>
> Key: LUCENE-9258
> URL: https://issues.apache.org/jira/browse/LUCENE-9258
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.4
>Reporter: Michele Palmia
>Priority: Minor
> Attachments: LUCENE-9258.patch
>
>
> When requesting a new _ValueSourceScorer_ (with _getRangeScorer_) from 
> _DocTermsIndexDocValues_ , the latter instantiates a new iterator on 
> _SortedDocValues_ regardless of the fact that the underlying field can 
> actually be of a different type (e.g. a _SortedSetDocValues_ processed 
> through a _SortedSetSelector_).






[GitHub] [lucene-solr] s1monw commented on issue #1274: LUCENE-9164: Prevent IW from closing gracefully if threads are still modifying

2020-03-03 Thread GitBox
s1monw commented on issue #1274: LUCENE-9164: Prevent IW from closing 
gracefully if threads are still modifying
URL: https://github.com/apache/lucene-solr/pull/1274#issuecomment-594052954
 
 
   > It's scary we are adding yet another synchronization tool to IndexWriter; 
I think we need to add comments to make it clear how it interacts with at least 
IndexWriter's monitor lock (synchronized).
   
   I agree with you. I do wonder if it's worth it. It might be better to go 
back and simplify things (I've been trying for years now) rather than making it 
more complicated. The change proposed here might be the better solution after 
all: https://github.com/apache/lucene-solr/pull/1215
   





[jira] [Resolved] (LUCENE-9253) Support custom dictionaries in KoreanTokenizer

2020-03-03 Thread Namgyu Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namgyu Kim resolved LUCENE-9253.

Fix Version/s: 8.5
   master (9.0)
   Resolution: Fixed

> Support custom dictionaries in KoreanTokenizer
> --
>
> Key: LUCENE-9253
> URL: https://issues.apache.org/jira/browse/LUCENE-9253
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Namgyu Kim
>Assignee: Namgyu Kim
>Priority: Major
> Fix For: master (9.0), 8.5
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> KoreanTokenizer does not yet support custom dictionaries (system, unknown), 
> even though Nori provides DictionaryBuilder that creates custom dictionary.
> In the current state, it is very difficult for Nori users to use a custom 
> dictionary.
> Therefore, we need to open a new constructor that uses it.
> Kuromoji already supports this (LUCENE-8971), and I referenced it.






[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW from closing gracefully if threads are still modifying

2020-03-03 Thread GitBox
mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW 
from closing gracefully if threads are still modifying
URL: https://github.com/apache/lucene-solr/pull/1274#discussion_r387130934
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -2275,57 +2265,73 @@ private void rollbackInternalNoCommit() throws 
IOException {
 
   docWriter.close(); // mark it as closed first to prevent subsequent 
indexing actions/flushes
   assert !Thread.holdsLock(this) : "IndexWriter lock should never be hold 
when aborting";
-  docWriter.abort(); // don't sync on IW here
-  docWriter.flushControl.waitForFlush(); // wait for all concurrently 
running flushes
+  docWriter.abort(); // don't sync on IW here - this waits for all 
concurrently running flushes
   publishFlushedSegments(true); // empty the flush ticket queue otherwise 
we might not have cleaned up all resources
-  synchronized (this) {
-
-if (pendingCommit != null) {
-  pendingCommit.rollbackCommit(directory);
-  try {
-deleter.decRef(pendingCommit);
-  } finally {
-pendingCommit = null;
-notifyAll();
+  // we might rollback due to a tragic event which means we potentially 
already have
+  // a lease; in this case we can't acquire all leases. In this case 
rolling back is best effort in terms
+  // of letting all threads gracefully finish. In such a situation some 
threads might run into
+  // AlreadyClosedExceptions in places they normally wouldn't which 
doesn't have any impact on
+  // correctness or consistency. The tragic event is fatal anyway.
+  final int leases;
+  if (gracefully) { // in the case
+leases = Integer.MAX_VALUE;
+modificationLease.acquireUninterruptibly(leases);
 
 Review comment:
   This will also block until any other threads holding modification leases 
close them, and then will acquire all leases and prevent other threads from 
doing so?  Can you add a comment?





[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW from closing gracefully if threads are still modifying

2020-03-03 Thread GitBox
mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW 
from closing gracefully if threads are still modifying
URL: https://github.com/apache/lucene-solr/pull/1274#discussion_r386976022
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -1444,6 +1446,7 @@ public synchronized long tryUpdateDocValue(IndexReader 
readerIn, int docID, Fiel
   }
 
   private synchronized long tryModifyDocument(IndexReader readerIn, int docID, 
DocModifier toApply) throws IOException {
+// no modificationLease allowed here since we are synchronized on the IW
 
 Review comment:
   Maybe change `allowed` to `needed`?





[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW from closing gracefully if threads are still modifying

2020-03-03 Thread GitBox
mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW 
from closing gracefully if threads are still modifying
URL: https://github.com/apache/lucene-solr/pull/1274#discussion_r387129182
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -2244,21 +2233,22 @@ public void rollback() throws IOException {
 // Ensure that only one thread actually gets to do the
 // closing, and make sure no commit is also in progress:
 if (shouldClose(true)) {
-  rollbackInternal();
+  rollbackInternal(true);
 }
   }
 
-  private void rollbackInternal() throws IOException {
+  private void rollbackInternal(boolean gracefully) throws IOException {
 
 Review comment:
   Can you add private javadocs explaining what `gracefully` means?





[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW from closing gracefully if threads are still modifying

2020-03-03 Thread GitBox
mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW 
from closing gracefully if threads are still modifying
URL: https://github.com/apache/lucene-solr/pull/1274#discussion_r386975715
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -326,6 +327,9 @@ static int getActualMaxDocs() {
   private long mergeGen;
   private boolean stopMerges;
   private boolean didMessageState;
+  // This allows to ensure that all modifying threads have left IW before we 
are closing / rolling back
+  // see {@link IndexWriter#rollbackInternal}
+  private final Semaphore modificationLease = new Semaphore(Integer.MAX_VALUE, 
true);
 
 Review comment:
   Why are we passing `true` for `fair`?  I guess it's to prevent starvation of 
one thread trying to close, while other threads continue indexing?  Maybe leave 
a comment?
   
   Also, maybe add comment explaining lock acquisition order?  At least, it 
seems if we are `synchronized` we do not need the `modificationLease`?
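
The fair-semaphore lease pattern under review can be sketched with plain `java.util.concurrent` types. This is an illustrative stand-in, not IndexWriter's actual code: each modifying thread holds one permit, and a closer acquires *all* permits, so it blocks until every modifier has finished; `fair = true` keeps a waiting closer from being starved by a steady stream of new modifiers.

```java
import java.util.concurrent.Semaphore;

public class LeaseDemo {
    // All permits available means no thread is currently modifying.
    static final Semaphore lease = new Semaphore(Integer.MAX_VALUE, true);

    static void modify(Runnable work) throws InterruptedException {
        lease.acquire();          // hold one permit while modifying
        try {
            work.run();
        } finally {
            lease.release();
        }
    }

    static void closeGracefully() throws InterruptedException {
        // Blocks until all Integer.MAX_VALUE permits are available,
        // i.e. no modifier holds a lease. Permits are never released,
        // so later modifiers block (a real impl would fail them instead).
        lease.acquire(Integer.MAX_VALUE);
    }

    public static void main(String[] args) throws Exception {
        Thread t = new Thread(() -> {
            try {
                modify(() -> System.out.println("modifying"));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        t.join();
        closeGracefully();
        System.out.println("closed with no modifiers active");
    }
}
```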





[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW from closing gracefully if threads are still modifying

2020-03-03 Thread GitBox
mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW 
from closing gracefully if threads are still modifying
URL: https://github.com/apache/lucene-solr/pull/1274#discussion_r387130244
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -2275,57 +2265,73 @@ private void rollbackInternalNoCommit() throws 
IOException {
 
   docWriter.close(); // mark it as closed first to prevent subsequent 
indexing actions/flushes
   assert !Thread.holdsLock(this) : "IndexWriter lock should never be hold 
when aborting";
-  docWriter.abort(); // don't sync on IW here
-  docWriter.flushControl.waitForFlush(); // wait for all concurrently 
running flushes
+  docWriter.abort(); // don't sync on IW here - this waits for all 
concurrently running flushes
   publishFlushedSegments(true); // empty the flush ticket queue otherwise 
we might not have cleaned up all resources
-  synchronized (this) {
-
-if (pendingCommit != null) {
-  pendingCommit.rollbackCommit(directory);
-  try {
-deleter.decRef(pendingCommit);
-  } finally {
-pendingCommit = null;
-notifyAll();
+  // we might rollback due to a tragic event which means we potentially 
already have
+  // a lease in this case we we can't acquire all leases. In this case 
rolling back in best effort in terms
+  // off letting all threads gracefully finish. In such a situation some 
threads might run into
+  // AlreadyClosedExceptions in places they normally wouldn't which 
doesn't have any impact on
+  // correctness or consistency. The tragic event is fatal anyway.
+  final int leases;
+  if (gracefully) { // in the case
+leases = Integer.MAX_VALUE;
+modificationLease.acquireUninterruptibly(leases);
+  } else {
+// still try to drain all permits to prevent any new threads modifying 
the index.
+leases = modificationLease.drainPermits();
 
 Review comment:
   This will block until all threads return their leases?  Can you add a 
comment to that effect?
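
For reference, `Semaphore.drainPermits` does not block: it atomically takes whatever permits are currently available and returns immediately, without waiting for outstanding leases to be returned. A quick self-contained demonstration (numbers are arbitrary, just to show the semantics):

```java
import java.util.concurrent.Semaphore;

public class DrainDemo {
    // Returns {drained, availableAfter} for a 10-permit semaphore with
    // 3 permits held: drainPermits() grabs the 7 available permits and
    // returns at once -- it does NOT wait for the 3 outstanding leases.
    static int[] demo() {
        Semaphore lease = new Semaphore(10, true);
        lease.acquireUninterruptibly(3);    // three "modifiers" hold leases
        int drained = lease.drainPermits(); // non-blocking
        return new int[] { drained, lease.availablePermits() };
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("drained=" + r[0] + " available=" + r[1]);
        // -> drained=7 available=0
    }
}
```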





[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW from closing gracefully if threads are still modifying

2020-03-03 Thread GitBox
mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW 
from closing gracefully if threads are still modifying
URL: https://github.com/apache/lucene-solr/pull/1274#discussion_r387129545
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -2275,57 +2265,73 @@ private void rollbackInternalNoCommit() throws 
IOException {
 
   docWriter.close(); // mark it as closed first to prevent subsequent 
indexing actions/flushes
   assert !Thread.holdsLock(this) : "IndexWriter lock should never be hold 
when aborting";
-  docWriter.abort(); // don't sync on IW here
-  docWriter.flushControl.waitForFlush(); // wait for all concurrently 
running flushes
+  docWriter.abort(); // don't sync on IW here - this waits for all 
concurrently running flushes
   publishFlushedSegments(true); // empty the flush ticket queue otherwise 
we might not have cleaned up all resources
-  synchronized (this) {
-
-if (pendingCommit != null) {
-  pendingCommit.rollbackCommit(directory);
-  try {
-deleter.decRef(pendingCommit);
-  } finally {
-pendingCommit = null;
-notifyAll();
+  // we might rollback due to a tragic event which means we potentially 
already have
+  // a lease in this case we we can't acquire all leases. In this case 
rolling back in best effort in terms
+  // off letting all threads gracefully finish. In such a situation some 
threads might run into
 
 Review comment:
   s/`off`/`of`





[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW from closing gracefully if threads are still modifying

2020-03-03 Thread GitBox
mikemccand commented on a change in pull request #1274: LUCENE-9164: Prevent IW 
from closing gracefully if threads are still modifying
URL: https://github.com/apache/lucene-solr/pull/1274#discussion_r387128166
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -1554,20 +1554,24 @@ public long deleteDocuments(Query... queries) throws 
IOException {
   }
 }
 
-try {
-  long seqNo = docWriter.deleteQueries(queries);
-  if (seqNo < 0) {
-seqNo = -seqNo;
-processEvents(true);
-  }
-
-  return seqNo;
+try (Closeable finalizer = acquireModificationLease()) {
+  return maybeProcessEvents(docWriter.deleteQueries(queries));
 } catch (VirtualMachineError tragedy) {
   tragicEvent(tragedy, "deleteDocuments(Query..)");
   throw tragedy;
 }
   }
 
+  private Closeable acquireModificationLease() {
+assert Thread.holdsLock(this) == false : "Can't acquire modification lock 
while holding the IW lock";
 
 Review comment:
   Can/should we also assert the reverse?  I.e. whenever we `sync` on `this`, 
`assert` that the `modificationLease` is not held?  They are mutually exclusive?
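
The `Closeable`-returning lease style in the diff can be mimicked in isolation. This is a sketch (the class, `heldCount`, and the numbers are illustrative, not IndexWriter's real code); the point of returning `Closeable` is that callers release via try-with-resources, so a permit cannot leak on an exception path.

```java
import java.io.Closeable;
import java.util.concurrent.Semaphore;

public class CloseableLease {
    static final Semaphore lease = new Semaphore(Integer.MAX_VALUE, true);

    // Returning a Closeable lets callers release the permit with
    // try-with-resources, so it cannot leak if the body throws.
    static Closeable acquireLease() {
        lease.acquireUninterruptibly();
        return lease::release;
    }

    // How many leases are currently held (illustrative helper).
    static int heldCount() {
        return Integer.MAX_VALUE - lease.availablePermits();
    }

    public static void main(String[] args) throws Exception {
        try (Closeable finalizer = acquireLease()) {
            System.out.println("held=" + heldCount());   // held=1
        }
        System.out.println("held=" + heldCount());       // held=0
    }
}
```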





[jira] [Commented] (LUCENE-9257) FSTLoadMode should not be BlockTree specific as it is used more generally in index package

2020-03-03 Thread juan camilo rodriguez duran (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050346#comment-17050346
 ] 

juan camilo rodriguez duran commented on LUCENE-9257:
-

If the benchmarks are consistent and show good results, +1 to Adrien's comment.

> FSTLoadMode should not be BlockTree specific as it is used more generally in 
> index package
> --
>
> Key: LUCENE-9257
> URL: https://issues.apache.org/jira/browse/LUCENE-9257
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> FSTLoadMode and its associate attribute key (static String) are currently 
> defined in BlockTreeTermsReader, but they are actually used outside of 
> BlockTree in the general "index" package.
> CheckIndex and ReadersAndUpdates are using these enum and attribute key to 
> drive the FST load mode through the SegmentReader which is not specific to a 
> postings format. They have an unnecessary dependency to BlockTreeTermsReader.
> We could move FSTLoadMode out of BlockTreeTermsReader, to make it a public 
> enum of the "index" package. That way CheckIndex and ReadersAndUpdates do not 
> import anymore BlockTreeTermsReader.
> This would also allow other postings formats to use the same enum (e.g. 
> LUCENE-9254)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14301) Remove external commons-codec usage in gradle validateJarChecksums

2020-03-03 Thread Andras Salamon (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Salamon updated SOLR-14301:
--
Attachment: SOLR-14301-01.patch

> Remove external commons-codec usage in gradle validateJarChecksums
> --
>
> Key: SOLR-14301
> URL: https://issues.apache.org/jira/browse/SOLR-14301
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Andras Salamon
>Priority: Minor
> Attachments: SOLR-14301-01.patch
>
>
> Right now gradle calculates SHA-1 checksums using an external 
> {{commons-codec}} library. We can calculate SHA-1 using Java 8 classes, no 
> need for commons-codec here.






[jira] [Updated] (SOLR-14301) Remove external commons-codec usage in gradle validateJarChecksums

2020-03-03 Thread Andras Salamon (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Salamon updated SOLR-14301:
--
Status: Patch Available  (was: Open)

> Remove external commons-codec usage in gradle validateJarChecksums
> --
>
> Key: SOLR-14301
> URL: https://issues.apache.org/jira/browse/SOLR-14301
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Andras Salamon
>Priority: Minor
> Attachments: SOLR-14301-01.patch
>
>
> Right now gradle calculates SHA-1 checksums using an external 
> {{commons-codec}} library. We can calculate SHA-1 using Java 8 classes, no 
> need for commons-codec here.






[jira] [Created] (SOLR-14301) Remove external commons-codec usage in gradle validateJarChecksums

2020-03-03 Thread Andras Salamon (Jira)
Andras Salamon created SOLR-14301:
-

 Summary: Remove external commons-codec usage in gradle 
validateJarChecksums
 Key: SOLR-14301
 URL: https://issues.apache.org/jira/browse/SOLR-14301
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Andras Salamon


Right now gradle calculates SHA-1 checksums using an external {{commons-codec}} 
library. We can calculate SHA-1 using Java 8 classes, no need for commons-codec 
here.
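
The replacement the issue suggests is straightforward with the JDK's own `MessageDigest`. A sketch of the idea (not the actual Gradle task code):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Sha1Hex {
    // Compute a lowercase hex SHA-1 digest using only JDK classes,
    // matching what commons-codec's DigestUtils.sha1Hex returns.
    static String sha1Hex(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(data);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Known SHA-1 test vector for "abc"
        System.out.println(sha1Hex("abc".getBytes(StandardCharsets.UTF_8)));
        // -> a9993e364706816aba3e25717850c26c9cd0d89d
    }
}
```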






[jira] [Updated] (LUCENE-9258) DocTermsIndexDocValues should not assume it's operating on a SortedDocValues field

2020-03-03 Thread Michele Palmia (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michele Palmia updated LUCENE-9258:
---
Status: Patch Available  (was: Open)

> DocTermsIndexDocValues should not assume it's operating on a SortedDocValues 
> field
> --
>
> Key: LUCENE-9258
> URL: https://issues.apache.org/jira/browse/LUCENE-9258
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.4
>Reporter: Michele Palmia
>Priority: Minor
> Attachments: LUCENE-9258.patch
>
>
> When requesting a new _ValueSourceScorer_ (with _getRangeScorer_) from 
> _DocTermsIndexDocValues_ , the latter instantiates a new iterator on 
> _SortedDocValues_ regardless of the fact that the underlying field can 
> actually be of a different type (e.g. a _SortedSetDocValues_ processed 
> through a _SortedSetSelector_).






[GitHub] [lucene-solr] jpountz opened a new pull request #1311: LUCENE-9260: Verify checksums of CFS files.

2020-03-03 Thread GitBox
jpountz opened a new pull request #1311: LUCENE-9260: Verify checksums of CFS 
files.
URL: https://github.com/apache/lucene-solr/pull/1311
 
 
   See https://issues.apache.org/jira/browse/LUCENE-8833.





[jira] [Created] (LUCENE-9260) Verify checksums of CFS files?

2020-03-03 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-9260:


 Summary: Verify checksums of CFS files?
 Key: LUCENE-9260
 URL: https://issues.apache.org/jira/browse/LUCENE-9260
 Project: Lucene - Core
  Issue Type: Task
Reporter: Adrien Grand


While CFS files write checksums in their footer, we never validate these 
checksums. Can we verify them in LeafReader#checkIntegrity?

This checksum is a bit redundant with the checksums of the files that are 
stored in the CFS file, but I'd rather verify some bytes multiple times than 
have checksums that never get verified?
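
The footer-checksum idea can be illustrated with a simplified stand-in (Lucene does this via `CodecUtil` with CRC32; this sketch only demonstrates the write-footer/verify-on-check shape, not the real file format):

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class FooterChecksum {
    // Append a CRC32 of the payload as an 8-byte footer.
    static byte[] writeWithFooter(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteBuffer out = ByteBuffer.allocate(payload.length + 8);
        out.put(payload).putLong(crc.getValue());
        return out.array();
    }

    // Recompute the checksum over the payload and compare it with the
    // footer -- what a checkIntegrity-style pass would do.
    static boolean verify(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file);
        byte[] payload = new byte[file.length - 8];
        in.get(payload);
        long stored = in.getLong();
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue() == stored;
    }

    public static void main(String[] args) {
        byte[] file = writeWithFooter("segment data".getBytes());
        System.out.println(verify(file));   // true
        file[0] ^= 1;                       // flip a bit: corruption
        System.out.println(verify(file));   // false
    }
}
```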






[jira] [Commented] (LUCENE-9259) NGramFilter use wrong argument name for preserve option

2020-03-03 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050263#comment-17050263
 ] 

Tomoko Uchida commented on LUCENE-9259:
---

[~Paul Pazderski] thanks, good catch. The fix looks good to me (it preserves 
backward compatibility); let me check the tests and documentation.

> NGramFilter use wrong argument name for preserve option
> ---
>
> Key: LUCENE-9259
> URL: https://issues.apache.org/jira/browse/LUCENE-9259
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.4, 8.0
>Reporter: Paul Pazderski
>Priority: Minor
> Attachments: LUCENE-9259.patch
>
>
> LUCENE-7960 added the possibility to preserve the original term when using 
> NGram filters. The documentation says to enable it with 'preserveOriginal' 
> and it works for EdgeNGram filter. But NGram filter requires the initial 
> planned option 'keepShortTerms' to enable this feature.
> This inconsistency is confusing. I'll provide a patch with a possible fix.






[jira] [Commented] (LUCENE-9259) NGramFilter use wrong argument name for preserve option

2020-03-03 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050255#comment-17050255
 ] 

Lucene/Solr QA commented on LUCENE-9259:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
52s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  0m 23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate ref guide {color} | 
{color:green}  0m 20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
46s{color} | {color:green} common in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}  7m 30s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-9259 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12995465/LUCENE-9259.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  validaterefguide  |
| uname | Linux lucene2-us-west.apache.org 4.4.0-170-generic #199-Ubuntu SMP 
Thu Nov 14 01:45:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / bc6fa3b |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | LTS |
|  Test Results | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/254/testReport/ |
| modules | C: lucene/analysis/common solr/solr-ref-guide U: . |
| Console output | 
https://builds.apache.org/job/PreCommit-LUCENE-Build/254/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> NGramFilter use wrong argument name for preserve option
> ---
>
> Key: LUCENE-9259
> URL: https://issues.apache.org/jira/browse/LUCENE-9259
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.4, 8.0
>Reporter: Paul Pazderski
>Priority: Minor
> Attachments: LUCENE-9259.patch
>
>
> LUCENE-7960 added the possibility to preserve the original term when using 
> NGram filters. The documentation says to enable it with 'preserveOriginal' 
> and it works for EdgeNGram filter. But NGram filter requires the initial 
> planned option 'keepShortTerms' to enable this feature.
> This inconsistency is confusing. I'll provide a patch with a possible fix.






[jira] [Commented] (SOLR-13942) /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread Jason Gerlowski (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050254#comment-17050254
 ] 

Jason Gerlowski commented on SOLR-13942:


I think there are more objections here than you're addressing, Noble.  
Discussion has focused on the security aspect and ignored other objections that 
I don't want to see get lost.  So I figured it'd be good to summarize the 
objections so far:

# *Security Concerns* - I disagree with your statement above that the 
community's position is "No data in ZK is private".  To the contrary, in the 
[ref-guide|https://lucene.apache.org/solr/guide/8_3/zookeeper-access-control.html]
 itself we suggest to users that there are valid reasons to control ZK read 
access.  I understand your usecase about wanting to lock down ZK access but 
support read access for accessing broader debugging information.  But what 
specific ZK information are you imagining access for that isn't already exposed 
by other Solr admin APIs?  And why aren't ZK ACLs sufficient to tackle this? 
# *Public API/Abstraction Concerns* - Anshum's point above is a good one - 
we're trying to treat ZK more and more as an implementation detail.  The 
community has tackled several JIRAs in the last few years around making ZK less 
noticeable.  A good example is SOLR-9784, which you yourself proposed.  Yes, we 
have other APIs which expose ZK information, but the prevailing sentiment in 
the community has cut against those.  Their existence isn't an argument to 
double down and create new ZK-leaking endpoints.
# *Redundant Functionality* - Our Solr distribution already offers a handful of 
ways to access ZK data. bin/solr, zkcli.sh, the /admin/zookeeper endpoint, 
various Collections API commands, etc.  And where these aren't 
available/sufficient, there are plenty of external ZK clients to use.  Yes, 
these would require installation, but {{wget  && tar -xvf 
 && bin/zkCli.sh ...}} doesn't seem onerous enough to merit 
giving Solr a set of ZK client APIs.
# *Slimness Concerns* - I'm a little surprised honestly that you and Ishan have 
taken this up, when you've been so instrumental in getting people focused on 
slimming Solr down and cutting everything out of core that's not related to 
Solr's raison d'etre.  I think that's been a great focus to see in terms of 
helping with Solr's maintainability and stability.  This seems to undercut that 
a bit - with the renewed focus on slimming down and with so many other options 
available, does Solr really need to take on the added surface area of being a 
fully fledged ZK client?

> /api/cluster/zk/* to fetch raw ZK data
> --
>
> Key: SOLR-13942
> URL: https://issues.apache.org/jira/browse/SOLR-13942
> Project: Solr
>  Issue Type: Bug
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> example
> download the {{state.json}} of
> {code}
> GET http://localhost:8983/api/cluster/zk/collections/gettingstarted/state.json
> {code}
> get a list of all children under {{/live_nodes}}
> {code}
> GET http://localhost:8983/api/cluster/zk/live_nodes
> {code}
> If the requested path is a node with children show the list of child nodes 
> and their meta data






[jira] [Reopened] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

2020-03-03 Thread Michael Sokolov (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Sokolov reopened LUCENE-8962:
-

There are still test failures being reported by Jenkins. Here's one that 
reproduced for me:

[repro] Repro line:  ant test  -Dtestcase=TestIndexWriterMergePolicy 
-Dtests.method=testMergeOnCommit -Dtests.seed=33907689C9809E73 
-Dtests.multiplier=2 -Dtests.slow=true 
-Dtests.locale=th-TH-u-nu-thai-x-lvariant-TH -Dtests.timezone=Africa/Banjul 
-Dtests.asserts=true -Dtests.file.encoding=UTF-8

> Can we merge small segments during refresh, for faster searching?
> -
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 8.5
>
> Attachments: LUCENE-8962_demo.png
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will accumulate write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter'}}s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!






[jira] [Updated] (LUCENE-9259) NGramFilter use wrong argument name for preserve option

2020-03-03 Thread Paul Pazderski (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Pazderski updated LUCENE-9259:
---
Status: Patch Available  (was: Open)

> NGramFilter use wrong argument name for preserve option
> ---
>
> Key: LUCENE-9259
> URL: https://issues.apache.org/jira/browse/LUCENE-9259
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.4, 8.0
>Reporter: Paul Pazderski
>Priority: Minor
> Attachments: LUCENE-9259.patch
>
>
> LUCENE-7960 added the possibility to preserve the original term when using 
> NGram filters. The documentation says to enable it with 'preserveOriginal' 
> and it works for EdgeNGram filter. But NGram filter requires the initial 
> planned option 'keepShortTerms' to enable this feature.
> This inconsistency is confusing. I'll provide a patch with a possible fix.






[jira] [Updated] (LUCENE-9259) NGramFilter use wrong argument name for preserve option

2020-03-03 Thread Paul Pazderski (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Pazderski updated LUCENE-9259:
---
Attachment: LUCENE-9259.patch

> NGramFilter use wrong argument name for preserve option
> ---
>
> Key: LUCENE-9259
> URL: https://issues.apache.org/jira/browse/LUCENE-9259
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 7.4, 8.0
>Reporter: Paul Pazderski
>Priority: Minor
> Attachments: LUCENE-9259.patch
>
>
> LUCENE-7960 added the possibility to preserve the original term when using 
> NGram filters. The documentation says to enable it with 'preserveOriginal' 
> and it works for EdgeNGram filter. But NGram filter requires the initial 
> planned option 'keepShortTerms' to enable this feature.
> This inconsistency is confusing. I'll provide a patch with a possible fix.






[jira] [Created] (LUCENE-9259) NGramFilter use wrong argument name for preserve option

2020-03-03 Thread Paul Pazderski (Jira)
Paul Pazderski created LUCENE-9259:
--

 Summary: NGramFilter use wrong argument name for preserve option
 Key: LUCENE-9259
 URL: https://issues.apache.org/jira/browse/LUCENE-9259
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 8.0, 7.4
Reporter: Paul Pazderski


LUCENE-7960 added the possibility to preserve the original term when using 
NGram filters. The documentation says to enable it with 'preserveOriginal' and 
it works for EdgeNGram filter. But NGram filter requires the initial planned 
option 'keepShortTerms' to enable this feature.

This inconsistency is confusing. I'll provide a patch with a possible fix.
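
For reference, the mismatch shows up in factory configuration roughly like this (a hypothetical Solr schema fragment built from the attribute names the issue mentions; the patch is what reconciles them):

```xml
<!-- EdgeNGram: the documented attribute name works -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="5"
        preserveOriginal="true"/>
<!-- NGram: before the fix, only the originally planned name takes effect -->
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="5"
        keepShortTerms="true"/>
```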






[jira] [Commented] (SOLR-13350) Explore collector managers for multi-threaded search

2020-03-03 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050227#comment-17050227
 ] 

Ishan Chattopadhyaya commented on SOLR-13350:
-

bq. One issue that I remember is that there is difficulty in concurrently using 
a DocSetCollector
Thanks [~dsmiley]. Here, I've not used a shared DocSetCollector, but every 
thread uses its own, and I'm intersecting all of those at the end of the 
search. This might not be as performant, but perhaps good for a start.
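
The per-thread-collector-then-combine idea can be illustrated with plain JDK types. This is a simplified analogue using `BitSet` slices (Lucene's `DocSetCollector` and `CollectorManager` APIs differ): each task collects matches for its own slice into a private bit set, and the slices are combined only at the end, so there is no shared mutable collector during the search.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadCollect {
    // Collect "matching" docs (here: doc % 3 == 0 as a stand-in query)
    // across disjoint slices in parallel, one private BitSet per task,
    // then OR the per-task results together.
    static BitSet parallelCollect(int maxDoc, int slices) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(slices);
        try {
            int sliceSize = (maxDoc + slices - 1) / slices;
            List<Future<BitSet>> futures = new ArrayList<>();
            for (int s = 0; s < slices; s++) {
                final int from = s * sliceSize;
                final int to = Math.min(maxDoc, from + sliceSize);
                futures.add(pool.submit(() -> {
                    BitSet local = new BitSet(maxDoc); // private per task
                    for (int doc = from; doc < to; doc++) {
                        if (doc % 3 == 0) local.set(doc);
                    }
                    return local;
                }));
            }
            BitSet result = new BitSet(maxDoc);
            for (Future<BitSet> f : futures) {
                result.or(f.get());  // combine slices at the end
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        BitSet hits = parallelCollect(100, 4);
        System.out.println("hits=" + hits.cardinality());
        // -> hits=34 (docs divisible by 3 in [0,100))
    }
}
```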

> Explore collector managers for multi-threaded search
> 
>
> Key: SOLR-13350
> URL: https://issues.apache.org/jira/browse/SOLR-13350
> Project: Solr
>  Issue Type: New Feature
>Reporter: Ishan Chattopadhyaya
>Assignee: Ishan Chattopadhyaya
>Priority: Major
> Attachments: SOLR-13350.patch, SOLR-13350.patch, SOLR-13350.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> AFAICT, SolrIndexSearcher can be used only to search all the segments of an 
> index in series. However, using CollectorManagers, segments can be searched 
> concurrently and result in reduced latency. Opening this issue to explore the 
> effectiveness of using CollectorManagers in SolrIndexSearcher from latency 
> and throughput perspective.






[jira] [Comment Edited] (SOLR-13350) Explore collector managers for multi-threaded search

2020-03-03 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050015#comment-17050015
 ] 

Ishan Chattopadhyaya edited comment on SOLR-13350 at 3/3/20 1:47 PM:
-

-FYI, I'm trying to update the patch to latest master. It is taking me longer 
due to changes in SOLR-13892.-
Added a new PR for this (as the earlier one fell completely out of date): 
https://github.com/apache/lucene-solr/pull/1310
I'd appreciate it if someone could review it.


was (Author: ichattopadhyaya):
FYI, I'm trying to update the patch to latest master. It is taking me longer 
due to changes in SOLR-13892.







[GitHub] [lucene-solr] chatman opened a new pull request #1310: SOLR-13350: Multithreaded search using collector managers

2020-03-03 Thread GitBox
chatman opened a new pull request #1310: SOLR-13350: Multithreaded search using 
collector managers
URL: https://github.com/apache/lucene-solr/pull/1310
 
 
   This is almost complete. Here, all queries are multi-threaded.
   TODO: Implement a query-time parameter (default: off) to enable 
multi-threaded searching.
   
   Tests pass.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9257) FSTLoadMode should not be BlockTree specific as it is used more generally in index package

2020-03-03 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050181#comment-17050181
 ] 

Adrien Grand commented on LUCENE-9257:
--

What about removing FSTLoadMode and always loading the FST off-heap? I've been 
running benchmarks with OffheapFSTStore and the overhead is very small for 
end-to-end indexing.

I think something like LUCENE-8833 may be a better way to configure what 
should be in memory and what should not.

> FSTLoadMode should not be BlockTree specific as it is used more generally in 
> index package
> --
>
> Key: LUCENE-9257
> URL: https://issues.apache.org/jira/browse/LUCENE-9257
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> FSTLoadMode and its associate attribute key (static String) are currently 
> defined in BlockTreeTermsReader, but they are actually used outside of 
> BlockTree in the general "index" package.
> CheckIndex and ReadersAndUpdates use this enum and attribute key to drive 
> the FST load mode through the SegmentReader, which is not specific to a 
> postings format, so they have an unnecessary dependency on 
> BlockTreeTermsReader.
> We could move FSTLoadMode out of BlockTreeTermsReader and make it a public 
> enum of the "index" package. That way CheckIndex and ReadersAndUpdates would 
> no longer import BlockTreeTermsReader.
> This would also allow other postings formats to use the same enum (e.g. 
> LUCENE-9254)
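The refactor being proposed is just relocating a shared enum so generic consumers stop depending on one postings-format class. A hedged Python sketch of the idea (names, enum values, and the attribute key are illustrative stand-ins, not Lucene's actual API):

```python
from enum import Enum

# Stand-in for FSTLoadMode, living in a shared "index" module so consumers
# like CheckIndex or ReadersAndUpdates need not import the BlockTree-specific
# terms reader just to pick a load mode.
class FSTLoadMode(Enum):
    ON_HEAP = "on_heap"    # always load the FST into heap memory
    OFF_HEAP = "off_heap"  # leave the FST on disk and read it via mmap
    AUTO = "auto"          # let the codec decide per directory/field

def choose_load_mode(attributes, key="terms.fst.load_mode",
                     default=FSTLoadMode.AUTO):
    """Resolve a load mode from segment attributes, falling back to a default.
    The attribute key here is hypothetical, not the real Lucene key."""
    raw = attributes.get(key)
    return FSTLoadMode(raw) if raw is not None else default
```

With the enum in the shared package, any postings format can accept the same setting without new cross-package dependencies, which is exactly what LUCENE-9254 would benefit from.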






[GitHub] [lucene-solr] mikemccand commented on issue #1274: LUCENE-9164: Prevent IW from closing gracefully if threads are still modifying

2020-03-03 Thread GitBox
mikemccand commented on issue #1274: LUCENE-9164: Prevent IW from closing 
gracefully if threads are still modifying
URL: https://github.com/apache/lucene-solr/pull/1274#issuecomment-593916689
 
 
   I'll have a look!





[GitHub] [lucene-solr] dweiss commented on issue #1304: LUCENE-9242: generate javadocs by calling Ant javadoc task

2020-03-03 Thread GitBox
dweiss commented on issue #1304: LUCENE-9242: generate javadocs by calling Ant 
javadoc task
URL: https://github.com/apache/lucene-solr/pull/1304#issuecomment-593907618
 
 
   Let's leave this for a follow-up issue, Tomoko. This patch is already a lot 
of work; improvements can come later.





[GitHub] [lucene-solr] noblepaul merged pull request #1309: SOLR-13942: /api/cluster/zk/* to fetch raw ZK data

2020-03-03 Thread GitBox
noblepaul merged pull request #1309: SOLR-13942: /api/cluster/zk/* to fetch raw 
ZK data 
URL: https://github.com/apache/lucene-solr/pull/1309
 
 
   




