[jira] [Comment Edited] (SOLR-14194) Allow Highlighting to work for indexes with uniqueKey that is not stored

2020-02-09 Thread Andrzej Wislowski (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033403#comment-17033403
 ] 

Andrzej Wislowski edited comment on SOLR-14194 at 2/10/20 6:58 AM:
---

[~dsmiley], I have added a method to HighlighterWithoutStoredIdTest that clears 
the system properties once the test class finishes:

{code:java}
+  @AfterClass
+  public static void afterClass() {
+    System.clearProperty("solr.tests.id.stored");
+    System.clearProperty("solr.tests.id.docValues");
+  }
{code}
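
System properties are JVM-wide, so a value set for one test class stays visible to later classes in the same JVM unless it is cleared. Below is a minimal stand-alone sketch of that set-then-clear pattern using only the JDK. The class name, the `withProperties` helper and the property values are assumptions made for illustration; only the two property names come from the comment above.

```java
class SystemPropertyCleanup {
    // Property names taken from the comment; the values passed in are assumed.
    static final String STORED = "solr.tests.id.stored";
    static final String DOC_VALUES = "solr.tests.id.docValues";

    /** Hypothetical helper: sets both properties, runs the body, then always clears them. */
    static void withProperties(String stored, String docValues, Runnable body) {
        System.setProperty(STORED, stored);
        System.setProperty(DOC_VALUES, docValues);
        try {
            body.run();
        } finally {
            // Mirrors the @AfterClass method: nothing leaks into later tests.
            System.clearProperty(STORED);
            System.clearProperty(DOC_VALUES);
        }
    }

    public static void main(String[] args) {
        withProperties("false", "true",
            () -> System.out.println("during: stored=" + System.getProperty(STORED)));
        System.out.println("after: stored=" + System.getProperty(STORED));
    }
}
```

In a JUnit test the same cleanup lives in the `@AfterClass` method, as in the patch excerpt.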



> Allow Highlighting to work for indexes with uniqueKey that is not stored
> 
>
> Key: SOLR-14194
> URL: https://issues.apache.org/jira/browse/SOLR-14194
> Project: Solr
>  Issue Type: Improvement
>  Components: highlighter
>Affects Versions: master (9.0)
>Reporter: Andrzej Wislowski
>Assignee: David Smiley
>Priority: Minor
>  Labels: highlighter
> Attachments: SOLR-14194.patch, SOLR-14194.patch, SOLR-14194.patch, 
> SOLR-14194.patch
>
>
> Highlighting requires uniqueKey to be a stored field. I have changed the 
> Highlighter to allow returning results on indexes whose uniqueKey is not a 
> stored field but is saved as docValues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14194) Allow Highlighting to work for indexes with uniqueKey that is not stored

2020-02-09 Thread Andrzej Wislowski (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033403#comment-17033403
 ] 

Andrzej Wislowski commented on SOLR-14194:
--

[~dsmiley], I have added a method to HighlighterWithoutStoredIdTest that clears 
the system properties:

 

+ @AfterClass
+ public static void afterClass() {
+ System.clearProperty("solr.tests.id.stored");
+ System.clearProperty("solr.tests.id.docValues");
+ }

 




[jira] [Commented] (SOLR-14194) Allow Highlighting to work for indexes with uniqueKey that is not stored

2020-02-09 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033371#comment-17033371
 ] 

David Smiley commented on SOLR-14194:
-

This situation illustrates the superiority of PRs to patches -- clarity of what 
changed.  Andrzej, can you please tell me what you changed and what that fixed?  
It's odd to see test failures here reported by Yetus; I didn't have a problem 
locally, but maybe I wasn't thorough.




[GitHub] [lucene-solr] dsmiley commented on issue #1191: SOLR-14197 Reduce API of SolrResourceLoader

2020-02-09 Thread GitBox
dsmiley commented on issue #1191: SOLR-14197 Reduce API of SolrResourceLoader
URL: https://github.com/apache/lucene-solr/pull/1191#issuecomment-583950052
 
 
   Perhaps the remaining larger changes relating to new classes (e.g. 
StandaloneSolrResourceLoader) should wait for a follow-on commit; there's 
plenty here already.  Maybe a few static methods could/should move elsewhere 
but this is ready for a review I think.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services




[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-02-09 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033284#comment-17033284
 ] 

Erick Erickson commented on LUCENE-9004:


Which is even more reason not to have a more-than-one-pass optimize, correct?

Which brings up an interesting end-to-end indexing speedup. Limited use-cases, 
but...

Do the original indexing with NoMergePolicy and play games with when Lucene 
flushes segments; you'd want each initial flush to be as large as possible. The 
idea here is to create, say, 5G segments (or whatever) during indexing without 
merging. Then the optimize step merges all the segments at once.

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to look up through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segment's graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration with IndexSearcher, but it does work by some measure using 
> a standalone test class. I've tested with uniform random vectors and 
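
The indexing idea in the description (search the graph for the K nearest neighbors of each new document, then link them to it in both directions) can be illustrated with a small in-memory sketch. This is not the POC's code and not a full HNSW: it keeps a single layer, always starts the search at node 0, and stores links in plain Java collections rather than doc values. Every class and method name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Illustrative single-layer neighbor graph; all names are hypothetical. */
class NeighborGraphSketch {
    private final List<float[]> vectors = new ArrayList<>();
    private final List<Set<Integer>> links = new ArrayList<>();
    private final int maxLinks; // K: how many neighbors each new node is linked to

    NeighborGraphSketch(int maxLinks) {
        this.maxLinks = maxLinks;
    }

    static double squaredDistance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Best-first search starting at node 0; returns up to topK node ids, closest first. */
    List<Integer> search(float[] query, int topK) {
        List<Integer> out = new ArrayList<>();
        if (vectors.isEmpty() || topK < 1) {
            return out;
        }
        Comparator<Integer> byDistance =
            Comparator.comparingDouble(id -> squaredDistance(vectors.get(id), query));
        PriorityQueue<Integer> candidates = new PriorityQueue<>(byDistance);         // closest first
        PriorityQueue<Integer> results = new PriorityQueue<>(byDistance.reversed()); // farthest first
        Set<Integer> visited = new HashSet<>();
        candidates.add(0);
        results.add(0);
        visited.add(0);
        while (!candidates.isEmpty()) {
            int current = candidates.poll();
            // Stop once the best remaining candidate is worse than the worst kept result.
            if (results.size() >= topK
                && byDistance.compare(current, results.peek()) > 0) {
                break;
            }
            for (int neighbor : links.get(current)) {
                if (visited.add(neighbor)) {
                    candidates.add(neighbor);
                    results.add(neighbor);
                    if (results.size() > topK) {
                        results.poll(); // drop the farthest
                    }
                }
            }
        }
        out.addAll(results);
        out.sort(byDistance);
        return out;
    }

    /** Adds a vector and links it bidirectionally to its (approximate) nearest neighbors. */
    int add(float[] vector) {
        List<Integer> neighbors = search(vector, maxLinks);
        int id = vectors.size();
        vectors.add(vector);
        links.add(new HashSet<>());
        for (int n : neighbors) {
            links.get(id).add(n);
            links.get(n).add(id);
        }
        return id;
    }

    Set<Integer> neighborsOf(int id) {
        return links.get(id);
    }

    public static void main(String[] args) {
        NeighborGraphSketch graph = new NeighborGraphSketch(2);
        for (int i = 0; i < 10; i++) {
            graph.add(new float[] {i}); // ten points on a line
        }
        System.out.println(graph.search(new float[] {6.4f}, 3));
    }
}
```

Because each insert has to update the link sets of existing nodes, the whole graph needs random access while it is built, which is the RAM constraint the description mentions.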

[GitHub] [lucene-solr] ErickErickson commented on issue #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging

2020-02-09 Thread GitBox
ErickErickson commented on issue #1169: LUCENE-9004: A minor feature and patch 
-- support deleting vector values and fix segments merging
URL: https://github.com/apache/lucene-solr/pull/1169#issuecomment-583848170
 
 
   Julie:
   
   Moving the conversation about forceMerge over from the JIRA as per Julie.
   
   I can imagine ways to shorten the merge process, but it'll still take quite 
a long time. My main concern was that I didn't know if the problem Julie was 
talking about was functional or not. So it sounds like the issue is "just" 
performance.
   
   Ways to shorten it: First, I'm assuming you're using TieredMergePolicy, 
which is the default. The forceMerge(1) option _may_ rewrite any given segment 
multiple times. There's a limit of 30 segments merged at any given time, see 
maxMergeAtOnceExplicit. So say you have 300 segments, first you'd have 10 
merges of 30 segments in the first pass, then another merge of the resulting 
segments. Each pass is a complete rewrite of the entire index. Depending on the 
number of segments, there could be more passes. That limit is mainly there so 
forceMerge doesn't consume too many resources if, say, indexing or searching 
are going on, but in your case I'd guess you don't care about that. So you 
could set it to a very large number and get it done in a single pass.
   
   I think that's about the most savings you'd get. I don't think (but haven't 
measured) that merging 150 small segments totaling 300G in a single pass is 
any slower or faster than merging 10 segments totaling 300G. If you wanted to 
try that, you could set maxMergedSegmentMB. That would simply do more merging in 
the background during indexing to produce fewer, larger segments. Like I said, 
though, I don't think this will make any difference.
   
   So my guess is that if you bump maxMergeAtOnceExplicit to a very large 
number, you'll cut your merge time in half (or a third or quarter, or... 
depending on the number of passes). It'll still take considerable time, but may 
be acceptable.
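
The pass arithmetic above can be sketched as a small calculation. This is an idealized model, not TieredMergePolicy's actual selection logic: it assumes every pass merges groups of at most `maxMergeAtOnce` segments and rewrites the whole index. The class and method names are made up for illustration.

```java
class ForceMergePasses {
    /**
     * Idealized count of full-index rewrite passes needed to get from
     * `segments` down to one segment when at most `maxMergeAtOnce`
     * segments can be merged together.
     */
    static int passes(int segments, int maxMergeAtOnce) {
        if (segments < 1 || maxMergeAtOnce < 2) {
            throw new IllegalArgumentException("need segments >= 1 and maxMergeAtOnce >= 2");
        }
        int passes = 0;
        while (segments > 1) {
            // Each pass turns every group of up to maxMergeAtOnce segments into one.
            segments = (segments + maxMergeAtOnce - 1) / maxMergeAtOnce;
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        System.out.println("300 segments, limit 30:   " + passes(300, 30) + " passes");
        System.out.println("300 segments, limit 1000: " + passes(300, 1000) + " passes");
    }
}
```

With the limit of 30 from the comment, 300 segments take two passes in this model (10 merges, then 1), while a limit above the segment count takes one, which is where the "cut your merge time in half" estimate comes from.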
   
   Best,
   Erick
   





[jira] [Commented] (SOLR-14209) Upgrade JQuery to 3.4.1

2020-02-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033204#comment-17033204
 ] 

ASF subversion and git services commented on SOLR-14209:


Commit 07de70ba62778f8298a38658ff35f035f8eb197e in lucene-solr's branch 
refs/heads/branch_8x from Mikhail Khludnev
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=07de70b ]

SOLR-14209: specify charset via String for Java 8.


> Upgrade JQuery to 3.4.1
> ---
>
> Key: SOLR-14209
> URL: https://issues.apache.org/jira/browse/SOLR-14209
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Admin UI, contrib - Velocity
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>Priority: Major
> Fix For: 8.5
>
> Attachments: Screen Shot 2020-01-23 at 3.17.07 PM.png, Screen Shot 
> 2020-01-23 at 3.28.47 PM.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently JQuery is on 2.1.3. It would be good to upgrade to the latest 
> version if possible.






[jira] [Commented] (SOLR-14209) Upgrade JQuery to 3.4.1

2020-02-09 Thread Mikhail Khludnev (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033201#comment-17033201
 ] 

Mikhail Khludnev commented on SOLR-14209:
-

[https://builds.apache.org/job/Lucene-Solr-Tests-8.x/1051/console]

8.x keeps failing. I'm going to push a fix into {{branch_8x}} that provides the 
encoding as a String. Any concerns? 
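
The thread does not say which call the fix touches, so the following is only an illustration of the general Java 8 issue. Several JDK methods gained a `Charset` overload after Java 8 (for example `ByteArrayOutputStream.toString(Charset)`, added in Java 10), so code that must compile on Java 8 has to pass the charset name as a String instead. The class and method names below are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class CharsetByName {
    /** Decodes UTF-8 bytes using only overloads that exist on Java 8. */
    static String decodeUtf8(byte[] bytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(bytes, 0, bytes.length);
        try {
            // toString(String charsetName) is available on Java 8;
            // the toString(Charset) overload only exists from Java 10 on.
            return out.toString(StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "jQuery 3.4.1".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(bytes));
    }
}
```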




[jira] [Commented] (SOLR-14247) IndexSizeTriggerMixedBoundsTest does a lot of sleeping

2020-02-09 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033197#comment-17033197
 ] 

Erick Erickson commented on SOLR-14247:
---

No failures beasting it 1,000 times. This makes me a little nervous, in that 
those sleeps were put in for a reason. That said, if we run into problems, we 
can go back in and put in a timer and check the condition every, say, 100 ms 
rather than stick a sleep in and hope it's long enough. I usually put a timer 
in that has an unreasonably long upper bound (say 30 seconds) on the theory that I 
don't care if the test takes a long time in the (hopefully) rare failure cases.
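
A minimal sketch of that poll-with-timeout idea, using only the JDK. It is not a Solr test utility; the class name, the `waitFor` helper and its parameters are assumptions made for illustration.

```java
import java.util.function.BooleanSupplier;

class PollingWait {
    /**
     * Checks the condition every intervalMs until it holds or timeoutMs has
     * elapsed. Returns true as soon as the condition holds, false on timeout
     * or if the waiting thread is interrupted.
     */
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false;
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(intervalMs, remainingMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Poll every 100 ms with a generous 30 s upper bound, as suggested above.
        boolean met = waitFor(() -> System.currentTimeMillis() - start >= 250, 30_000, 100);
        System.out.println("condition met: " + met);
    }
}
```

The passing case returns as soon as the condition holds, so the long upper bound only costs time when the test is going to fail anyway.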

> IndexSizeTriggerMixedBoundsTest does a lot of sleeping
> --
>
> Key: SOLR-14247
> URL: https://issues.apache.org/jira/browse/SOLR-14247
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Reporter: Mike Drob
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I run tests locally, the slowest reported test is always 
> IndexSizeTriggerMixedBoundsTest  coming in at around 2 minutes.
> I took a look at the code and discovered that at least 80s of that is all 
> sleeps!
> There might need to be more synchronization and ordering added back in, but 
> when I removed all of the sleeps the test still passed locally for me, so I'm 
> not too sure what the point was or why we were slowing the system down so 
> much.






[jira] [Commented] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2020-02-09 Thread Munendra S N (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033150#comment-17033150
 ] 

Munendra S N commented on SOLR-11725:
-

 [^SOLR-11725.patch] 
This patch contains an upgrade entry and test cases for singleton and zero-size 
samples.
[~ysee...@gmail.com]
Thank you for noticing the NaN change for the singleton case. I had missed changing the 
base case. The latest patch replicates StatsComponent behavior.

> json.facet's stddev() function should be changed to use the "Corrected sample 
> stddev" formula
> -
>
> Key: SOLR-11725
> URL: https://issues.apache.org/jira/browse/SOLR-11725
> Project: Solr
>  Issue Type: Sub-task
>  Components: Facet Module
>Reporter: Chris M. Hostetter
>Priority: Major
> Attachments: SOLR-11725.patch, SOLR-11725.patch, SOLR-11725.patch
>
>
> While working on some equivalence tests/demonstrations for 
> {{facet.pivot+stats.field}} vs {{json.facet}} I noticed that the {{stddev}} 
> calculations done between the two code paths can be measurably different, and 
> realized this is due to them using very different code...
> * {{json.facet=foo:stddev(foo)}}
> ** {{StddevAgg.java}}
> ** {{Math.sqrt((sumSq/count)-Math.pow(sum/count, 2))}}
> * {{stats.field=\{!stddev=true\}foo}}
> ** {{StatsValuesFactory.java}}
> ** {{Math.sqrt(((count * sumOfSquares) - (sum * sum)) / (count * (count - 
> 1.0D)))}}
> Since I'm not really a math guy, I consulted with a bunch of smart math/stat 
> nerds I know online to help me sanity check if these equations (somehow) 
> reduced to each other (in which case the discrepancies I was seeing in my 
> results might have just been due to the order of intermediate operation 
> execution & floating point rounding differences).
> They confirmed that the two bits of code are _not_ equivalent to each other, 
> and explained that the code JSON Faceting is using is equivalent to the 
> "Uncorrected sample stddev" formula, while StatsComponent's code is 
> equivalent to the "Corrected sample stddev" formula...
> https://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
> When I told them that stuff like this is why no one likes mathematicians and 
> pressed them to explain which one was the "most canonical" (or "most 
> generally applicable" or "best") definition of stddev, I was told that:
> # This is something statisticians frequently disagree on
> # Practically speaking the two calculations don't tend to 
> differ significantly when count is "very large"
> # _"Corrected sample stddev" is more appropriate when comparing two 
> distributions_
> Given that:
> * the primary usage of computing the stddev of a field/function against a 
> Solr result set (or against a sub-set of results defined by a facet 
> constraint) is probably to compare that distribution to a different Solr 
> result set (or to compare N sub-sets of results defined by N facet 
> constraints)
> * the size of the sets of documents (values) can be relatively small when 
> computing stats over facet constraint sub-sets
> ...it seems like {{StddevAgg.java}} should be updated to use the "Corrected 
> sample stddev" equation.
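
To make the difference concrete, the two expressions quoted in the description can be run side by side. The sketch below copies only the arithmetic; the class, the method names and the sample data are illustrative and are not the Solr classes themselves. It also shows why the singleton case needs special handling: with count = 1 the corrected formula divides zero by zero.

```java
class StddevFormulas {
    /** "Uncorrected sample stddev", as in the StddevAgg expression quoted above. */
    static double uncorrected(double sum, double sumSq, long count) {
        return Math.sqrt((sumSq / count) - Math.pow(sum / count, 2));
    }

    /** "Corrected sample stddev", as in the StatsValuesFactory expression quoted above. */
    static double corrected(double sum, double sumSq, long count) {
        return Math.sqrt(((count * sumSq) - (sum * sum)) / (count * (count - 1.0D)));
    }

    public static void main(String[] args) {
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9}; // made-up sample data
        double sum = 0;
        double sumSq = 0;
        for (double v : values) {
            sum += v;
            sumSq += v * v;
        }
        System.out.println("uncorrected: " + uncorrected(sum, sumSq, values.length));
        System.out.println("corrected:   " + corrected(sum, sumSq, values.length));
        // A single value: numerator and denominator are both 0, so the raw
        // corrected formula yields NaN unless the base case is handled separately.
        System.out.println("singleton:   " + corrected(5, 25, 1));
    }
}
```

For these eight values the uncorrected result is 2.0 and the corrected one is about 2.138; the gap shrinks as count grows, consistent with the second point in the list above.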






[jira] [Updated] (SOLR-11725) json.facet's stddev() function should be changed to use the "Corrected sample stddev" formula

2020-02-09 Thread Munendra S N (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Munendra S N updated SOLR-11725:

Attachment: SOLR-11725.patch
