Re: Performance issue with Solr 8.6.1 Unified Highlighter does not occur on Solr 6.

2021-02-01 Thread Kerwin
 Hi David,

Thanks for filing this issue. The classic non-weightMatches mode works well
for us right now. Yes, we are using the POSTINGS offset source for most of the
fields, although setting it explicitly gives an error since not all fields are
indexed with offsets, so I guess the highlighter is picking the right source
for each field. Here is the test with hl.offsetSource=ANALYSIS and
hl.weightMatches=false that you requested.

hl.offsetSource=ANALYSIS&hl.weightMatches=false (340 ms)

The above is thus better than the original highlighter. I'll also try and
create that PR soon.
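
For reference, a minimal SolrJ sketch of the parameter combination benchmarked
above; the collection name and query are placeholders, not our actual setup:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightTimingCheck {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("field1:test");    // placeholder query
      q.setHighlight(true);
      q.set("hl.method", "unified");
      q.set("hl.offsetSource", "ANALYSIS");  // re-analyze text rather than rely on postings
      q.set("hl.weightMatches", false);      // classic mode, the faster one in our tests
      QueryResponse rsp = solr.query(q);
      System.out.println("QTime=" + rsp.getQTime() + " ms");
    }
  }
}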


Re: Performance issue with Solr 8.6.1 Unified Highlighter does not occur on Solr 6.

2021-01-28 Thread Kerwin
On another note, since response time is in question: ever since Solr 6 I have
been using a custom highlighter that overrides the encodeSnippets() method of
the UnifiedSolrHighlighter class, because Solr sends back a blank array
(ZERO_LEN_STR_ARRAY) in the response payload for every field that does not
match. Here is the original code:
if (snippet == null) {
  //TODO reuse logic of DefaultSolrHighlighter.alternateField
  summary.add(field, ZERO_LEN_STR_ARRAY);
} 

So I had removed this clause and made the following change:

if (snippet != null) {
  // we used a special snippet separator char and we can now split on it.
  summary.add(field, snippet.split(SNIPPET_SEPARATOR));
}

This has not changed in Solr 8 either, and with 76 fields it produces a very
large payload, so I will keep this custom code for now.
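
Putting the pieces together, here is a minimal sketch of the resulting custom
class. It assumes the protected encodeSnippets(String[], String[],
Map<String,String[]>) signature from the Solr 8 UnifiedSolrHighlighter and
that SNIPPET_SEPARATOR is visible to subclasses; the class name is just an
example, not what we actually use.

import java.util.Map;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.SimpleOrderedMap;
import org.apache.solr.highlight.UnifiedSolrHighlighter;

public class CompactUnifiedSolrHighlighter extends UnifiedSolrHighlighter {

  @Override
  protected NamedList<Object> encodeSnippets(String[] keys, String[] fieldNames,
                                             Map<String, String[]> snippets) {
    NamedList<Object> list = new SimpleOrderedMap<>();
    for (int i = 0; i < keys.length; i++) {
      NamedList<Object> summary = new SimpleOrderedMap<>();
      for (String field : fieldNames) {
        String snippet = snippets.get(field)[i];
        if (snippet != null) {
          // the unified highlighter joins passages with a separator char; split it back out
          summary.add(field, snippet.split(SNIPPET_SEPARATOR));
        }
        // else: skip the field entirely instead of adding a zero-length array
      }
      list.add(keys[i], summary);
    }
    return list;
  }
}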



Re: Performance issue with Solr 8.6.1 Unified Highlighter does not occur on Solr 6.

2021-01-28 Thread Kerwin
Hi David,

Thanks so much for your reply.
hl.weightMatches was indeed the culprit. After setting it to false, I am
now getting the same sub-second response as on Solr 6. I am using Solr 8.6.1.

Here are the tests I carried out:
hl.requireFieldMatch=true&hl.weightMatches=true   (2458 ms)
hl.requireFieldMatch=false&hl.weightMatches=true  (3964 ms)
hl.requireFieldMatch=true&hl.weightMatches=false  (158 ms)
hl.requireFieldMatch=false&hl.weightMatches=false (169 ms) (CHOSEN, since this
is consistent with our earlier setting).

Thanks again. I will also tell our other teams doing the Solr upgrade to check
the CHANGES.txt entries related to this.


Performance issue with Solr 8.6.1 Unified Highlighter does not occur on Solr 6.

2021-01-26 Thread Kerwin
Hi,

While upgrading from Solr 6 to Solr 8, the unified highlighter has developed a
performance problem, going from approximately 100 ms to more than 4 seconds
with 76 fields in the hl.q and hl.fl parameters. I played with different
options and found that the hl.q parameter needs to contain just one field for
the performance issue to vanish. I do not know why this should be so. Could
you check whether this is a bug or something else? This is not the case with
the original highlighter, which has the same performance of ~1.5 seconds on
both Solr 6 and Solr 8. The highlighting payload is also mostly the same in
all cases.

Prior Solr 8 configuration, with bad performance of > 4 sec:
hl.q  = {!edismax qf="field1 field2 ..field76" v=$qq}
hl.fl = field1 field2 ..field76

Solr 8 configuration with the original Solr 6 performance of ~100 ms:
hl.q  = {!edismax qf="field1" v=$qq}
hl.fl = field1 field2 ..field76

Other highlighting parameters
true
unified
200
WORD
en
10

If I remove the hl.q parameter altogether, the response time shoots up to 6-7
seconds, I suspect because our actual user query is quite large, with more
fields, and is more complicated.
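
To make the comparison concrete, here is a hedged SolrJ sketch of the requests
described above; the collection name and user query are placeholders, and the
field lists are elided exactly as in my examples:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class HighlightQueryVariants {
  public static void main(String[] args) throws Exception {
    String userQuery = "some user query";   // placeholder
    try (HttpSolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("{!edismax qf=\"field1 field2 ..field76\" v=$qq}");
      q.set("qq", userQuery);
      q.setHighlight(true);
      q.set("hl.method", "unified");
      q.set("hl.fl", "field1,field2,..field76");

      // > 4 sec: hl.q repeats all 76 fields
      q.set("hl.q", "{!edismax qf=\"field1 field2 ..field76\" v=$qq}");
      // ~100 ms: hl.q restricted to a single field
      // q.set("hl.q", "{!edismax qf=\"field1\" v=$qq}");
      // 6-7 sec: no hl.q at all, so the larger main query is used for highlighting

      System.out.println(solr.query(q).getQTime() + " ms");
    }
  }
}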


Different Edismax Behavior with user params vs Solr config params on Solr 8.

2021-01-19 Thread Kerwin
Hi,
I am upgrading from Solr 6.5.1 to Solr 8.6.1 and have noticed a change in the
edismax parser behavior which is affecting our search results. When operators
are present in the user's search query, the Solr 6 behavior was to take the mm
parameter from the user query string, defaulting to 0% when not present. In
Solr 8, however, it takes the mm parameter specified in the Solr request
handler config, which is 100% (see the config below).

Hence the user query "samsung OR nokia" works in Solr 6 but no longer works in
Solr 8 when documents do not contain both terms. This is affecting our search
results. Could you suggest why there is this difference and how to resolve it?

A simple example config is as follows:


   
 
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">{!edismax qf=manu mm=100% v=$qq}</str>
  </lst>
</requestHandler>
  



The user query is:
http://localhost:8983/solr/search?qq=spring OR boot.

I'd appreciate any pointers that could resolve this. One solution is to create
two search handlers and check beforehand whether there are operators in the
user query, but I'd prefer that as a last resort.
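
For what it's worth, a hedged sketch of a client-side variant of that
workaround: decide mm before building the request instead of maintaining two
handlers. The collection name is a placeholder and the operator check is
deliberately naive:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class MmSelectingClient {
  public static void main(String[] args) throws Exception {
    String userQuery = "samsung OR nokia";

    // crude operator detection; a real version would handle phrases, +/- prefixes, etc.
    boolean hasOperators = userQuery.matches(".*\\b(AND|OR|NOT)\\b.*");
    String mm = hasOperators ? "0%" : "100%";

    try (HttpSolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build()) {
      SolrQuery q = new SolrQuery("{!edismax qf=manu mm=" + mm + " v=$qq}");
      q.set("qq", userQuery);
      System.out.println(solr.query(q).getResults().getNumFound() + " hits with mm=" + mm);
    }
  }
}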


Re: Issues upgrading from Solr 6.5.1 to Solr 8.6.1

2021-01-18 Thread Kerwin
I further checked that the BM25Similarity class up to Solr 7.7 has a null
check for norms in the explainTFNorm method, but this is removed from Solr 8
onwards. Does omitNorms work in Solr 8? Can someone send me what the debug
output looks like with omitNorms="true"?
Here is my config:




Issues upgrading from Solr 6.5.1 to Solr 8.6.1

2021-01-18 Thread Kerwin
Hi everybody,

I am migrating from Solr 6.5.1 to Solr 8.6.1 and am having a couple of issues
with which I need your help. There is a significant change in ranking between
the Solr 6 and Solr 8 search results which I need to fix before using Solr 8
in our live environment. I noticed a couple of changes upfront which could be
among the reasons for the ranking changes.

1. omitNorms is not working as expected in Solr 8 with BM25SimilarityFactory.
2. LegacyBM25SimilarityFactory: the 'qf' parameter boost value is not correct
when using edismax.

I tried the Solr examples with the following configuration and can
replicate the difference on Solr 8.6.1.

*Schema being used:*
<field name="manu" ... omitNorms="true"/>

*Solr query:*
http://localhost:8983/solr/solr/select?q=manu:Samsung&debugQuery=true&wt=json&indent=on

*Solr 6 debug output (Note, 0.0 = parameter b (norms omitted for field))*
 "SP2514N":"
2.6390574 = weight(manu:samsung in 1) [SchemaSimilarity], result of:
  2.6390574 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
2.6390574 = idf, computed as log(1 + (docCount - docFreq + 0.5) /
(docFreq + 0.5)) from:
  1.0 = docFreq
  20.0 = docCount
1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:
  1.0 = termFreq=1.0
  1.2 = parameter k1
  *0.0 = parameter b (norms omitted for field)*
"}

*Solr 8 debug output*
"SP2514N":"
1.5827883 = weight(manu:samsung in 1) [SchemaSimilarity], result of:
  1.5827883 = score(freq=1.0), computed as boost * idf * tf from:
2.6390574 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
  1 = n, number of documents containing term
  20 = N, total number of documents with field
0.59975517 = tf, computed as freq / (freq + k1 * (1 - b + b * dl /
avgdl)) from:
  1.0 = freq, occurrences of term within document
  1.2 = k1, term saturation parameter
  *0.75 = b, length normalization parameter
  1.0 = dl, length of field
  2.45 = avgdl, average length of field*
"}

As you can see above, length normalization is not used in Solr 6 (b is
reported as 0.0 because norms are omitted for the field), which is correct,
while it is being used in Solr 8: the Solr 8 tf of 0.59975517 is exactly
1 / (1 + 1.2 * (1 - 0.75 + 0.75 * 1.0 / 2.45)), i.e. b = 0.75 and the field
length are applied even though the field has omitNorms="true". I tried to
replicate this with LegacyBM25SimilarityFactory as well and see the same issue
there. Secondly, LegacyBM25SimilarityFactory behaves differently with the
*'qf' boost* value for fields when using the edismax parser, which I am also
using.

Request handler with Edismax:


explicit
json
off
10
edismax
manu
100%
false



Debug output:
"SP2514N":"
3.4821343 = weight(manu:samsung in 1) [LegacyBM25Similarity], result of:
  3.4821343 = score(freq=1.0), computed as boost * idf * tf from:
*2.2 = boost*
2.6390574 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
  1 = n, number of documents containing term
  20 = N, total number of documents with field
0.59975517 = tf, computed as freq / (freq + k1 * (1 - b + b * dl /
avgdl)) from:
  1.0 = freq, occurrences of term within document
  1.2 = k1, term saturation parameter
  0.75 = b, length normalization parameter
  1.0 = dl, length of field
  2.45 = avgdl, average length of field
"}

On checking the Solr source code, this value of 2.2 = boost is the query boost
multiplied by (1 + k1), i.e. 1.0 * (1 + 1.2) = 2.2, as per the code below:

return bm25Similarity.scorer(boost * (1 + bm25Similarity.getK1()),
collectionStats, termStats);

Since LegacyBM25Similarity is supposed to keep the same scoring as the Solr 6
BM25Similarity but is not doing so here, I cannot isolate the changes in
scoring. Kindly help to resolve the above two issues. I could be doing
something wrong with the configuration, but I have read the Solr 7 and Solr 8
migration notes, so I am not sure where I'm going wrong. Kindly advise.


Re: Boost differences in two environments for same query and config

2012-04-13 Thread Kerwin
Hi Erick,

Thanks for your suggestions.
I did an optimize on the remote installation, this time with the same number
of documents, but I still see the same issue, as shown in the debug output
below:

9.950362E-4 = (MATCH) sum of:
9.950362E-4 = (MATCH) weight(RECORD_TYPE:info in 35916), product of:
9.950362E-4 = queryWeight(RECORD_TYPE:info), product of:
1.0 = idf(docFreq=58891, maxDocs=8181811)
9.950362E-4 = queryNorm
1.0 = (MATCH) fieldWeight(RECORD_TYPE:info in 35916), product 
of:
1.0 = tf(termFreq(RECORD_TYPE:info)=1)
1.0 = idf(docFreq=58891, maxDocs=8181811)
1.0 = fieldNorm(field=RECORD_TYPE, doc=35916)
0.0 = (MATCH) product of:
1.0945399 = (MATCH) sum of:
0.99503624 = (MATCH) weight(CD:ee123^1000.0 in 35916), 
product of:
0.99503624 = queryWeight(CD:ee123^1000.0), 
product of:
1000.0 = boost
1.0 = idf(docFreq=1, maxDocs=8181811)
9.950362E-4 = queryNorm
1.0 = (MATCH) fieldWeight(CD:ee123 in 35916), 
product of:
1.0 = tf(termFreq(CD:ee123)=1)
1.0 = idf(docFreq=1, maxDocs=8181811)
1.0 = fieldNorm(field=CD, doc=35916)
0.09950362 = (MATCH)
ConstantScoreQuery(QueryWrapperFilter(CD:ee123 CD:ee123c CD:ee123c.
CD:ee123dc CD:ee123e CD:ee123e. CD:ee123en CD:ee123fx CD:ee123g
CD:ee123g.1 CD:ee123g1 CD:ee123ee123 CD:ee123l.1 CD:ee123l1 CD:ee123ll
CD:ee123lr CD:ee123m.z CD:ee123mg CD:ee123mz CD:ee123na CD:ee123nx
CD:ee123ol CD:ee123op CD:ee123p CD:ee123p.1 CD:ee123p1 CD:ee123pn
CD:ee123r.1 CD:ee123r1 CD:ee123s CD:ee123s.z CD:ee123sm CD:ee123sn
CD:ee123sp CD:ee123ss CD:ee123sz)), product of:
100.0 = boost
9.950362E-4 = queryNorm
0.0 = coord(2/3)


So I copied the conf folder from the remote server and replaced my local conf
folder with it, to check whether the indexes were being built differently, but
my local installation continues to work. I would have expected to see the same
behaviour as on the remote installation, but that did not happen. (The only
difference is that the remote installation uses cores while my local
installation has no cores.)
Anything else I could try?
Thanks for your help.
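
For reference, a hedged sketch of the comparison itself: the same query with
debugQuery=true against each installation, diffing the explain output. Base
URLs and core name are placeholders, and the SolrJ client shown is a
present-day one, used purely for illustration:

import java.util.Arrays;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ExplainCompare {
  public static void main(String[] args) throws Exception {
    String query = "RECORD_TYPE:info AND (NAME:ee123* OR CD:ee123^1000 OR CD:ee123*^100)";
    for (String baseUrl : Arrays.asList(
        "http://localhost:8983/solr/core0",       // local Windows installation
        "http://remote-host:8983/solr/core0")) {  // remote Unix installation
      try (HttpSolrClient solr = new HttpSolrClient.Builder(baseUrl).build()) {
        SolrQuery q = new SolrQuery(query);
        q.set("debugQuery", true);
        q.setFields("id", "score");
        System.out.println(baseUrl + "\n" + solr.query(q).getDebugMap().get("explain"));
      }
    }
  }
}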

On 4/11/12, Erick Erickson erickerick...@gmail.com wrote:
 Well, you're matching a different number of records, so I have to assume
 your indexes are different on the two machines.

 Here is one case where doing an optimize might make sense: it will purge
 the data associated with any deleted records from the index, which should
 make comparisons better.

 Additionally, you have to ensure that your request handler is identical
 on both. Have you made any changes to solrconfig.xml?

 About the coord (2/3), I'm pretty clueless. But also ensure that your
 parsed query is identical on both, which is an additional check on
 whether you've changed something on one server and not the
 other.

 Best
 Erick

 On Wed, Apr 11, 2012 at 8:19 AM, Kerwin kerwin...@gmail.com wrote:
 Hi All,

 I am firing the following Solr query against installations in two
 environments: one on my local Windows machine and the other on Unix
 (remote).

 RECORD_TYPE:info AND (NAME:ee123* OR CD:ee123^1000 OR CD:ee123*^100)

 There are no differences in the DataImportHandler configuration,
 schema, and solrconfig between these installations.
 The correct, expected result is given by the local installation of Solr,
 which also gives scores as expected for the boosts.

 CORRECT/Expected:
 Debug query output for local installation:

 10.822258 = (MATCH) sum of:
0.002170282 = (MATCH) weight(RECORD_TYPE:info in 35916), product
 of:
3.65739E-4 = queryWeight(RECORD_TYPE:info), product of:
5.933964 = idf(docFreq=58891, maxDocs=8181811)
6.1634855E-5 = queryNorm
5.933964 = (MATCH) fieldWeight(RECORD_TYPE:info in 35916),
 product of:
1.0 = tf(termFreq(RECORD_TYPE:info)=1)
5.933964 = idf(docFreq=58891, maxDocs=8181811)
1.0 = fieldNorm(field=RECORD_TYPE, doc=35916)
10.820087 = (MATCH) product of:
16.230131 = (MATCH) sum of:
16.223969 = (MATCH) weight(CD:ee123^1000.0 in
 35916), product of:
0.81 = queryWeight(CD:ee123^1000.0),
 product of:
1000.0 = boost
16.224277 = idf(docFreq=1,
 maxDocs=8181811

Re: advice on creating a solr index when data source is from many unrelated db tables

2010-08-01 Thread Kerwin
Hi,

This is something that I am working on too. I have been trying to combine
results from three different tables while avoiding the usual SQL union
clauses.
One thing I have tried to do is watch out for common fields, for example
first name and last name, that could be present in all tables, and then map
those similar fields from the different tables, in data-config.xml, to the
same two schema fields for names. I too have realised that using a record
type field is a good idea for filtering the results, and it can perhaps make
the search faster by filtering on type.
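
To illustrate the record-type idea, a hedged sketch follows; the field names,
core name and values are made up (not from my actual schema), and the SolrJ
client is a present-day one used purely for illustration:

import java.util.Arrays;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RecordTypeExample {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr/people").build()) {
      // rows from two different tables mapped onto the same schema fields,
      // distinguished only by a record_type field
      SolrInputDocument user = new SolrInputDocument();
      user.addField("id", "users-1");
      user.addField("record_type", "user");
      user.addField("first_name", "Jane");
      user.addField("last_name", "Doe");

      SolrInputDocument client = new SolrInputDocument();
      client.addField("id", "clients-7");
      client.addField("record_type", "client");
      client.addField("first_name", "John");
      client.addField("last_name", "Smith");

      solr.add(Arrays.asList(user, client));
      solr.commit();

      // restrict a search to one record type with a filter query
      SolrQuery q = new SolrQuery("last_name:Doe");
      q.addFilterQuery("record_type:user");
      System.out.println(solr.query(q).getResults().getNumFound() + " match(es)");
    }
  }
}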
On Fri, Jul 30, 2010 at 9:58 AM, S Ahmed sahmed1...@gmail.com wrote:

 So I have tables like this:

 Users
 UserSales
 UserHistory
 UserAddresses
 UserNotes
 ClientAddress
 CalenderEvent
 Articles
 Blogs

 Just seems odd to me, jamming all these tables into a single index.  But I
 guess the idea of using a 'type' field to qualify exactly what I am
 searching for is a good idea, in case I need to filter for only 'articles'
 or blogs or contacts etc.

 But there might be 50 fields if I do this no?



 On Fri, Jul 30, 2010 at 4:01 AM, Chantal Ackermann 
 chantal.ackerm...@btelligent.de wrote:

  Hi Ahmed,
 
  fields that are empty do not impact the index. It's different from a
  database.
  I have text fields for different languages and per document there is
  always only one of the languages set (the text fields for the other
  languages are empty/not set). It works all very well and fast.
 
  I wonder more about what you describe as unrelated data - why would
  you want to put unrelated data into a single index? If you want to
  search on all the data and return mixed results there surely must be
  some kind of relation between the documents?
 
  Chantal
 
  On Thu, 2010-07-29 at 21:33 +0200, S Ahmed wrote:
   I understand (and it's straightforward) when you want to create an index
   for something simple like Products.
  
   But how do you go about creating a Solr index when you have data coming
  from
   10-15 database tables, and the tables have unrelated data?
  
   The issue is then you would have many 'columns' in your index, and they
  will
   be NULL for much of the data since you are trying to shove 15 db tables
  into
   a single Solr/Lucene index.
  
  
   This must be a common problem, what are the potential solutions?
 
 
 
 



Issue Indexing zip file content in Solr 1.4

2009-11-21 Thread Kerwin
Hi,

 Has anyone faced this issue? If yes, why is Tika 0.4 bundled with Solr 1.4?
It should instead be Tika 0.5.

Problem:
I have a zip file containing multiple files of different formats.
I am trying to index the zip file content with Solr 1.4, but the AutoDetect
parser context is not being passed in the current 1.4 distribution of the
ExtractingDocumentLoader, so I am unable to index the zip file content since
an empty parser is created. After indexing the file, only the package entries
are shown as content.
I replaced the Tika 0.4 that comes with the Solr 1.4 distribution with Tika
0.5, along with some other POI jars, and this seems to work: the context is
now being passed and the delegate parser is able to delegate to the correct
parser.

In Tika 0.4 the AutoDetectParser does not create the context, but in Tika 0.5
it creates the context before calling the parse method.
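
For anyone hitting the same thing, a hedged sketch of the mechanism as I
understand it, using a current Tika API (the file name is a placeholder):
registering the AutoDetectParser itself in the ParseContext is what lets the
entries inside the zip be delegated to the proper parsers instead of an empty
parser.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class ZipExtract {
  public static void main(String[] args) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);   // delegate embedded documents back to the auto-detect parser

    BodyContentHandler text = new BodyContentHandler(-1);   // no write limit
    try (InputStream in = Files.newInputStream(Paths.get("sample.zip"))) {
      parser.parse(in, text, new Metadata(), context);
    }
    System.out.println(text);   // content of the embedded files, not just the package entry names
  }
}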

Am I missing something? Please advise.


Re: Indexing multiple documents in Solr/SolrCell

2009-11-17 Thread Kerwin
Hi Sascha,

Thanks for your reply.
Our approach is similar to what you mentioned in the JIRA issue, except that
we have all the metadata in the XML and not in the database. I am therefore
using a custom XmlUpdateRequestHandler to parse the XML and then calling Tika
from within the XML loader to parse the content. So far this seems to work.
When and in which Solr version do you expect the JIRA issue to be addressed?


On Mon, Nov 16, 2009 at 5:02 PM, Sascha Szott sz...@zib.de wrote:

 Hi,

 the problem you've described -- an integration of DataImportHandler (to
 traverse the XML file and get the document urls) and Solr Cell (to extract
 content afterwards) -- is already addressed in issue SOLR-1358 (
 https://issues.apache.org/jira/browse/SOLR-1358).

 Best,
 Sascha





Indexing multiple documents in Solr/SolrCell

2009-11-16 Thread Kerwin
Hi,

I am new to this forum and would like to know whether the functionality
described below has been developed or already exists in Solr. If it does not
exist, is it a good idea, and can I contribute it?

We need to index multiple documents with different formats. So we use Solr
with Tika (Solr Cell).

Question:
Can you index both metadata and content for multiple documents iteratively in
Solr?
For example, I have an XML file with metadata and links to the documents'
content. There are many documents in this XML and I would like to index them
all without firing multiple URLs.

Example of XML
<add>
  <doc>
    <field name="id">34122</field>
    <field name="author">Michael</field>
    <field name="size">3MB</field>
    <field name="URL">URL of the document</field>
  </doc>
  <!-- doc 2 ... doc N in the same form -->
</add>

I need to index all these documents by sending this XML in a single request.
The collection of documents to be indexed could be on a file system.

I have altered the Solr code to be able to do this but is there an already
existing feature?
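
For completeness, here is a hedged sketch of the same idea done entirely
client-side with SolrJ and Tika rather than inside Solr (this is not my
modified Solr code; the URL, field names, file path and metadata values are
placeholders, and the clients shown are present-day ones used purely for
illustration): read each metadata entry, extract the linked file's text with
Tika, and send everything in one batch.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class MetadataPlusContentIndexer {
  public static void main(String[] args) throws Exception {
    // in the real setup this metadata would come from the XML file described above
    String[][] docs = { { "34122", "Michael", "3MB", "/data/34122.pdf" } };

    AutoDetectParser tika = new AutoDetectParser();
    try (HttpSolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build()) {
      for (String[] meta : docs) {
        SolrInputDocument d = new SolrInputDocument();
        d.addField("id", meta[0]);
        d.addField("author", meta[1]);
        d.addField("size", meta[2]);

        // pull the text of the linked file with Tika
        BodyContentHandler text = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(Paths.get(meta[3]))) {
          tika.parse(in, text, new Metadata(), new ParseContext());
        }
        d.addField("content", text.toString());
        solr.add(d);
      }
      solr.commit();   // one commit for the whole batch
    }
  }
}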