Re: StandardTokenizerFactory doesn't split on underscore

2021-01-09 Thread Adam Walz
It is expected that the StandardTokenizer will not break on underscores.
The StandardTokenizer follows the Unicode UAX #29
<https://unicode.org/reports/tr29/#Word_Boundaries> standard, which
classifies the underscore as "ExtendNumLet", and rule WB13a
<https://unicode.org/reports/tr29/#WB13a> says not to break before such
characters. This is why xiefengchang was suggesting a
PatternReplaceFilterFactory after the StandardTokenizer in order to further
split on underscores if that is your use case.
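
To illustrate the behavior described above, here is a rough sketch in Python.
It is not the Lucene StandardTokenizer and does not implement UAX #29; it only
relies on the fact that the regex class `\w` also treats the underscore as a
word character, which mimics the effect in question. The function names
`approx_standard_tokenize` and `split_on_underscore` are made up for this
example.

```python
import re

def approx_standard_tokenize(text):
    # Assumption: a crude stand-in for UAX #29 word breaking.
    # \w matches letters, digits and "_", so "_" never ends a token,
    # while "-" and whitespace do.
    return re.findall(r"\w+", text)

def split_on_underscore(tokens):
    # Hypothetical post-processing step: break each token on "_"
    # and drop the empty pieces.
    result = []
    for token in tokens:
        result.extend(part for part in token.split("_") if part)
    return result

hyphenated = approx_standard_tokenize("hello-world")
underscored = approx_standard_tokenize("hello_world")
resplit = split_on_underscore(underscored)
```

With this sketch, 'hello-world' yields two tokens and 'hello_world' stays a
single token until the extra splitting step is applied. In Solr the equivalent
step would have to be configured in the field type's analysis chain.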

On Sat, Jan 9, 2021 at 2:58 PM Rahul Goswami wrote:

> Nope. The underscore is preserved right after tokenization even before it
> reaches any filters. You can choose the type "text_general" and try an
> index time analysis through the "Analysis" page on Solr Admin UI.
>
> Thanks,
> Rahul
>
> On Sat, Jan 9, 2021 at 8:22 AM xiefengchang 
> wrote:
>
> > Did you configure PatternReplaceFilterFactory?
> >
> > At 2021-01-08 12:16:06, "Rahul Goswami"  wrote:
> > >Hello,
> > >So recently I was debugging a problem on Solr 7.7.2 where the query
> > >wasn't returning the desired results. It turned out that the indexed
> > >terms were underscore-separated, but the query terms weren't. I was
> > >under the impression that terms separated by underscores are also
> > >tokenized by StandardTokenizerFactory, but it turns out that's not the
> > >case. E.g., 'hello-world' would be tokenized into 'hello' and 'world',
> > >but 'hello_world' is treated as a single token.
> > >Is this a bug or designed behavior?
> > >
> > >If this is by design, it would be helpful if this behavior were included
> > >in the documentation, since it is similar to the behavior with periods.
> > >
> > >https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer
> > >"Periods (dots) that are not followed by whitespace are kept as part of
> > >the token, including Internet domain names."
> > >
> > >Thanks,
> > >Rahul
> >
>


-- 
Adam Walz


Re: Need some help on solr versions (LTS vs stable)

2019-11-13 Thread Adam Walz
I believe the LTS idea comes from the Solr downloads page, where 7.7.x is
designated as LTS: https://lucene.apache.org/solr/downloads.html

On Wed, Nov 13, 2019 at 9:41 AM Shawn Heisey wrote:

> On 11/6/2019 9:58 AM, suyog joshi wrote:
> > So we can say it's better to go with the latest stable version (8.x)
> > instead of 7.x, which is LTS right now but can soon become EOL after
> > the launch of 9.x sometime early next year.
>
> I don't know where you got the idea that 7.x is LTS ... but I do not
> think that is correct.  I don't think we have a version that could be
> called LTS, at least not the way I have seen the term used.
>
> It's true that 7.x currently is in a state where it is unlikely to have
> its feature list changed, which could be seen as stability.  But chances
> are that if you DO run into a bug with a 7.x version, the fix for that
> problem will probably only make it into the current stable branch, so
> you'd be upgrading to at least an 8.x version in order to obtain the fix.
>
> Changing to an LTS model would mean changes to the way development is
> done on the project.  Change is always scary.  I've asked on the dev
> list about this.
>
> Thanks,
> Shawn
>


-- 
Adam Walz


Re: Solr 7.0.1 Duplicate document appearing in search results

2019-05-14 Thread Adam Walz
Thanks Erick,

We've never merged indexes. We don't use the MapReduceIndexerTool, but do
use an external map reduce process to reindex. To reindex from an empty
state we have a map reduce job which runs on a separate HBase cluster and
indexes into this shard. During this job each mapper is concurrently making
http update requests to the shard, but only 1 mapper should post a document
per unique "id".

Reindexing from scratch is done roughly every 3 months. In between that
time we have a worker external to solr which reads from an event stream and
posts http updates to the solr cluster.

The <uniqueKey> has never been updated to my knowledge, but if it has, it
definitely wasn't updated in the last 3 months since the last reindexing.

Also since the last reindexing nothing in the solrconfig.xml or
managed-schema has been updated, nor has the index been manipulated outside
of the solr framework.

On Tue, May 14, 2019 at 5:24 PM Erick Erickson 
wrote:

> This is indeed strange. First of all, forget about explanations that
> involve the transaction log etc. When Lucene opens a searcher, it is only
> for closed segments, the tlog has nothing to do with that.
>
> Have you ever merged indexes? The MapReduceIndexerTool, if you ever used
> it, does not de-duplicate. Ditto if you ever changed the <uniqueKey>. The
> fact that you say that this clears up when you re-index the document leads
> me to wonder whether you have manipulated the index outside the normal Solr
> framework.
>
> IOW, I’ve never seen this before, so I suspect there’s something you did
> in your setup that seemed innocent at the time that led to this
> (temporary) situation.
>
> Best,
> Erick
>
> > On May 14, 2019, at 5:43 PM, Adam Walz  wrote:
> >
> > In my solr schema I have set a uniqueKey of "id" where the id field is a
> > solr.StrField. When querying with this field as a filter I would expect
> > to always get 1 or 0 documents as a result. However I am getting back
> > multiple documents with the same "id" field, but different internal
> > `docid`s. This problem is intermittent and seems to resolve itself when
> > the document is updated. This is happening on solr 7.0.1 without
> > SolrCloud and while only querying a single shard without routing.
> >
> > Any thoughts on what could be causing this behavior? This is a very large
> > single shard with 300 million documents and an index size of 750GB. I
> > know that is not recommended for a single shard, but could it explain
> > these duplicate results possibly because of the time it takes to commit,
> > merge, or something with tlogs?
> >
> > -- Query --
> > http://solr:8983/solr/filesearch/select?fl=id,[docid],score&fq=id:file_382506116&q=*:*
> >
> > -- Response --
> >
> > {
> >   "responseHeader":{
> >     "status":0,
> >     "QTime":0,
> >     "params":{
> >       "mm":" 1<-0% ",
> >       "q.alt":"*:*",
> >       "ps":"100",
> >       "echoParams":"all",
> >       "fl":"id,[docid],score",
> >       "fq":"id:file_413041895994",
> >       "sort":"score desc",
> >       "rows":"35",
> >       "version":"2.2",
> >       "q":"*:*",
> >       "tie":"0.01",
> >       "defType":"edismax",
> >       "qf":"id name_combined^10 name_zh-cn^10 name_shingle name_shingle_zh-cn name_token^60 description file_content_en file_content_fr file_content_de file_content_it file_content_es file_content_zh-cn user_name user_email comments tags",
> >       "pf":"description name_shingle^100 name_shingle_zh-cn^100 comments tags",
> >       "wt":"json",
> >       "debugQuery":"off"}},
> >   "response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[
> >       {
> >         "id":"file_382506116",
> >         "[docid]":346266675,
> >         "score":1.0},
> >       {
> >         "id":"file_382506116",
> >         "[docid]":170442733,
> >         "score":1.0}]
> >   }}
> >
> >
> > -- Schema snippet --
> > 
> >   > required="true"/>
> > 
> > <uniqueKey>id</uniqueKey>
> >
> > --
> > Adam Walz
>


Solr 7.0.1 Duplicate document appearing in search results

2019-05-14 Thread Adam Walz
In my solr schema I have set a uniqueKey of "id" where the id field is a
solr.StrField. When querying with this field as a filter I would expect to
always get 1 or 0 documents as a result. However I am getting back multiple
documents with the same "id" field, but different internal `docid`s. This
problem is intermittent and seems to resolve itself when the document is
updated. This is happening on solr 7.0.1 without SolrCloud and while only
querying a single shard without routing.

Any thoughts on what could be causing this behavior? This is a very large
single shard with 300 million documents and an index size of 750GB. I know
that is not recommended for a single shard, but could it explain these
duplicate results possibly because of the time it takes to commit, merge,
or something with tlogs?

-- Query --
http://solr:8983/solr/filesearch/select?fl=id,[docid],score&fq=id:file_382506116&q=*:*

-- Response --

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "mm":" 1<-0% ",
      "q.alt":"*:*",
      "ps":"100",
      "echoParams":"all",
      "fl":"id,[docid],score",
      "fq":"id:file_413041895994",
      "sort":"score desc",
      "rows":"35",
      "version":"2.2",
      "q":"*:*",
      "tie":"0.01",
      "defType":"edismax",
      "qf":"id name_combined^10 name_zh-cn^10 name_shingle name_shingle_zh-cn name_token^60 description file_content_en file_content_fr file_content_de file_content_it file_content_es file_content_zh-cn user_name user_email comments tags",
      "pf":"description name_shingle^100 name_shingle_zh-cn^100 comments tags",
      "wt":"json",
      "debugQuery":"off"}},
  "response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[
      {
        "id":"file_382506116",
        "[docid]":346266675,
        "score":1.0},
      {
        "id":"file_382506116",
        "[docid]":170442733,
        "score":1.0}]
  }}
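
For anyone who wants to check a response like this for repeated ids in a
script, a minimal sketch follows. It assumes the JSON response shape shown in
this message (a "response" object with a "docs" list, each doc carrying "id"
and "[docid]"); the helper name `find_duplicate_ids` and the trimmed `sample`
string are made up for illustration.

```python
import json
from collections import defaultdict

def find_duplicate_ids(response_text):
    """Return {id: [docid, ...]} for every id that appears more than once."""
    docs = json.loads(response_text)["response"]["docs"]
    docids_by_id = defaultdict(list)
    for doc in docs:
        # "[docid]" is the internal Lucene doc id requested via fl=[docid]
        docids_by_id[doc["id"]].append(doc["[docid]"])
    return {key: ids for key, ids in docids_by_id.items() if len(ids) > 1}

# Trimmed-down copy of the response, reduced to the fields the helper reads.
sample = """
{"response": {"numFound": 2, "start": 0, "maxScore": 1.0, "docs": [
  {"id": "file_382506116", "[docid]": 346266675, "score": 1.0},
  {"id": "file_382506116", "[docid]": 170442733, "score": 1.0}]}}
"""

duplicates = find_duplicate_ids(sample)
```

With a uniqueKey in place, the returned dict would normally be expected to be
empty; any entry in it corresponds to the situation described in this message.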


-- Schema snippet --

  

<uniqueKey>id</uniqueKey>

-- 
Adam Walz