Solr Suggest Component with weight expression returns no suggestions

2018-08-03 Thread Buckler, Christine
I am having difficulty getting Solr's Suggest Component to work with a weight 
expression. I have tried to match the format of the example in the 
documentation (see related code from schema.xml and solrconfig.xml below) but 
no results are found when I request suggestions from this dictionary and there 
are no logging errors or exceptions. Am I missing something or do I have an 
incorrect parameter?
schema:


solrconfig:


  SuggesterX
  DocumentExpressionDictionaryFactory
  FuzzyLookupFactory
  product_name
  (weight_one + weight_two)
  weight_one
  weight_two
  text_suggest



  
true
5
SuggesterX

  
  
suggest
  



Christine



FreeTextSuggester exception: need at least one suggestion

2018-08-03 Thread Buckler, Christine
Hi fellow Solr Suggesters,

I am getting an exception when building the suggester index for: 
FreeTextSuggester…

java.lang.IllegalArgumentException: need at least one suggestion
at 
org.apache.lucene.search.suggest.analyzing.FreeTextSuggester.build(FreeTextSuggester.java:300)
at 
org.apache.lucene.search.suggest.analyzing.FreeTextSuggester.build(FreeTextSuggester.java:247)
at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:190)
at org.apache.solr.spelling.suggest.SolrSuggester.build(SolrSuggester.java:181)
at 
org.apache.solr.handler.component.SuggestComponent$SuggesterListener.buildSuggesterIndex(SuggestComponent.java:529)
at 
org.apache.solr.handler.component.SuggestComponent$SuggesterListener.newSearcher(SuggestComponent.java:511)
at org.apache.solr.core.SolrCore.lambda$getSearcher$17(SolrCore.java:2275)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.base/java.lang.Thread.run(Thread.java:844)

What does "need at least one suggestion" mean? I have ngrams=2; does this mean 
that document fields must contain at least 2 words/tokens? I have tried 
testing only with fields of >2 tokens and still get the exception.

Christine



Solr timeAllowed metric

2018-08-03 Thread Wei
Hi,

We tried to use Solr's timeAllowed parameter to restrict the time spent on
expensive queries. But as described at

https://lucene.apache.org/solr/guide/6_6/common-query-parameters.html#CommonQueryParameters-ThetimeAllowedParameter

"This value is only checked at the time of Query Expansion and Document
collection". Does that mean Solr will not abort the request if
timeAllowed is exceeded during the scoring process? For which components
(query, facet, stats, debug, etc.) is this parameter actually enforced?

Thanks,
Wei


Re: Master recovery in ReplicationHandler

2018-08-03 Thread Chuong Thao
Hi Shawn, thank you for replying. I'm following this:
https://lucene.apache.org/solr/guide/7_3/making-and-restoring-backups.html#backup-restore-storage-repositories
I am planning to keep 5 copies in a local directory. How do I specify that 
directory in, perhaps, solrconfig.xml?

Charles
On Aug 2 2018, at 1:56 pm, Shawn Heisey  wrote:
>
> On 8/1/2018 10:05 AM, Chuong Thao wrote:
> > I am looking to deploy Solr 7.3 in containers with replication handler. Is 
> > there a way to recover the docs on master from the slave if the master is 
> > suddenly killed?
>
>
> Replication in a master-slave setup only goes from master to slave. It
> cannot go from slave to master.
>
> "Recovery" is a SolrCloud concept. SolrCloud does use the replication
> handler to accomplish recovery, but in a SolrCloud setup, the
> replication handler has no explicit configuration. When a recovery is
> required, SolrCloud configures the replication handler on the fly and
> initiates a one-time replication. Historically, SolrCloud did not use
> the replication handler for normal index synchronization. In 7.x
> versions, new replica types exist that DO use the replication handler
> ... but it's configured on the fly in the same way that index recovery is.
>
> To do what you want to do, I see two options:
> 1) Copy index directories from the slave to the master before you start
> the master.
> 2) Reconfigure your systems so that the slave becomes the master and the
> master becomes a slave, then restart the processes.
>
> Either way, it's a manual process. This is not likely to change. If
> you want to have more automation, switch to SolrCloud. Because
> SolrCloud sets up a true cluster, there are no masters and no slaves. I
> would recommend SolrCloud for most new installations, especially one
> where servers might be added or removed frequently.
>
> Thanks,
> Shawn
>



Re: Problem with fuzzy search and accentuation

2018-08-03 Thread Erick Erickson
Stemming is getting in the way here. You could probably use copyField
to a field that doesn't stem and fuzzy search against that field
rather than the stemmed one.
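
To illustrate why the stemmed field defeats the fuzzy query: the debug output in this thread suggests the index holds the stem "administr", while the fuzzy term "administração~2" is not stemmed and has to be within 2 edits of an indexed term. A minimal Python sketch, assuming plain Levenshtein distance is a fair stand-in for the distance the fuzzy query uses (the helper edit_distance is hypothetical, not a Solr or Lucene API):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

MAX_EDITS = 2  # the "~2" shown in the parsed query

# (fuzzy query term, term assumed to be in the stemmed index)
for query, indexed in [("administração", "administr"), ("tribubal", "tribunal")]:
    d = edit_distance(query, indexed)
    verdict = "match" if d <= MAX_EDITS else "no match"
    print(f"{query} vs {indexed}: distance {d}, {verdict}")
```

Under that assumption the accented word is too far from its own stem to match, while the misspelled "tribubal" is one edit from "tribunal", which is left unchanged by the stemmer according to the debug output. An unstemmed copy of the field would keep the full word in the index.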

Best,
Erick

On Fri, Aug 3, 2018 at 11:31 AM, Monique Monteiro
 wrote:
> By adding debug=true, I get the following:
>
>
>- administração (correct result):
>
> "debug":{
> "rawquerystring":"administração",
> "querystring":"administração",
> "parsedquery":"text:administr",
> "parsedquery_toString":"text:administr",
> "QParser":"LuceneQParser"}}
>
>
>- administração~ (incorrect behaviour, no results):
>
> "debug":{
> "rawquerystring":"administração~",
> "querystring":"administração~",
> "parsedquery":"text:administração~2",
> "parsedquery_toString":"text:administração~2",
> "QParser":"LuceneQParser"}}
>
>
>- tribunal (correct result):
>
> "debug":{
> "rawquerystring":"tribunal",
> "querystring":"tribunal",
> "parsedquery":"text:tribunal",
> "parsedquery_toString":"text:tribunal",
> "QParser":"LuceneQParser"}}
>
>
>- tribubal~ (correct result, no accents):
>
>  "debug":{
> "rawquerystring":"tribubal~",
> "querystring":"tribubal~",
> "parsedquery":"text:tribubal~2",
> "parsedquery_toString":"text:tribubal~2",
> "QParser":"LuceneQParser"}}
>
> On Fri, Aug 3, 2018 at 3:26 PM Erick Erickson 
> wrote:
>
>> What does adding debug=query show you the parsed query is in the two
>> cases?
>>
>> My guess is that accent folding is kicking in one case but not the
>> other, but that's
>> a blind guess.
>>
>>
>>
>> On Fri, Aug 3, 2018 at 11:19 AM, Monique Monteiro
>>  wrote:
>> > Hi all,
>> >
>> > I'm having a problem when I search for a word with some non-ASCII
>> > characters in combination with fuzzy search.
>> >
>> > For example, if I type 'administração' or 'contratação' (both words end
>> > with 'ção'), the search results are returned correctly.  However, if I
>> type
>> > 'administração~', no result is returned.  For other terms, I haven't
>> found
>> > any problem.
>> >
>> > My Solr version is  6.6.3.
>> >
>> > Has anyone any idea about what may cause this issue?
>> >
>> > Thanks in advance.
>> >
>> > --
>> > Monique Monteiro
>> > Twitter: http://twitter.com/monilouise
>>
>
>
> --
> Monique Monteiro
> Twitter: http://twitter.com/monilouise


Re: Problem with fuzzy search and accentuation

2018-08-03 Thread Monique Monteiro
By adding debug=true, I get the following:


   - administração (correct result):

"debug":{
"rawquerystring":"administração",
"querystring":"administração",
"parsedquery":"text:administr",
"parsedquery_toString":"text:administr",
"QParser":"LuceneQParser"}}


   - administração~ (incorrect behaviour, no results):

"debug":{
"rawquerystring":"administração~",
"querystring":"administração~",
"parsedquery":"text:administração~2",
"parsedquery_toString":"text:administração~2",
"QParser":"LuceneQParser"}}


   - tribunal (correct result):

"debug":{
"rawquerystring":"tribunal",
"querystring":"tribunal",
"parsedquery":"text:tribunal",
"parsedquery_toString":"text:tribunal",
"QParser":"LuceneQParser"}}


   - tribubal~ (correct result, no accents):

 "debug":{
"rawquerystring":"tribubal~",
"querystring":"tribubal~",
"parsedquery":"text:tribubal~2",
"parsedquery_toString":"text:tribubal~2",
"QParser":"LuceneQParser"}}

On Fri, Aug 3, 2018 at 3:26 PM Erick Erickson 
wrote:

> What does adding debug=query show you the parsed query is in the two
> cases?
>
> My guess is that accent folding is kicking in one case but not the
> other, but that's
> a blind guess.
>
>
>
> On Fri, Aug 3, 2018 at 11:19 AM, Monique Monteiro
>  wrote:
> > Hi all,
> >
> > I'm having a problem when I search for a word with some non-ASCII
> > characters in combination with fuzzy search.
> >
> > For example, if I type 'administração' or 'contratação' (both words end
> > with 'ção'), the search results are returned correctly.  However, if I
> type
> > 'administração~', no result is returned.  For other terms, I haven't
> found
> > any problem.
> >
> > My Solr version is  6.6.3.
> >
> > Has anyone any idea about what may cause this issue?
> >
> > Thanks in advance.
> >
> > --
> > Monique Monteiro
> > Twitter: http://twitter.com/monilouise
>


-- 
Monique Monteiro
Twitter: http://twitter.com/monilouise


Re: Problem with fuzzy search and accentuation

2018-08-03 Thread Erick Erickson
What does adding debug=query show you the parsed query is in the two cases?

My guess is that accent folding is kicking in one case but not the
other, but that's
a blind guess.



On Fri, Aug 3, 2018 at 11:19 AM, Monique Monteiro
 wrote:
> Hi all,
>
> I'm having a problem when I search for a word with some non-ASCII
> characters in combination with fuzzy search.
>
> For example, if I type 'administração' or 'contratação' (both words end
> with 'ção'), the search results are returned correctly.  However, if I type
> 'administração~', no result is returned.  For other terms, I haven't found
> any problem.
>
> My Solr version is  6.6.3.
>
> Has anyone any idea about what may cause this issue?
>
> Thanks in advance.
>
> --
> Monique Monteiro
> Twitter: http://twitter.com/monilouise


Problem with fuzzy search and accentuation

2018-08-03 Thread Monique Monteiro
Hi all,

I'm having a problem when I search for a word with some non-ASCII
characters in combination with fuzzy search.

For example, if I type 'administração' or 'contratação' (both words end
with 'ção'), the search results are returned correctly.  However, if I type
'administração~', no result is returned.  For other terms, I haven't found
any problem.

My Solr version is  6.6.3.

Has anyone any idea about what may cause this issue?

Thanks in advance.

-- 
Monique Monteiro
Twitter: http://twitter.com/monilouise


Re: SolrCloud CDCR issue

2018-08-03 Thread cdatta
Any pointers?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Schema Change for Solr 7.4

2018-08-03 Thread Shawn Heisey
On 8/3/2018 9:09 AM, Joe Lerner wrote:
> We recently set up Solr 7.4 in Production. There are 2 Solr nodes, with 3
> zookeepers. We need to make a schema change. What I want to do is simply
> push the updated schema to Solr, and then re-index all the content to pick
> up the change. But I am being told that I need to:
>
> 1.Delete the collection that depends on this config-set.
> 2.Reload the config-set
> 3.Recreate the dependent collection
>
> It seems to me that between steps #1 and #3, users will not be able to
> search, which is not cool.

Here's a procedure that should work for most situations:

1. Upload a new configset to ZooKeeper.
2. Create a new collection using the new configset.
3. Index data into the new collection.
4. Set up an alias with the original collection name, pointing at the
new collection.
5. When you're sure it's good, delete the old collection.

Step 4 redirects requests to the original collection name so they end up
on the new collection.
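
Steps 2 and 4 of the procedure above can be sketched as Collections API URLs. This is only a sketch under assumptions: the base URL, the collection, alias, and configset names, and the shard and replica counts are all hypothetical. Step 1 (uploading the configset), step 3 (indexing), and step 5 (the destructive delete) are deliberately left out.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed base URL

def collections_url(params: dict) -> str:
    """Build a Collections API URL from a parameter dict."""
    return f"{SOLR}/admin/collections?{urlencode(params)}"

# Step 2: create a new collection that uses the new configset.
create_url = collections_url({
    "action": "CREATE",
    "name": "products_v2",                         # hypothetical new collection
    "numShards": "2",                              # assumed layout
    "replicationFactor": "2",                      # assumed layout
    "collection.configName": "products_conf_v2",   # hypothetical configset
})

# Step 4: point an alias carrying the name clients already use
# at the new collection.
alias_url = collections_url({
    "action": "CREATEALIAS",
    "name": "products",            # hypothetical name used by clients
    "collections": "products_v2",
})

print(create_url)
print(alias_url)
```

Printing the URLs rather than sending them keeps the sketch safe to run anywhere; each one can then be issued by hand once the names have been checked.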



Whether you need to delete the data before reindexing into the same
collection depends on the precise nature of the change to your schema. 
Some schema changes require not only deleting all data in the
collection, but actually deleting the entire index directory in every
shard replica to remove all traces of the old data.  Can you give
precise details about what change you are planning to the schema?

If you can be absolutely sure that there are no commits happening with
openSearcher set to true and your schema change is safe for the existing
index, then you can use the following procedure.  Note that if anything
goes wrong or the wrong kind of commit occurs during this, your users
will be searching incomplete data:

1. Change the schema.
2. Reload the collection.
3. Delete all documents.
4. Index your data.
5. Issue a commit to make the changes visible.
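
As a rough sketch of steps 2, 3 and 5 of this in-place procedure, the requests could be prepared like this. The base URL and collection name are assumptions; step 1 (changing the schema) and step 4 (indexing) depend on your own tooling and are not shown.

```python
import json
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed base URL
COLLECTION = "products"              # hypothetical collection name

def reload_url(collection: str) -> str:
    """Collections API URL that reloads the collection (step 2)."""
    query = urlencode({"action": "RELOAD", "name": collection})
    return f"{SOLR}/admin/collections?{query}"

def update_request(collection: str, command: dict) -> tuple:
    """URL and JSON body for a command sent to the update handler."""
    return (f"{SOLR}/{collection}/update", json.dumps(command))

plan = [
    ("reload", reload_url(COLLECTION), None),
    # Step 3: delete everything, with no commit, so searchers keep the old view.
    ("delete all", *update_request(COLLECTION, {"delete": {"query": "*:*"}})),
    # Step 5: a single explicit commit once reindexing has finished.
    ("commit", *update_request(COLLECTION, {"commit": {}})),
]

for label, url, body in plan:
    print(label, url, body)
```

The sketch only prints the plan. The caveat from the message still applies: any other commit that opens a new searcher between the delete and the final commit would expose incomplete data.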

Thanks,
Shawn



Re: Schema Change for Solr 7.4

2018-08-03 Thread Walter Underwood
For an in-place migration:

1. Add new fields to the schema.
2. Reindex to populate those fields.
3. Change queries to use those fields and stop using old fields.
4. Stop sending data to old fields, reindex.
5. Remove old fields from the schema.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 3, 2018, at 8:48 AM, Christopher Schultz 
>  wrote:
> 
> Joe,
> 
> On 8/3/18 11:44 AM, Joe Lerner wrote:
>> OK--yes, I can see how that would work. But it would require some quick
>> infrastructure flexibility that, at least to this point, we don't really
>> have.
> 
> The only thing that needs swapping is the URL that your application uses
> to connect to Solr, so you don't need anything terribly complicated to
> proxy it.
> 
> Something like Squid would work, and you'd only have a few seconds of
> downtime to set it up initially, and then another few seconds to swap later.
> 
> Heck, you can even remove the proxy after you are all done. It doesn't
> have to be a permanent fixture in your infrastructure.
> 
> -chris
> 



Solr Relevance Engineer Training, Sept 25 & 26

2018-08-03 Thread Doug Turnbull
Hey everyone,

Many may know me, in the words of Will Hayes, as "Mr. Relevance". I'm the
author of the book Relevant Search, and a prolific blogger about all things
Solr relevance.

I want to share that I'll be running a Solr 'Think Like a Relevance Engineer'
course Sept 25 and 26.
It's a great way to learn to build intelligent, relevant Solr search
experiences with topics ranging from measuring search quality, to
taxonomy-based semantic search, to introducing learning to rank.

It's also a chance to just hang out with our team (and maybe if we can
convince some Lucidworks folks ;) ) in Charlottesville in beautiful autumn
near the Blue Ridge mountains.

We had a successful training course mixing open source Solr and
Elasticsearch in early July. Based on feedback, we decided to focus the
content of our course to go deep on one search engine. Some pics here :)
https://twitter.com/softwaredoug/status/1017438551464701953

(if you'd like this course in Elasticsearch, please let me know and we may
schedule it with enough interest)

Hope to see you there
-Doug
-- 
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug


Re: Schema Change for Solr 7.4

2018-08-03 Thread Christopher Schultz
Joe,

On 8/3/18 11:44 AM, Joe Lerner wrote:
> OK--yes, I can see how that would work. But it would require some quick
> infrastructure flexibility that, at least to this point, we don't really
> have.

The only thing that needs swapping is the URL that your application uses
to connect to Solr, so you don't need anything terribly complicated to
proxy it.

Something like Squid would work, and you'd only have a few seconds of
downtime to set it up initially, and then another few seconds to swap later.

Heck, you can even remove the proxy after you are all done. It doesn't
have to be a permanent fixture in your infrastructure.

-chris





Re: Schema Change for Solr 7.4

2018-08-03 Thread Joe Lerner
OK--yes, I can see how that would work. But it would require some quick
infrastructure flexibility that, at least to this point, we don't really
have.

Joe





Re: Schema Change for Solr 7.4

2018-08-03 Thread Christopher Schultz
Joe,

On 8/3/18 11:09 AM, Joe Lerner wrote:
> We recently set up Solr 7.4 in Production. There are 2 Solr nodes, with 3
> zookeepers. We need to make a schema change. What I want to do is simply
> push the updated schema to Solr, and then re-index all the content to pick
> up the change. But I am being told that I need to:
> 
> 1.Delete the collection that depends on this config-set.
> 2.Reload the config-set
> 3.Recreate the dependent collection
> 
> It seems to me that between steps #1 and #3, users will not be able to
> search, which is not cool.
> 
> Can I avoid the outage to my search capability?

I dunno about how to do any online-updates like this, but you could
always instead:

0. place a proxy between your application and Solr
1. stand-up a new service
2. load the config-set
3. create the collection
4. load all the data from source
5. swap the service at the proxy to the newly-created service

-chris





Schema Change for Solr 7.4

2018-08-03 Thread Joe Lerner
We recently set up Solr 7.4 in Production. There are 2 Solr nodes, with 3
zookeepers. We need to make a schema change. What I want to do is simply
push the updated schema to Solr, and then re-index all the content to pick
up the change. But I am being told that I need to:

1.  Delete the collection that depends on this config-set.
2.  Reload the config-set
3.  Recreate the dependent collection

It seems to me that between steps #1 and #3, users will not be able to
search, which is not cool.

Can I avoid the outage to my search capability?

Thanks!

Joe





numdocs is different on replicas in a shard

2018-08-03 Thread Webster Homer
This morning I was told that there was something screwy with one of our
collections.
This collection has 2 shards and 2 replicas per shard. Each replica has a
different value for numDocs!
Datacenter #1
shard1_replica1   1513053
shard1_replica2   1512653
shard2_replica1   1512296
shard2_replica2   1512487

We have 2 copies of this collection that are populated via cdcr from a
common collection. Both copies show the same thing. They run in different
datacenters in the cloud
Datacenter #2
shard1_replica1   1513054
shard1_replica2   1512903
shard2_replica1   1512452
shard2_replica2   1512487

We are running Solr 7.2.0 in SolrCloud mode.
This collection is populated by CDCR, auto commits are enabled.

I don't see any errors in the logs. I manually sent a commit to the
collection, the above numbers are after the commit.
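
To make the size of the discrepancy concrete, here is a small Python sketch over the numbers quoted above. It assumes each flattened value reads as a replica name followed by a seven-digit numDocs count; the dictionary layout and the helper spread are purely illustrative.

```python
# numDocs per replica, as reported in this message (replica1, replica2).
num_docs = {
    "Datacenter #1": {"shard1": [1513053, 1512653], "shard2": [1512296, 1512487]},
    "Datacenter #2": {"shard1": [1513054, 1512903], "shard2": [1512452, 1512487]},
}

def spread(counts):
    """Difference between the largest and smallest numDocs within one shard."""
    return max(counts) - min(counts)

for datacenter, shards in num_docs.items():
    for shard, counts in shards.items():
        print(f"{datacenter} {shard}: replicas differ by {spread(counts)} docs")
```

Replicas of the same shard would be expected to converge on the same count after a commit, so any non-zero spread here is the symptom being asked about.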

The source collection has only 2 replicas per shard
shard1_replica1   1513054
shard2_replica1   1512487


What could cause this? How can I address it? How do we prevent it from
happening again?



AW: indexing two words, searching single word

2018-08-03 Thread Clemens Wyss DEV
+1 ;)

-Original Message-
From: Susheel Kumar
Sent: Friday, 3 August 2018 14:40
To: solr-user@lucene.apache.org
Subject: Re: indexing two words, searching single word

and as you suggested, use stop word before shingles...

On Fri, Aug 3, 2018 at 8:10 AM, Clemens Wyss DEV 
wrote:

> 
>   
>   
>outputUnigrams="true" tokenSeparator=""/>  
> 
>
> seems to "work"
>
> -Original Message-
> From: Clemens Wyss DEV
> Sent: Friday, 3 August 2018 13:46
> To: solr-user@lucene.apache.org
> Subject: AW: indexing two words, searching single word
>
> > Because you probably are not looking for "andthe" kind of tokens
> (Unfortunately) I guess I am, as we don't know what people enter...
>
> > a shingle plus regex to remove whitespace
> sounds interesting. What would that filter chain look like? Would that
> be a type="index" analyzer?
> I guess we could shingle after stop-word filtering, and I guess
> maxShingleSize="2" would suffice.
>
> -Original Message-
> From: Alexandre Rafalovitch
> Sent: Friday, 3 August 2018 13:33
> To: solr-user
> Subject: Re: indexing two words, searching single word
>
> But what is your generic problem then. Because you probably are not 
> looking for "andthe" kind of tokens.
>
> However a shingle plus regex to remove whitespace can give you "anytwo 
> wordstogether smooshed" tokens in the index.
>
> Regards,
>  Alex
>
>
> On Fri, Aug 3, 2018, 7:19 AM Clemens Wyss DEV, 
> wrote:
>
> > Hi Markus,
> > thanks for the quick answer.
> >
> > "sound stage" was just an example. We are looking for a generic 
> > solution ...
> >
> > Is it "ok" to apply an NGRamFilter for query-analyzing?
> > 
> > 
> > 
> >  > maxGramSize="15" />
> > 
> >
> > I guess (besides the performance impact) this reduces search results 
> > accuracy?
> >
> > -Clemens
> >
> > -Original Message-
> > From: Markus Jelsma
> > Sent: Friday, 3 August 2018 12:43
> > To: solr-user@lucene.apache.org
> > Subject: RE: indexing two words, searching single word
> >
> > Hello,
> >
> > If your case is English you could use synonyms to work around the 
> > problem of the few compound words of the language. However, would 
> > you be dealing with a Germanic compound language, the 
> > HyphenationCompoundWordTokenFilter
> > [1] or DictionaryCompoundWordTokenFilter are a better choice. The 
> > former is much more flexible but has its drawbacks.
> >
> > Regards,
> > Markus
> >
> >
> > https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
> >
> >
> >
> > -Original message-
> > > From:Clemens Wyss DEV 
> > > Sent: Friday 3rd August 2018 12:22
> > > To: solr-user@lucene.apache.org
> > > Subject: indexing two words, searching single word
> > >
> > > Sounds like a rather simple issue:
> > > if I index "sound stage" and search for "soundstage" I get no hits
> > >
> > > What am I doing wrong
> > > a) when indexing
> > > b) when searching
> > > ?
> > >
> > > Thx in advance
> > > - Clemens
> > >
> >
>


Re: Support multiple language tokens in same field

2018-08-03 Thread Shawn Heisey

On 8/3/2018 1:10 AM, Nitesh Kumar wrote:

> As I discussed above, in some special cases we have a situation where
> the values of these fields (field1, field2, etc.) can be in CJK. That
> means field1 and field2 store plain English text or CJK text. Hence, if
> we choose StandardTokenizer, indexing and querying work fine for plain
> English text, whereas for CJK text they do not work appropriately.


We have one index where fields can contain both English and CJK.  The 
customer is in Japan.  I designed it to work properly with all CJK 
characters, not just Japanese.


This is the fieldType I came up with after a LOT of research.  Most of 
the information that was useful came from a series of blog posts:


https://apaste.info/Vfwf

I used a paste website because line wrapping within an email would have 
made it difficult to copy.  The paste expires in one month.


This analysis chain uses the ICU classes that are included as a contrib 
module with Solr, as well as one custom jar:


https://github.com/sul-dlss/CJKFoldingFilter/blob/master/src/edu/stanford/lucene/analysis/CJKFoldingFilterFactory.java

The blog posts I used to create my schema can be found here:

http://discovery-grindstone.blogspot.com/2014/

Some people might find the ICUFoldingFilterFactory too aggressive.  If 
so, replace it with ASCIIFoldingFilterFactory and 
ICUNormalizer2FilterFactory.  This is what we're actually using -- the 
customer didn't want the kinds of matches that the ICU class allowed.


Using edismax with an unusual value for the "mm" parameter might solve 
some of your other issues.  This is discussed in parts 8 and 12 of the 
blog series.


I have one note for you about your analysis chain.  I notice you have a 
filter listed before the tokenizer.  Solr will always apply the 
tokenizer first -- the ASCIIFoldingFilterFactory that you have listed 
first is in fact being run second.  Solr will always run CharFilter 
entries first, then the tokenizer, then Filter entries.
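
The ordering rule can be pictured with a few lines of Python. This is only a model of the rule as stated above, not how Solr implements it, and the LowerCaseFilterFactory entry is a hypothetical addition for illustration.

```python
# Stage in which each kind of analysis component runs.
STAGE = {"charFilter": 0, "tokenizer": 1, "filter": 2}

def execution_order(declared):
    """Reorder components into charFilters, tokenizer, then filters.

    sorted() is stable, so components of the same kind keep the
    relative order in which they were declared.
    """
    return sorted(declared, key=lambda component: STAGE[component[0]])

# A chain declared with a filter listed before the tokenizer.
declared = [
    ("filter", "ASCIIFoldingFilterFactory"),
    ("tokenizer", "StandardTokenizerFactory"),
    ("filter", "LowerCaseFilterFactory"),  # hypothetical extra filter
]

for kind, name in execution_order(declared):
    print(kind, name)
```

In this model the tokenizer comes out first and the folding filter second, which matches the behaviour described for the chain in question.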


Thanks,
Shawn



Re: indexing two words, searching single word

2018-08-03 Thread Susheel Kumar
and as you suggested, use stop word before shingles...

On Fri, Aug 3, 2018 at 8:10 AM, Clemens Wyss DEV 
wrote:

> 
>   
>   
>outputUnigrams="true" tokenSeparator=""/> 
> 
>
> seems to "work"
>
> -Original Message-
> From: Clemens Wyss DEV
> Sent: Friday, 3 August 2018 13:46
> To: solr-user@lucene.apache.org
> Subject: AW: indexing two words, searching single word
>
> > Because you probably are not looking for "andthe" kind of tokens
> (Unfortunately) I guess I am, as we don't know what people enter...
>
> > a shingle plus regex to remove whitespace
> sounds interesting. What would that filter chain look like? Would that be
> a type="index" analyzer?
> I guess we could shingle after stop-word filtering, and I guess
> maxShingleSize="2" would suffice.
>
> -Original Message-
> From: Alexandre Rafalovitch
> Sent: Friday, 3 August 2018 13:33
> To: solr-user
> Subject: Re: indexing two words, searching single word
>
> But what is your generic problem then. Because you probably are not
> looking for "andthe" kind of tokens.
>
> However a shingle plus regex to remove whitespace can give you "anytwo
> wordstogether smooshed" tokens in the index.
>
> Regards,
>  Alex
>
>
> On Fri, Aug 3, 2018, 7:19 AM Clemens Wyss DEV, 
> wrote:
>
> > Hi Markus,
> > thanks for the quick answer.
> >
> > "sound stage" was just an example. We are looking for a generic
> > solution ...
> >
> > Is it "ok" to apply an NGRamFilter for query-analyzing?
> > 
> > 
> > 
> >  > maxGramSize="15" />
> > 
> >
> > I guess (besides the performance impact) this reduces search results
> > accuracy?
> >
> > -Clemens
> >
> > -Original Message-
> > From: Markus Jelsma
> > Sent: Friday, 3 August 2018 12:43
> > To: solr-user@lucene.apache.org
> > Subject: RE: indexing two words, searching single word
> >
> > Hello,
> >
> > If your case is English you could use synonyms to work around the
> > problem of the few compound words of the language. However, would you
> > be dealing with a Germanic compound language, the
> > HyphenationCompoundWordTokenFilter
> > [1] or DictionaryCompoundWordTokenFilter are a better choice. The
> > former is much more flexible but has its drawbacks.
> >
> > Regards,
> > Markus
> >
> >
> > https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
> >
> >
> >
> > -Original message-
> > > From:Clemens Wyss DEV 
> > > Sent: Friday 3rd August 2018 12:22
> > > To: solr-user@lucene.apache.org
> > > Subject: indexing two words, searching single word
> > >
> > > Sounds like a rather simple issue:
> > > if I index "sound stage" and search for "soundstage" I get no hits
> > >
> > > What am I doing wrong
> > > a) when indexing
> > > b) when searching
> > > ?
> > >
> > > Thx in advance
> > > - Clemens
> > >
> >
>


AW: indexing two words, searching single word

2018-08-03 Thread Clemens Wyss DEV

  
  
   


seems to "work"

-Original Message-
From: Clemens Wyss DEV
Sent: Friday, 3 August 2018 13:46
To: solr-user@lucene.apache.org
Subject: AW: indexing two words, searching single word

> Because you probably are not looking for "andthe" kind of tokens
(Unfortunately) I guess I am, as we don't know what people enter...

> a shingle plus regex to remove whitespace
sounds interesting. What would that filter chain look like? Would that be a
type="index" analyzer?
I guess we could shingle after stop-word filtering, and I guess
maxShingleSize="2" would suffice.

-Original Message-
From: Alexandre Rafalovitch
Sent: Friday, 3 August 2018 13:33
To: solr-user
Subject: Re: indexing two words, searching single word

But what is your generic problem then. Because you probably are not looking for 
"andthe" kind of tokens.

However a shingle plus regex to remove whitespace can give you "anytwo 
wordstogether smooshed" tokens in the index.

Regards,
 Alex


On Fri, Aug 3, 2018, 7:19 AM Clemens Wyss DEV,  wrote:

> Hi Markus,
> thanks for the quick answer.
>
> "sound stage" was just an example. We are looking for a generic 
> solution ...
>
> Is it "ok" to apply an NGRamFilter for query-analyzing?
> 
> 
> 
>  maxGramSize="15" />
> 
>
> I guess (besides the performance impact) this reduces search results 
> accuracy?
>
> -Clemens
>
> -Original Message-
> From: Markus Jelsma
> Sent: Friday, 3 August 2018 12:43
> To: solr-user@lucene.apache.org
> Subject: RE: indexing two words, searching single word
>
> Hello,
>
> If your case is English you could use synonyms to work around the 
> problem of the few compound words of the language. However, would you 
> be dealing with a Germanic compound language, the 
> HyphenationCompoundWordTokenFilter
> [1] or DictionaryCompoundWordTokenFilter are a better choice. The 
> former is much more flexible but has its drawbacks.
>
> Regards,
> Markus
>
>
> https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
>
>
>
> -Original message-
> > From:Clemens Wyss DEV 
> > Sent: Friday 3rd August 2018 12:22
> > To: solr-user@lucene.apache.org
> > Subject: indexing two words, searching single word
> >
> > Sounds like a rather simple issue:
> > if I index "sound stage" and search for "soundstage" I get no hits
> >
> > What am I doing wrong
> > a) when indexing
> > b) when searching
> > ?
> >
> > Thx in advance
> > - Clemens
> >
>


AW: indexing two words, searching single word

2018-08-03 Thread Clemens Wyss DEV
> Because you probably are not looking for "andthe" kind of tokens
(Unfortunately) I guess I am, as we don't know what people enter...

> a shingle plus regex to remove whitespace
sounds interesting. What would that filter chain look like? Would that be a
type="index" analyzer?
I guess we could shingle after stop-word filtering, and I guess
maxShingleSize="2" would suffice.

-Original Message-
From: Alexandre Rafalovitch
Sent: Friday, 3 August 2018 13:33
To: solr-user
Subject: Re: indexing two words, searching single word

But what is your generic problem then. Because you probably are not looking for 
"andthe" kind of tokens.

However a shingle plus regex to remove whitespace can give you "anytwo 
wordstogether smooshed" tokens in the index.

Regards,
 Alex


On Fri, Aug 3, 2018, 7:19 AM Clemens Wyss DEV,  wrote:

> Hi Markus,
> thanks for the quick answer.
>
> "sound stage" was just an example. We are looking for a generic 
> solution ...
>
> Is it "ok" to apply an NGramFilter for query analysis?
> 
> 
> 
>  maxGramSize="15" />
> 
>
> I guess (besides the performance impact) this reduces search results 
> accuracy?
>
> -Clemens
>
> -Original Message-
> From: Markus Jelsma 
> Sent: Friday, 3 August 2018 12:43
> To: solr-user@lucene.apache.org
> Subject: RE: indexing two words, searching single word
>
> Hello,
>
> If your case is English you could use synonyms to work around the 
> problem of the few compound words of the language. However, would you 
> be dealing with a Germanic compound language, the 
> HyphenationCompoundWordTokenFilter
> [1] or DictionaryCompoundWordTokenFilter are a better choice. The 
> former is much more flexible but has its drawbacks.
>
> Regards,
> Markus
>
>
> [1] https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
>
>
>
> -Original message-
> > From:Clemens Wyss DEV 
> > Sent: Friday 3rd August 2018 12:22
> > To: solr-user@lucene.apache.org
> > Subject: indexing two words, searching single word
> >
> > Sounds like a rather simple issue:
> > if I index "sound stage" and search for "soundstage" I get no hits
> >
> > What am I doing wrong
> > a) when indexing
> > b) when searching
> > ?
> >
> > Thx in advance
> > - Clemens
> >
>


Re: indexing two words, searching single word

2018-08-03 Thread Alexandre Rafalovitch
But what is your generic problem, then? Because you probably are not looking
for "andthe" kind of tokens.

However, a shingle plus regex to remove whitespace can give you "anytwo
wordstogether smooshed" tokens in the index.
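
To make the idea concrete, here is a minimal Python sketch of what such an
index-time chain would emit. It is not Solr configuration and not Lucene's
implementation; the function names shingles and remove_whitespace are
hypothetical, and the sketch assumes two-word shingles with the single words
kept. In Solr, the corresponding pieces would presumably be
ShingleFilterFactory followed by PatternReplaceFilterFactory.

```python
import re


def shingles(tokens, max_size=2):
    """Emit every token plus the word n-grams (joined by a space) that start at it."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        for size in range(2, max_size + 1):
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out


def remove_whitespace(tokens):
    """Imitate a regex replace filter that deletes all whitespace inside a token."""
    return [re.sub(r"\s+", "", tok) for tok in tokens]


# Index-time analysis of the text from the original question.
index_tokens = remove_whitespace(shingles("sound stage".split()))
print(index_tokens)

# The same chain also produces the unwanted "andthe" kind of tokens.
noisy_tokens = remove_whitespace(shingles("cats and the dogs".split()))
print(noisy_tokens)
```

With these tokens in the index, a query for "soundstage" can match a document
that contained "sound stage", at the cost of a larger index.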

Regards,
 Alex


On Fri, Aug 3, 2018, 7:19 AM Clemens Wyss DEV,  wrote:

> Hi Markus,
> thanks for the quick answer.
>
> "sound stage" was just an example. We are looking for a generic solution
> ...
>
> Is it "ok" to apply an NGramFilter for query analysis?
> 
> 
> 
>  maxGramSize="15" />
> 
>
> I guess (besides the performance impact) this reduces search results
> accuracy?
>
> -Clemens
>
> -Original Message-
> From: Markus Jelsma 
> Sent: Friday, 3 August 2018 12:43
> To: solr-user@lucene.apache.org
> Subject: RE: indexing two words, searching single word
>
> Hello,
>
> If your case is English you could use synonyms to work around the problem
> of the few compound words of the language. However, would you be dealing
> with a Germanic compound language, the HyphenationCompoundWordTokenFilter
> [1] or DictionaryCompoundWordTokenFilter are a better choice. The former is
> much more flexible but has its drawbacks.
>
> Regards,
> Markus
>
>
> [1] https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
>
>
>
> -Original message-
> > From:Clemens Wyss DEV 
> > Sent: Friday 3rd August 2018 12:22
> > To: solr-user@lucene.apache.org
> > Subject: indexing two words, searching single word
> >
> > Sounds like a rather simple issue:
> > if I index "sound stage" and search for "soundstage" I get no hits
> >
> > What am I doing wrong
> > a) when indexing
> > b) when searching
> > ?
> >
> > Thx in advance
> > - Clemens
> >
>


AW: indexing two words, searching single word

2018-08-03 Thread Clemens Wyss DEV
Hi Markus,
thanks for the quick answer. 

"sound stage" was just an example. We are looking for a generic solution ...

Is it "ok" to apply an NGramFilter for query analysis?






I guess (besides the performance impact) this reduces search results accuracy?
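
The accuracy concern can be illustrated with a small Python sketch. It is not
Solr code: char_ngrams is a hypothetical helper, the three documents are made
up, and a minimum gram size of 3 is an assumption (only maxGramSize="15" is
visible in the quoted configuration). The sketch n-grams the query term only
and matches the grams against whole indexed tokens.

```python
def char_ngrams(term, min_gram, max_gram):
    """Return all character n-grams of term with lengths min_gram..max_gram."""
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(term) - size + 1):
            grams.add(term[start:start + size])
    return grams


# Query-side n-grams for the search term from the original question.
query_grams = char_ngrams("soundstage", 3, 15)

# Hypothetical index: each document is reduced to its set of whole-word tokens.
indexed = {
    "doc1": {"sound", "stage"},
    "doc2": {"price", "tag"},
    "doc3": {"old", "age"},
}

# A document matches as soon as any query gram equals one of its tokens.
hits = sorted(doc for doc, tokens in indexed.items() if tokens & query_grams)
print(hits)
```

doc1 is the wanted hit (via "sound" and "stage"), but doc2 and doc3 match as
well, because "tag" and "age" are also substrings of "soundstage".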

-Clemens

-Original Message-
From: Markus Jelsma  
Sent: Friday, 3 August 2018 12:43
To: solr-user@lucene.apache.org
Subject: RE: indexing two words, searching single word

Hello,

If your case is English you could use synonyms to work around the problem of 
the few compound words of the language. However, would you be dealing with a 
Germanic compound language, the HyphenationCompoundWordTokenFilter [1] or 
DictionaryCompoundWordTokenFilter are a better choice. The former is much more 
flexible but has its drawbacks.

Regards,
Markus

[1] https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html

 
 
-Original message-
> From:Clemens Wyss DEV 
> Sent: Friday 3rd August 2018 12:22
> To: solr-user@lucene.apache.org
> Subject: indexing two words, searching single word
> 
> Sounds like a rather simple issue:
> if I index "sound stage" and search for "soundstage" I get no hits
> 
> What am I doing wrong 
> a) when indexing
> b) when searching
> ?
> 
> Thx in advance
> - Clemens
> 


Support multiple language tokens in same field

2018-08-03 Thread Nitesh Kumar
Hi,

We work in the proxy business. To meet certain business needs of our
customers, we change some field information before sending it to Solr.
Below is the Solr schema.xml configuration for a particular field.


  




  
  





  



   
   

In some special cases, the values of these fields (field1, field2, etc.) can
be *CJK* text. That is, field1 and field2 store either plain *English* text
or *CJK* text. With *StandardTokenizer*, indexing and querying work fine for
plain *English* text, but not for *CJK* text.

When we index CJK text with the current configuration, the tokenizer breaks
it into individual characters (sometimes it also pairs multiple characters)
and indexes them. For example, if we index the text

field1: "*맯뭕禪玸킆諘叜葸*", StandardTokenizer breaks it into
  맯  뭕 禪玸 킆諘 叜葸 and indexes those tokens.

Later, when we search the same field with the same text,
q : field1: "*맯뭕禪玸킆諘叜葸*", it returns the expected result along with many
irrelevant results. Our assumption is that when *Lucene* breaks the query
into multiple tokens, it combines them with *OR*. Hence, if any single token
from (맯 OR 뭕 OR 禪玸 OR 킆諘 OR 叜 OR 葸) is present in a record, that
record is returned as a result.
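
If that assumption holds, the effect can be reproduced with a small Python
sketch. It is not Solr code: matches_any and matches_all are hypothetical
helpers, the three documents are made up, and the token list is the one
given above. It only contrasts OR matching with requiring every token.

```python
def matches_any(query_tokens, doc_tokens):
    """OR semantics: one shared token is enough for a match."""
    return any(tok in doc_tokens for tok in query_tokens)


def matches_all(query_tokens, doc_tokens):
    """AND semantics: every query token must be present."""
    return all(tok in doc_tokens for tok in query_tokens)


# Tokens produced for the example text, as listed in this message.
query_tokens = ["맯", "뭕", "禪玸", "킆諘", "叜葸"]

# Hypothetical documents, each reduced to its set of indexed tokens.
docs = {
    "exact": {"맯", "뭕", "禪玸", "킆諘", "叜葸"},
    "partial": {"맯", "other"},
    "unrelated": {"other"},
}

or_hits = sorted(name for name, toks in docs.items() if matches_any(query_tokens, toks))
and_hits = sorted(name for name, toks in docs.items() if matches_all(query_tokens, toks))
print(or_hits)
print(and_hits)
```

Under OR, the "partial" document is returned although it shares only one
token with the query. Solr's q.op request parameter sets the default
operator; whether q.op=AND is enough for this setup has not been verified
here.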

We also tried *LetterTokenizer*, but in many cases it does not behave the
same as *StandardTokenizer*. We also tried the *copy field* option, but it
is not feasible either, as the application layer is not flexible enough to
detect CJK tokens and change the query at runtime.

Please suggest any approach for indexing or querying that would let us
filter out the irrelevant results.

Thanks in advance.
Nitesh


RE: indexing two words, searching single word

2018-08-03 Thread Markus Jelsma
Hello,

If your case is English, you could use synonyms to work around the problem of 
the few compound words of the language. However, should you be dealing with a 
Germanic compound language, the HyphenationCompoundWordTokenFilter [1] or 
DictionaryCompoundWordTokenFilter would be a better choice. The former is much 
more flexible but has its drawbacks.
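
As a rough illustration of the dictionary-based idea, here is a simplified
Python imitation. It is not Lucene's implementation: decompound is a
hypothetical function, the two-word dictionary is made up, and the
min_subword and max_subword parameters are assumptions of this sketch. It
keeps the original token and adds every dictionary word found inside it.

```python
def decompound(token, dictionary, min_subword=2, max_subword=15):
    """Return the token followed by every dictionary word it contains."""
    parts = [token]
    for start in range(len(token)):
        for size in range(min_subword, max_subword + 1):
            end = start + size
            if end > len(token):
                break
            piece = token[start:end]
            if piece in dictionary:
                parts.append(piece)
    return parts


# Hypothetical dictionary of known words.
parts = decompound("soundstage", {"sound", "stage"})
print(parts)
```

Applied to the query term, this yields "sound" and "stage" in addition to
"soundstage", which is what allows a match against an indexed "sound stage".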

Regards,
Markus

[1] https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html

 
 
-Original message-
> From:Clemens Wyss DEV 
> Sent: Friday 3rd August 2018 12:22
> To: solr-user@lucene.apache.org
> Subject: indexing two words, searching single word
> 
> Sounds like a rather simple issue:
> if I index "sound stage" and search for "soundstage" I get no hits
> 
> What am I doing wrong 
> a) when indexing
> b) when searching
> ?
> 
> Thx in advance
> - Clemens
> 


indexing two words, searching single word

2018-08-03 Thread Clemens Wyss DEV
Sounds like a rather simple issue:
if I index "sound stage" and search for "soundstage" I get no hits

What am I doing wrong 
a) when indexing
b) when searching
?

Thx in advance
- Clemens