Re: Database logins and active sessions

2018-02-07 Thread Shawn Heisey

On 2/7/2018 11:40 PM, Srinivas Kashyap wrote:

We have configured our Solr index server on Tomcat and fetch data from a database
to index it. We have implemented delta-query indexing based on modify_ts.


What version of Solr?  Just as an FYI:  Since version 5.0, running in 
user-provided containers (like Tomcat) is not a supported configuration.


https://wiki.apache.org/solr/WhyNoWar


In our data-config.xml we have a parent entity and 17 child entities. We have 18
such Solr cores. When we call delta-import on a core, it executes 18 SQL queries
against the database.

Each delta-import opens a new session to the database. Although each login and
logout happens in a split second, we are seeing millions of logins and logouts
at the database.

As per our DBA, logins and logouts are costly operations in terms of server
resources.

Is there a way to reduce the number of logins and logouts and keep a
persistent DB connection from Solr?


Directly, with a JDBC driver configured in the dataimport handler? 
Probably not.  But it looks like there may be a workaround -- setting up 
a JNDI datasource in your servlet container, and letting that handle the 
connection pooling for you.


http://lucene.472066.n3.nabble.com/how-to-configure-mysql-pool-connection-on-Solr-Server-tp4038974p4039040.html

It is likely that your container can set up connection pooling with most 
JDBC drivers, not just MySQL.
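As a rough sketch of that workaround (the resource name, driver, credentials,
and pool sizes below are all placeholders; maxTotal/maxIdle are the
Tomcat 8 / DBCP2 attribute names, older Tomcats use maxActive/maxIdle), you
declare a pooled datasource in Tomcat's context.xml and point the dataimport
handler's dataSource at it by JNDI name:

    <!-- Tomcat context.xml: container-managed connection pool (names are examples) -->
    <Resource name="jdbc/solrDS" auth="Container" type="javax.sql.DataSource"
              driverClassName="com.mysql.jdbc.Driver"
              url="jdbc:mysql://dbhost:3306/mydb"
              username="solr" password="secret"
              maxTotal="10" maxIdle="5"/>

    <!-- data-config.xml: reference the pool instead of a raw JDBC URL -->
    <dataSource type="JdbcDataSource" jndiName="java:comp/env/jdbc/solrDS"/>

With that in place, each delta-import should check connections out of the pool
rather than opening a fresh database session every time.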


The dataimport handler is a useful module, but it has limitations.  If 
you write your own indexing program that is fully aware of your source 
data, you're likely to get better results.


Something else to consider -- sometimes by clever use of SQL JOINs, you 
can put the information gathering done by child entities into the main 
query of the parent entity.  If you can do that and eliminate all your 
child entities, then Solr will make exactly ONE query to your database 
for any import operation, and you won't need to worry about reusing open 
connections.
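As an illustration only (the table, column, and field names here are invented,
and multi-valued child data would need an aggregate such as MySQL's
GROUP_CONCAT to stay on one row per parent), a flattened entity could look
like:

    <entity name="item" pk="ID"
            query="SELECT p.id, p.name, c.color
                   FROM parent p LEFT JOIN child c ON c.parent_id = p.id"/>

The delta queries would flatten the same way, using the same JOIN.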


Thanks,
Shawn


Re: Design Question

2018-02-07 Thread Emir Arnautović
Hi Deepthi,
Is the dictionary static, or can the value for some ID change? If it is static,
and if query performance matters to you, the best and also the simplest solution
is to denormalise the data and store the dictionary values with the docs.

An alternative is to use the join query parser:
https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-JoiningAcrossCollections
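As a sketch with made-up collection and field names (note that for
cross-collection joins the "from" collection must be single-sharded and have a
replica on every node that serves the query):

    q={!join fromIndex=dictionary from=id to=ids}description:fox

This returns documents from the main collection whose multi-valued ids field
matches the id of any dictionary document matching 'fox'.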
 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 8 Feb 2018, at 07:21, Deepthi P  wrote:
> 
> I have a dictionary of 2 IDs and their descriptions, which is in one
> collection. There is another Solr collection in which each document has 10
> or more IDs (multi-valued field). I would like to do a text search in the
> dictionary, bring back the matched IDs, and then search for those IDs in the
> other Solr collection. For example, if I search for 'fox' in the dictionary
> collection and it returns 40 IDs, I would like to send those 40 IDs to the
> other collection and bring back the documents that have those IDs. Please
> suggest the best way to implement this.



Design Question

2018-02-07 Thread Deepthi P
I have a dictionary of 2 IDs and their descriptions, which is in one
collection. There is another Solr collection in which each document has 10
or more IDs (multi-valued field). I would like to do a text search in the
dictionary, bring back the matched IDs, and then search for those IDs in the
other Solr collection. For example, if I search for 'fox' in the dictionary
collection and it returns 40 IDs, I would like to send those 40 IDs to the
other collection and bring back the documents that have those IDs. Please
suggest the best way to implement this.


Database logins and active sessions

2018-02-07 Thread Srinivas Kashyap
Hello,

We have configured Solr index server on tomcat and fetch the data from database 
to index the data. We have implemented delta query indexing based on modify_ts.

In our data-config.xml we have a parent entity and 17 child entity. We have 18 
such solr cores. When we call delta-import on a core, it executes 18 SQL query 
to query database.

Each time delta-import is opening a new session onto database. Log-in and 
log-out though happening at a split second, we are finding millions of login 
and logout at database.

As per our DBA, login and logout are costly operation in terms of server 
resources.

Is there a way to reduce the number of  logins and logouts and have a 
persistent DB connection from solr?

Thanks and Regards,
Srinivas Kashyap

DISCLAIMER: 
E-mails and attachments from TradeStone Software, Inc. are confidential.
If you are not the intended recipient, please notify the sender immediately by
replying to the e-mail, and then delete it without making copies or using it
in any way. No representation is made that this email or any attachments are
free of viruses. Virus scanning is recommended and is the responsibility of
the recipient.

Fwd: Design Question

2018-02-07 Thread Deepthi P
I have a dictionary of 2 IDs and their descriptions, which is in one
collection. There is another Solr collection in which each document has 10
or more IDs (multi-valued field). I would like to do a text search in the
dictionary, bring back the matched IDs, and then search for those IDs in the
other Solr collection. For example, if I search for 'fox' in the dictionary
collection and it returns 40 IDs, I would like to send those 40 IDs to the
other collection and bring back the documents that have those IDs. Please
suggest the best way to implement this.


Normalizing payload values

2018-02-07 Thread Shreya Kampli
Hi,

I am using the Payload Score query parser as below:
 {!payload_score f=field v=$q func=max includeSpanScore=true}.

The issue is that the payload values in this field are around the range
1-1.
Due to this, the boosts added to other fields are never effective, as
most of the score comes from the payload score.

Is there a way to have only a fraction of the payload score affect the
whole score during querying?


Hard commits blocked | non-solrcloud v6.6.2

2018-02-07 Thread mmb1234

I am seeing that after some time, hard commits in all my solr cores stop, and
each core's searcher has an "opened at" date from hours ago, even though the
cores are continuing to ingest data successfully (index size increasing
continuously).

http://localhost:8983/solr/#/x-core/plugins?type=core&entry=searcher
"openedAt: 2018-02-08T01:52:24.950Z"

Is there something I am doing incorrectly in my config?

I set up my solrconfig.xml without soft commits for my "bulk indexing" use
cases.

solrConfig.xml:-
  
none
200

  1
  5

  

  

  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:1}</maxTime>
    <openSearcher>true</openSearcher>
  </autoCommit>

  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
    <maxDocs>${solr.autoSoftCommit.maxDocs:-1}</maxDocs>
    <openSearcher>false</openSearcher>
  </autoSoftCommit>

Thread dump:-
"commitScheduler-20-thread-1" #391 prio=5 os_prio=0 tid=0x7ef194011000
nid=0x43a in Object.wait() [0x7ec99533f000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at
org.apache.lucene.index.ConcurrentMergeScheduler.doStall(ConcurrentMergeScheduler.java:616)
- eliminated <0x00027005a0f0> (a
org.apache.lucene.index.ConcurrentMergeScheduler)
at
org.apache.lucene.index.ConcurrentMergeScheduler.maybeStall(ConcurrentMergeScheduler.java:602)
- locked <0x00027005a0f0> (a
org.apache.lucene.index.ConcurrentMergeScheduler)
at
org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:524)
- locked <0x00027005a0f0> (a
org.apache.lucene.index.ConcurrentMergeScheduler)
at
org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2083)
at
org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:487)
at
org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
at
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:276)
at
org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:235)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1980)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2189)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1926)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:675)
- locked <0x00026f20bb88> (a java.lang.Object)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Bi Gram token generation with fuzzy searches

2018-02-07 Thread Sravan Kumar
@Emir: The 'sow' parameter in edismax along with the nested query
'_query_' works. Tuning has to be done for the desired relevancy.

@Walter: It would be nice to have SOLR-629 integrated into the project. As
Emir suggested, _query_ caters to my need by applying the fuzzy parameter to
the query. Anyways, I will apply the patch and give it a try.
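A hedged sketch of that nested form, for reference (qf fields as discussed in
this thread; $qq is just an arbitrary parameter name, and quoting/escaping may
need adjustment):

    q=_query_:"{!edismax sow=false qf=title_bigrams v=$qq}" OR
      _query_:"{!edismax qf=title v=$qq}"&qq=some movie title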


On Wed, Feb 7, 2018 at 8:42 PM, Walter Underwood 
wrote:

> I think you need the feature in SOLR-629 that adds fuzzy to edismax.
>
> https://issues.apache.org/jira/browse/SOLR-629
>
> The patch on that issue is for Solr 4.x, but I believe someone is working
> on a new patch.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 7, 2018, at 2:10 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> >
> > Hi Sravan,
> > Edismax has a ’sow’ parameter that causes edismax to pass the query to
> field analysis, but I'm not sure how it will work with fuzzy search. What you
> might do is use the _query_ syntax to separate shingle and non-shingle queries,
> e.g.
> > q=_query({!edismax sow=false qf=title_bigrams}$v) OR _query({!edismax
> qf=title}$v)&$v=some movie title
> >
> > HTH,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> >> On 7 Feb 2018, at 10:55, Sravan Kumar  wrote:
> >>
> >> We have the following two fields for our movie title search
> >> - title without symbols
> >> a custom analyser with WordDelimiterFilterFactory, SynonymFilterFactory
> and
> >> other filters to retain only alpha numeric characters.
> >> - title with word bi grams
> >> a custom analyser with solr.ShingleFilterFactory to generate "bi gram"
> word
> >> tokens with '_' as separator.
> >>
> >> A custom similarity class is used to make tf & idf values as 1.
> >>
> >> Edismax query parser is used to perform all searches. Phrase boosting
> (pf)
> >> is also used.
> >>
> >> There are a couple of issues while searching:
> >> 1>  BiGram field doesn't generate bi grams if the white spaces in the
> query
> >> are not escaped.
> >> - For example, if the query is "pursuit of happyness", then bi grams are
> >> not generated.  This is due to the fact that the edismax query parser
> >> tokenizes based on whitespaces before passing the string to
> >> analyser(correct me if I am wrong).
> >> But in the case of "pursuit\ of\ happyness", they are, since the string
> >> passed to the analyser includes the whitespace.
> >>
> >> 2>  Fuzzy search doesn't work in  whitespace escaped queries.
> >> Ex: "pursuit~2\ of\ happiness~1"
> >>
> >> 3> Edismax's Phrase boosting doesn't work the way it should in
> >> non-whitespace escaped fuzzy queries.
> >>
> >> If the query is "pursuit~2 of happiness~1" (without escaping
> whitespaces)
> >>
> >> fuzzy queries are generated
> >> (title_name:pursuit~2), (title_name:happiness~1) in the parsed query.
> >> But,edismax pf (phrase boost) generates query like
> >> title_name:"pursuit (2 pursuit2) of happiness (1 happiness1)"
> >> This means the analyser got the original query consisting the fuzzy
> >> operator for phrase boosting.
> >>
> >>
> >> 1> How should whitespaces be handled in the case of filters like
> >> solr.ShingleFilterFactory to generate bi grams?
> >> 2> If generating bi grams requires whitespaces escaped and fuzzy searches
> >> do not, how do we accommodate both of these in a single Solr request, scored
> >> together?
> >>
> >>
> >>
> >> -
> >> --
> >> Regards,
> >> Sravan
> >
>
>


-- 
Regards,
Sravan


Re: Best Practice about solr cloud schema

2018-02-07 Thread Erick Erickson
It can pretty much be used as-is, _except_

you'll find one or more entries in your request handlers like:
   <str name="df">_text_</str>

Change "_text_" to something in your schema, that's the default search
field if you don't field-qualify your search terms.

Note that if you take out, for instance, all of your non-English
fieldTypes, you can also remove most of the stuff under the /lang
folder.

I essentially always test this out on a local, stand-alone instance
until I can index a few documents and query them; it's faster than
always having to remember to move the configs to ZooKeeper.

Best,
Erick

On Wed, Feb 7, 2018 at 7:14 PM, Pratik Patel  wrote:
> Hey Erick, thanks for the clarification! What about the solrConfig.xml file?
> Sure, it should be customized to suit one's needs, but can it be used as a
> base, or is it best to create one from scratch?
>
> Thanks,
> Pratik
>
> On Wed, Feb 7, 2018 at 5:29 PM, Erick Erickson 
> wrote:
>
>> That's really the point of the default managed-schema, to be a base
>> you use for your customizations. In fact, I often _remove_ most of the
>> fields (and especially fieldTypes) that I don't need. This includes
>> dynamic fields, copyFields and the like.
>>
>> Sometimes it's actually easier, though, to just start all over.
>>
>> BTW, do not delete any field that begins and ends with an underscore,
>> e.g. _version_, unless you know exactly what the consequences are.
>>
>> Best,
>> Erick
>>
>> On Wed, Feb 7, 2018 at 2:59 PM, Pratik Patel  wrote:
>> > Hello all,
>> >
>> > I have added some fields to default managed-schema file. I was wondering
>> if
>> > it is safe to take default managed-schema file as is and add your own
>> > fields to it in production. What is the best practice for this? As I
>> > understand, it should be safe to use default schema as base if documents
>> > that are going to be indexed in solr will only have newly defined fields
>> in
>> > it. In fact, it helps because the common field types are already defined
>> in
>> > default schema which can be re-used. I looked through the documentation
>> but
>> > couldn't find the answer and more clarity on this would be helpful.
>> >
>> > Is it safe to use the default managed-schema file as a base and add your own fields
>> > to it?
>> >
>> > Thanks,
>> > Pratik
>>


Re: Best Practice about solr cloud schema

2018-02-07 Thread Pratik Patel
Hey Erick, thanks for the clarification! What about the solrConfig.xml file?
Sure, it should be customized to suit one's needs, but can it be used as a
base, or is it best to create one from scratch?

Thanks,
Pratik

On Wed, Feb 7, 2018 at 5:29 PM, Erick Erickson 
wrote:

> That's really the point of the default managed-schema, to be a base
> you use for your customizations. In fact, I often _remove_ most of the
> fields (and especially fieldTypes) that I don't need. This includes
> dynamic fields, copyFields and the like.
>
> Sometimes it's actually easier, though, to just start all over.
>
> BTW, do not delete any field that begins and ends with an underscore,
> e.g. _version_, unless you know exactly what the consequences are.
>
> Best,
> Erick
>
> On Wed, Feb 7, 2018 at 2:59 PM, Pratik Patel  wrote:
> > Hello all,
> >
> > I have added some fields to default managed-schema file. I was wondering
> if
> > it is safe to take default managed-schema file as is and add your own
> > fields to it in production. What is the best practice for this? As I
> > understand, it should be safe to use default schema as base if documents
> > that are going to be indexed in solr will only have newly defined fields
> in
> > it. In fact, it helps because the common field types are already defined
> in
> > default schema which can be re-used. I looked through the documentation
> but
> > couldn't find the answer and more clarity on this would be helpful.
> >
> > Is it safe to use the default managed-schema file as a base and add your own fields
> > to it?
> >
> > Thanks,
> > Pratik
>


Re: How to form a boolean query such that it wont evaluate the right hand side if it isn't necessary

2018-02-07 Thread Erick Erickson
Agree with Walter, this is seeming like an XY problem. Also, Solr does
_not_ implement strict boolean logic, see:
https://lucidworks.com/2011/12/28/why-not-and-or-and-not/

Best,
Erick

On Wed, Feb 7, 2018 at 1:49 PM, Walter Underwood  wrote:
> I understand what you are asking for. Solr doesn’t work like that. Solr is
> not a programming language. Short-circuit evaluation isn’t especially useful
> for a search engine.
>
> Most of the work is fetching and uncompressing the posting lists. Calculating 
> the score for each document is pretty fast.
>
> Express your query in the Solr/Lucene query language and time it.
>
> If field1:value1 is required and field2:value2 is optional, your query should 
> be expressed like this:
>
> +field1:value1 field2:value2
>
> Also, this is beginning to feel like an X-Y problem. What are you trying to 
> achieve with this evaluation requirement?
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Feb 7, 2018, at 1:41 PM, bbarani  wrote:
>>
>> Walter, It's just that I have a use case (to evaluate one field over another)
>> for which I am trying out multiple solutions in order to avoid making
>> multiple calls to SOLR.
>>
>> I am trying to do a Short-circuit evaluation.
>>
>> Short-circuit evaluation, minimal evaluation, or McCarthy evaluation (after
>> John McCarthy) is the semantics of some Boolean operators in some
>> programming languages in which the second argument is executed or evaluated
>> only if the first argument does not suffice to determine the value of the
>> expression
>>
>>
>>
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Best Practice about solr cloud schema

2018-02-07 Thread Erick Erickson
That's really the point of the default managed-schema, to be a base
you use for your customizations. In fact, I often _remove_ most of the
fields (and especially fieldTypes) that I don't need. This includes
dynamic fields, copyFields and the like.

Sometimes it's actually easier, though, to just start all over.

BTW, do not delete any field that begins and ends with an underscore,
e.g. _version_, unless you know exactly what the consequences are.

Best,
Erick

On Wed, Feb 7, 2018 at 2:59 PM, Pratik Patel  wrote:
> Hello all,
>
> I have added some fields to default managed-schema file. I was wondering if
> it is safe to take default managed-schema file as is and add your own
> fields to it in production. What is the best practice for this? As I
> understand, it should be safe to use default schema as base if documents
> that are going to be indexed in solr will only have newly defined fields in
> it. In fact, it helps because the common field types are already defined in
> default schema which can be re-used. I looked through the documentation but
> couldn't find the answer and more clarity on this would be helpful.
>
> > Is it safe to use the default managed-schema file as a base and add your own fields
> to it?
>
> Thanks,
> Pratik


Best Practice about solr cloud schema

2018-02-07 Thread Pratik Patel
Hello all,

I have added some fields to default managed-schema file. I was wondering if
it is safe to take default managed-schema file as is and add your own
fields to it in production. What is the best practice for this? As I
understand, it should be safe to use default schema as base if documents
that are going to be indexed in solr will only have newly defined fields in
it. In fact, it helps because the common field types are already defined in
default schema which can be re-used. I looked through the documentation but
couldn't find the answer and more clarity on this would be helpful.

Is it safe to use the default managed-schema file as a base and add your own fields
to it?

Thanks,
Pratik


Re: can you migrate solr index files from osx to linux

2018-02-07 Thread Jeff Dyke
I forgot to report back on this. For anyone that runs into it, you need
the entire data directory, not just the index directory; at least that's
what made it work for me.
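For the record, a sketch of the copy itself (paths are illustrative; stop Solr
or unload the core on both ends before packing/unpacking):

    # on the source machine
    tar -C /var/solr/data -czf mycore-data.tar.gz mycore/data
    scp mycore-data.tar.gz user@newhost:/var/solr/data/
    # on the target: restore mycore/data (with its index/ and tlog/
    # subdirectories), then start Solr or reload the core
    tar -C /var/solr/data -xzf mycore-data.tar.gz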

On Thu, Feb 1, 2018 at 9:52 PM, Erick Erickson 
wrote:

> I think SCP will be fine. Shawn's comment is probably the issue.
>
> Best,
> Erick
>
> On Thu, Feb 1, 2018 at 4:34 PM, Shawn Heisey  wrote:
> > On 2/1/2018 4:32 PM, Jeff Dyke wrote:
> >> I just created a tar file, actually a tar.gz file and scp'd to a
> server, at
> >> first i was worried that the gzip caused issues, but as i mentioned no
> >> errors on start up, and i thought i would see some.  @Erick, how would
> you
> >> recommend.  This is going to be less of an issue b/c i need to build the
> >> index programmatically anyway, but would be nice to know if only for
> >> curiosity.  Perhaps making a replication backup and then restoring on
> the
> >> new server would be better.  In the middle of other things now, will
> try a
> >> few of those, plus some other ideas.
> >
> > I think the problem is that you're copying the index files into
> > ${instanceDir}/data and not ${instanceDir}/data/index.  The index
> > directory is what Solr is actually going to use.
> >
> > Delete everything that already exists in the index directory before
> > putting the files in there.
> >
> > You probably don't need to do a full restart, you could probably just
> > reload the core.
> >
> > Thanks,
> > Shawn
>


Judging the MoreLikeThis results for relevancy

2018-02-07 Thread Arnold Bronley
Hi,

I am using the MoreLikeThis handler to get related documents for a given
document. To determine whether I am getting good results, here is what I do:

The same original document should be returned as the top match.

If it is not, then there is some problem with the relevancy.

Then, since the same input document will be a 100% match with itself, we can
use its absolute score as a baseline and judge how the other documents (ranked
2nd, 3rd, and so on) are doing in terms of relevancy by comparing their scores
to the score of the top result.

Is this a good idea?

Do you see any flaw in this logic?


Re: Spellcheck collations results

2018-02-07 Thread Arnold Bronley
Thanks for replying, Alessandro.

I am passing these parameters:

q=polt=polt=json=true=true=7=true=true=true=3=3=true=0.72





On Thu, Jan 25, 2018 at 4:28 AM, alessandro.benedetti 
wrote:

> Can you tell us the request parameters used for the spellcheck ?
>
> In particular are you using these ? (from the wiki) :
>
> " The *spellcheck.maxCollationTries* Parameter
> This parameter specifies the number of collation possibilities for Solr to
> try before giving up. Lower values ensure better performance. Higher values
> may be necessary to find a collation that can return results. The default
> value is 0, which maintains backwards-compatible (Solr 1.4) behavior (do
> not
> check collations). This parameter is ignored if spellcheck.collate is
> false.
>
> The *spellcheck.maxCollationEvaluations* Parameter
> This parameter specifies the maximum number of word correction combinations
> to rank and evaluate prior to deciding which collation candidates to test
> against the index. This is a performance safety-net in case a user enters a
> query with many misspelled words. The default is 10,000 combinations, which
> should work well in most situations. "
>
> Regards
>
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Relevancy Tuning For Solr With Apache Nutch 2.3

2018-02-07 Thread Mukhopadhyay, Aratrika
Hello,
 I am attempting to tune the results I retrieve from Solr to boost
the importance of certain fields. The syntax of the query I am using is as
follows:
http://localhost:8983/solr/housegov_data/select?indent=on&q=QUERY&defType=edismax&qf=FIELD1^20.0%20FIELD2^0.03&wt=json
 The issue is that this is not boosting anything in most cases, or it isn't
able to find any documents that match the criteria. I have used Nutch to crawl
websites and indexed the data into Solr. I see that Nutch applies an index-time
boost as well. Could that have something to do with this? Can anyone look at
the format of this query and point out any mistakes I am making?


FYI : I am using a data driven schema.
Regards,
Aratrika Mukhopadhyay


Re: How to form a boolean query such that it wont evaluate the right hand side if it isn't necessary

2018-02-07 Thread Walter Underwood
I understand what you are asking for. Solr doesn’t work like that. Solr is not
a programming language. Short-circuit evaluation isn’t especially useful for a
search engine.

Most of the work is fetching and uncompressing the posting lists. Calculating 
the score for each document is pretty fast.

Express your query in the Solr/Lucene query language and time it.

If field1:value1 is required and field2:value2 is optional, your query should 
be expressed like this:

+field1:value1 field2:value2

Also, this is beginning to feel like an X-Y problem. What are you trying to 
achieve with this evaluation requirement?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 7, 2018, at 1:41 PM, bbarani  wrote:
> 
> Walter, It's just that I have a use case (to evaluate one field over another)
> for which I am trying out multiple solutions in order to avoid making
> multiple calls to SOLR. 
> 
> I am trying to do a Short-circuit evaluation.
> 
> Short-circuit evaluation, minimal evaluation, or McCarthy evaluation (after
> John McCarthy) is the semantics of some Boolean operators in some
> programming languages in which the second argument is executed or evaluated
> only if the first argument does not suffice to determine the value of the
> expression
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: How to form a boolean query such that it wont evaluate the right hand side if it isn't necessary

2018-02-07 Thread bbarani
Walter, It's just that I have a use case (to evaluate one field over another)
for which I am trying out multiple solutions in order to avoid making
multiple calls to SOLR. 

I am trying to do a Short-circuit evaluation.

Short-circuit evaluation, minimal evaluation, or McCarthy evaluation (after
John McCarthy) is the semantics of some Boolean operators in some
programming languages in which the second argument is executed or evaluated
only if the first argument does not suffice to determine the value of the
expression



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr Autoscaling multi-AZ rules

2018-02-07 Thread Jeff Wartes
I’ve been messing around with the Solr 7.2 autoscaling framework this week. 
Some things seem trivial, but I’m also running into questions and issues. If 
anyone else has experience with this stuff, I’d be glad to hear it. 
Specifically:


Context:
-One collection, consisting of 42 shards, where up to 6 shards can fit on a 
single node. (which means 7 nodes per Replication Factor)
-Three AZs, each with its own ip_2 value.

Goals:

Goal: Fully utilize available nodes.
Cluster Preference: {“maximize”: "cores”}

Goal: No node should have more than one replica of a given shard
Rule: {"replica": "<2", "shard": "#EACH", "node": "#ANY"}

Goal: No node should have more than 6 shards
Rule: {"replica": "<7", "node":"#ANY"}

Goal: Where possible, distinct RFs should each exist in an AZ.
(Example1: I’d like 7 nodes with a complete RF in AZ 1 and 7 nodes with a 
complete RF in AZ 2, and not end up with, say, both shard2 replicas in AZ 1)
(Example2: If I have 14 nodes in AZ 1 and 7 in AZ 2, I should have two full RFs 
in AZ 1 and one in AZ 2)
Rule: ???

I could have multiple non-strict rules perhaps? Like:
{"replica": "<2", "shard": "#EACH", "ip_2": "1", "strict":false}
{"replica": "<3", "shard": "#EACH", "ip_2": "1", "strict":false}
{"replica": "<4", "shard": "#EACH", "ip_2": "1", "strict":false}
{"replica": "<2", "shard": "#EACH", "ip_2": "2", "strict":false}
{"replica": "<3", "shard": "#EACH", "ip_2": "2", "strict":false}
{"replica": "<4", "shard": "#EACH", "ip_2": "2", "strict":false}
etc
So having more than one RF in an AZ is a technical “violation”, but if 
placement minimizes non-strict violations, replicas would tend to get placed 
correctly.


Given a working set of rules, I’m still having trouble with two things:

  1.  I’ve manually created the “.system” collection, as it didn’t seem to get
created automatically. However, autoscaling activity is not getting logged to
it.
  2.  I can’t seem to figure out how to scale up.
      *  I’d presumed editing the collection’s “replicationFactor” would do the
trick, but it does not.
      *  The “node-up” trigger will serve to replace lost replicas, but won’t
otherwise take advantage of additional capacity. There’s a UTILIZENODE command
in 7.2, but it appears that’s still something you need to trigger manually.
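For reference, the manual invocation looks something like this (node name is
an example):

    /admin/collections?action=UTILIZENODE&node=host1.example.com:8983_solr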

Anyone played with this stuff?


Re: How to form a boolean query such that it wont evaluate the right hand side if it isn't necessary

2018-02-07 Thread Walter Underwood
You don’t get to control the order of execution, other than specifying a filter 
query.

I think you have the wrong mental model of how Solr does search.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 7, 2018, at 1:28 PM, bbarani  wrote:
> 
> You are right. I don't care about the score; rather, I want a document
> containing a specific term in a specific field to be evaluated first before
> checking the next field.
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Payload fields

2018-02-07 Thread Brian Yee
Hello,

I am trying to use Payload fields to store per-zone delivery dates for 
products. I have an index where my documents are products and for each product 
we want to store a date by when we can deliver that product for 1-100 different 
zones. Since the payload() function only supports int and float, I am 
representing my dates as 20180210. So my payload field looks like this:
1|20180210 2|20180211 3|20180212 4|20180213 5|20180214

However, when I load that data into my index and add
"fl=Zone2:payload(DeliveryPayload,2)", I get the wrong result in my output. It
shows me "Zone2": 20180212.

I don't understand why. The only thing I can think of is that the int is too 
big. Is there a size limit? If I reduce it to a 6 or 7 digit int, things work 
as expected. Shouldn't INT handle up to 2B?
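One possible explanation, offered as a guess rather than anything confirmed in
this thread: the payload() function deals in 32-bit floats, and a float's
24-bit significand means not every integer above 2^24 = 16,777,216 is
representable. Between 2^24 and 2^25 only even integers survive:

    2^24 = 16,777,216  <  20,180,211  <  33,554,432 = 2^25
    20,180,211 is odd  ->  rounds (half to even) to 20,180,212

which matches the "Zone2": 20180212 shown above. A 32-bit int does hold values
up to ~2.1 billion, but a 32-bit float cannot hold an 8-digit date exactly;
that would also explain why 6- or 7-digit values work.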

My schema.xml looks like this:



Re: How to form a boolean query such that it wont evaluate the right hand side if it isn't necessary

2018-02-07 Thread bbarani
Thanks Erick. I will check this out.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to form a boolean query such that it wont evaluate the right hand side if it isn't necessary

2018-02-07 Thread bbarani
You are right. I don't care about the score; rather, I want a document
containing a specific term in a specific field to be evaluated first before
checking the next field.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: MODIFYCOLLECTION via Solrj

2018-02-07 Thread Erick Erickson
Yeah, sometimes the sugar methods/classes in SolrJ lag a bit behind
the Collections API. But at root, about all these classes do is create
a ModifiableSolrParams with all the params you'd specify and make an
HTTP call via the AsyncCollectionAdminRequest.process command, last I
knew.
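A minimal sketch of that raw-params approach (the collection name and
replicationFactor value are examples, error handling is omitted, and "client"
is assumed to be an existing CloudSolrClient):

    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.request.GenericSolrRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    // Build the Collections API parameters by hand ...
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("action", "MODIFYCOLLECTION");
    params.set("collection", "myCollection");   // example collection name
    params.set("replicationFactor", 2);
    // ... and send them to the Collections API endpoint.
    client.request(new GenericSolrRequest(
        SolrRequest.METHOD.POST, "/admin/collections", params));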

Best,
Erick


On Wed, Feb 7, 2018 at 10:23 AM, Hendrik Haddorp
 wrote:
> Hi,
>
> I'm unable to find how I can do a MODIFYCOLLECTION via Solrj. I would like
> to change the replication factor of a collection but can't find it in the
> Solrj API. Is that not supported?
>
> regards,
> Hendrik


Re: How to form a boolean query such that it wont evaluate the right hand side if it isn't necessary

2018-02-07 Thread Erick Erickson
If you don't care about its contribution to scoring, one option is to
move the clause you want evaluated to an fq clause with {!cache=false
cost=101}. See: http://yonik.com/advanced-filter-caching-in-solr/
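For example, with the clauses from earlier in this thread (note that an fq
makes the clause required rather than optional, and cost >= 100 only
post-filters query types that support it; cheaper non-cached filters still run
first):

    q=searchTerms:"testing"&fq={!cache=false cost=101}matchStemming:"stemming"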

Best,
Erick

On Wed, Feb 7, 2018 at 12:05 PM, Emir Arnautović
 wrote:
> Hi,
> Also note that the score is different if only one term matches versus if both
> terms are matched. Your case would make sense if you do not plan to order by
> score, but as Walter explained, Solr does not go document by document and
> evaluate query conditions; it gets the list of documents matching each part of
> the boolean query (you can think of them as bitsets) and does
> union/intersection to get the final result.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>> On 7 Feb 2018, at 19:38, Walter Underwood  wrote:
>>
>> That doesn’t really make sense for Solr query evaluation. It fetches the 
>> posting lists for each term, then walks through them evaluating the query 
>> against all the documents.
>>
>> It can skip a document as soon as it fails the query, but it still has to 
>> fetch the posting lists.
>>
>> So, that feature doesn’t exist because it isn’t useful.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>> On Feb 7, 2018, at 9:50 AM, bbarani  wrote:
>>>
>>>
>>> I am trying to figure out a way to form a boolean (||) query in Solr.
>>> Ideally my expectation is that with the boolean operator ||, if the first
>>> term is true the second term shouldn't be evaluated.
>>>
>>> q=searchTerms:"testing" || matchStemming:"stemming"
>>> works the same as
>>> q=searchTerms:"testing" OR matchStemming:"stemming"
>>>
>>> Is there a way to form a boolean query such that it won't evaluate the right
>>> hand side if it isn't necessary?
>>>
>>>
>>>
>>> --
>>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>
>


Re: Solr Swap space

2018-02-07 Thread Shawn Heisey

On 2/7/2018 12:01 PM, Susheel Kumar wrote:

Just trying to find where we set the swap space available to the Solr process.
I see on our 6.0 instances it was set to 2GB, and on our 6.6 instances it's set
to 16GB.


Solr has absolutely no involvement or control over swap space.  Neither 
does Java.  This is a function of your operating system's memory 
management, and is typically set up when you first install your OS.


https://www.linux.com/news/all-about-linux-swap-space
https://en.wikipedia.org/wiki/Paging#Windows_NT

If your system is using swap space, it's a strong indication that you 
don't have enough memory installed.  If any of the memory that Solr uses 
is swapped out to disk, Solr performance is going to be REALLY bad.


Thanks,
Shawn


Re: Solr Swap space

2018-02-07 Thread Emir Arnautović
Hi Susheel,
Swap space is an OS thing, not a Solr thing. You should look at how to disable
swap space, or at least set swappiness to some low number, on your OS.
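On Linux that looks something like this (the value 1 is just a common low
choice; add vm.swappiness to /etc/sysctl.conf or a file under /etc/sysctl.d/
to persist it across reboots):

    cat /proc/sys/vm/swappiness    # check the current value
    sudo sysctl vm.swappiness=1    # lower it for the running system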

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Feb 2018, at 20:01, Susheel Kumar  wrote:
> 
> Hello,
> 
> Just trying to find where we set the swap space available to the Solr process.
> I see on our 6.0 instances it was set to 2GB, and on our 6.6 instances it's set
> to 16GB.
> 
> Thanks,
> Susheel



Re: How to form a boolean query such that it wont evaluate the right hand side if it isn't necessary

2018-02-07 Thread Emir Arnautović
Hi,
Also note that the score is different if only one term matches versus if both
terms are matched. Your case would make sense if you do not plan to order by
score, but as Walter explained, Solr does not go document by document and
evaluate query conditions; it gets the list of documents matching each part of
the boolean query (you can think of them as bitsets) and does
union/intersection to get the final result.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Feb 2018, at 19:38, Walter Underwood  wrote:
> 
> That doesn’t really make sense for Solr query evaluation. It fetches the 
> posting lists for each term, then walks through them evaluating the query 
> against all the documents.
> 
> It can skip a document as soon as it fails the query, but it still has to 
> fetch the posting lists.
> 
> So, that feature doesn’t exist because it isn’t useful.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Feb 7, 2018, at 9:50 AM, bbarani  wrote:
>> 
>> 
>> I am trying to figure out a way to form a boolean (||) query in Solr.
>> Ideally my expectation is that with the boolean operator ||, if the first
>> term is true the second term shouldn't be evaluated.
>> 
>> q=searchTerms:"testing" || matchStemming:"stemming"
>> works the same as
>> q=searchTerms:"testing" OR matchStemming:"stemming"
>> 
>> Is there a way to form a boolean query such that it won't evaluate the right
>> hand side if it isn't necessary?
>> 
>> 
>> 
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 



DataWorks Summit San Jose -Call For Abstract closes this Friday

2018-02-07 Thread Ana Castro
Hi Folks,

This Friday is the last day to submit abstracts for talks around Solr and
Big Data Search. Could you please help reach out to other people in the Solr
community to get the word out?

Regards,
Ana Castro

Hi Folks,

DataWorks Summit San Jose is in 
full swing.   DWS San Jose will be June 17-21.  The deadline to submit 
DataWorks Summit abstracts for San Jose is Feb 9th.   Please consider 
submitting an abstract and help encourage the Solr community/users/customers to 
do the same (tweet, post, blog, apache forums …).  The key details are below 
and a sample e-mail you can forward is provided.

Please reach out to the customers/partners and Apache Committers to get 
abstracts submitted.  Feel free to copy the footer DWS San Jose image below and 
add it to your e-mail signature.

I have also attached a slide you can use in your presentations to let customers 
know about the event.

Here are the key links:
https://dataworkssummit.com/abstracts/ - link to submit an abstract
https://dataworkssummit.com/san-jose-2018/ - link that describes this year's tracks

We need your help to continue to get the word out. Can you please take a few
moments and post on your favorite social channel?
Tweet to RT: https://twitter.com/DataWorksSummit/status/955917606413897729
LinkedIn Post to Share: 
https://www.linkedin.com/feed/update/urn:li:activity:6361701131600609280
Facebook Post to 
Share: https://www.facebook.com/DataWorksSummit/photos/a.1124797800909597.1073741858.265537176835668/1629781057077933/?type=3


Write your own post to share to your networks. Images are attached to use:
Tweet copy:
Are you interested in speaking at @DataWorksSummit San Jose? Submit your 
abstract before the Feb 9 deadline! https://dataworkssummit.com/san-jose-2018/ 
#DWS18

Facebook/LinkedIn Post:
Submit your abstract for DataWorks Summit San Jose before the Feb 9 deadline! 
More info here: https://dataworkssummit.com/san-jose-2018/


Sample Content for your Email:
Subject: DataWorks Summit San Jose -Submit Your Abstract Now!

Master the possibilities for next-gen big data. DataWorks Summit is the 
industry’s premier event focusing on next-generation big data solutions. Join 
us and learn from industry experts and peers about how open source technologies 
such as Apache Hadoop, Apache Spark, Apache NiFi and extended Big Data 
ecosystem, enables you to leverage data to drive predictive analytics, 
distributed deep-learning and artificial intelligence initiatives across global 
organizations.

Would you like to share your knowledge with the best and brightest in the open 
source Big Data community and be recognized as an industry expert? The 
DataWorks Summit Organizing Committee invites you to submit an 
abstract to be considered for the 
summit in San Jose on June 17-21.

We are looking for abstracts for the following 
tracks:
· Data Warehousing and Operational Data Stores
· Artificial Intelligence and Data Science
· Big Compute and Storage
· IoT and Streaming
· Cyber Security
· Governance & Security
· Cloud & Operations
· Enterprise Adoption

To learn more about the process and submit your abstract, please click 
here.

The submission deadline is Friday, Feb 9, 2018.
Don't miss this opportunity - submit your abstract now!
SUBMIT YOUR ABSTRACT 

If you have general questions or require any further information about the
summit, please contact Justin Mounts. If you have specific questions about
submitting an abstract, please reach out to me (Rafael Coss).



Solr Swap space

2018-02-07 Thread Susheel Kumar
Hello,

Just trying to find where we set the swap space available to the Solr process.
I see on our 6.0 instances it was set to 2GB, and on our 6.6 instances it's set
to 16GB.

Thanks,
Susheel


Re: How to form a boolean query such that it wont evaluate the right hand side if it isn't necessary

2018-02-07 Thread Walter Underwood
That doesn’t really make sense for Solr query evaluation. It fetches the 
posting lists for each term, then walks through them evaluating the query 
against all the documents.

It can skip a document as soon as it fails the query, but it still has to fetch 
the posting lists.

So, that feature doesn’t exist because it isn’t useful.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 7, 2018, at 9:50 AM, bbarani  wrote:
> 
> 
> I am trying to figure out a way to form a boolean (||) query in Solr.
> Ideally my expectation is that with the boolean operator ||, if the first
> term is true the second term shouldn't be evaluated.
> 
> q=searchTerms:"testing" || matchStemming:"stemming"
> works the same as
> q=searchTerms:"testing" OR matchStemming:"stemming"
> 
> Is there a way to form a boolean query such that it won't evaluate the right
> hand side if it isn't necessary?
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



MODIFYCOLLECTION via Solrj

2018-02-07 Thread Hendrik Haddorp

Hi,

I'm unable to find how I can do a MODIFYCOLLECTION via Solrj. I would 
like to change the replication factor of a collection but can't find it 
in the Solrj API. Is that not supported?


regards,
Hendrik


How to form a boolean query such that it wont evaluate the right hand side if it isn't necessary

2018-02-07 Thread bbarani

I am trying to figure out a way to form a boolean (||) query in Solr.
Ideally my expectation is that with the boolean operator ||, if the first
term is true the second term shouldn't be evaluated.

q=searchTerms:"testing" || matchStemming:"stemming"
works the same as
q=searchTerms:"testing" OR matchStemming:"stemming"

Is there a way to form a boolean query such that it won't evaluate the right
hand side if it isn't necessary?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Highlighting over date fields

2018-02-07 Thread LOPEZ-CORTES Mariano-ext
Is it possible to use highlighting on date fields?

We've tried, but we get no highlighting response for the field.



Re: Long GC Pauses

2018-02-07 Thread Shawn Heisey

On 2/7/2018 8:08 AM, Shawn Heisey wrote:
If your queries are producing the correct results,
then I will tell you that the "summary" part of your query example is 
quite possibly completely unnecessary


After further thought, I have concluded that this part of what I said is 
probably completely wrong.


But I do not think that you need *any* of the fourteen bare wildcards 
that are in your query.


Thanks,
Shawn


Re: Bi Gram token generation with fuzzy searches

2018-02-07 Thread Walter Underwood
I think you need the feature in SOLR-629 that adds fuzzy to edismax. 

https://issues.apache.org/jira/browse/SOLR-629

The patch on that issue is for Solr 4.x, but I believe someone is working on a 
new patch.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 7, 2018, at 2:10 AM, Emir Arnautović  
> wrote:
> 
> Hi Sravan,
> Edismax has a ’sow’ parameter that causes edismax to pass the query to field
> analysis, but I'm not sure how it will work with fuzzy search. What you might do
> is use the _query_ syntax to separate shingle and non-shingle queries, e.g.
> q=_query({!edismax sow=false qf=title_bigrams}$v) OR _query({!edismax 
> qf=title}$v)&$v=some movie title
> 
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 7 Feb 2018, at 10:55, Sravan Kumar  wrote:
>> 
>> We have the following two fields for our movie title search
>> - title without symbols
>> a custom analyser with WordDelimiterFilterFactory, SynonymFilterFactory and
>> other filters to retain only alpha numeric characters.
>> - title with word bi grams
>> a custom analyser with solr.ShingleFilterFactory to generate "bi gram" word
>> tokens with '_' as separator.
>> 
>> A custom similarity class is used to make tf & idf values as 1.
>> 
>> Edismax query parser is used to perform all searches. Phrase boosting (pf)
>> is also used.
>> 
>> There are a couple of issues while searching:
>> 1>  BiGram field doesn't generate bi grams if the white spaces in the query
>> are not escaped.
>> - For example, if the query is "pursuit of happyness", then bi grams are
>> not generated.  This is due to the fact that the edismax query parser
>> tokenizes based on whitespaces before passing the string to
>> analyser(correct me if I am wrong).
>> But in the case of "pursuit\ of\ happyness", they are, since the string
>> passed to the analyser includes the whitespace.
>> 
>> 2>  Fuzzy search doesn't work in  whitespace escaped queries.
>> Ex: "pursuit~2\ of\ happiness~1"
>> 
>> 3> Edismax's Phrase boosting doesn't work the way it should in
>> non-whitespace escaped fuzzy queries.
>> 
>> If the query is "pursuit~2 of happiness~1" (without escaping whitespaces)
>> 
>> fuzzy queries are generated
>> (title_name:pursuit~2), (title_name:happiness~1) in the parsed query.
>> But,edismax pf (phrase boost) generates query like
>> title_name:"pursuit (2 pursuit2) of happiness (1 happiness1)"
>> This means the analyser got the original query consisting the fuzzy
>> operator for phrase boosting.
>> 
>> 
>> 1> How should whitespaces be handled in the case of filters like
>> solr.ShingleFilterFactory to generate bi grams?
>> 2> If generating bi grams requires whitespaces escaped and fuzzy searches
>> do not, how do we accommodate both of these in a single Solr request, scored
>> together?
>> 
>> 
>> 
>> -
>> -- 
>> Regards,
>> Sravan
> 



Re: Long GC Pauses

2018-02-07 Thread Shawn Heisey

On 2/7/2018 5:20 AM, Maulin Rathod wrote:

Further analyzing issue we found that asking for too many rows (e.g. 
rows=1000) can cause full GC problem as mentioned in below link.


This is because when you ask for 10 million rows, Solr allocates a 
memory structure capable of storing information for each of those 10 
million rows, even before it knows how many documents are actually going 
to match the query.  This problem is mentioned by Toke's blog post you 
linked.


Bare wildcard queries can also lead to big problems with memory churn, 
and are not recommended.  Your query has a bare "*" included in it 
FOURTEEN times, on the summary field.  The name of that field suggests 
that it will have a very high term count.  If it does have a lot of 
unique terms, then ONE wildcard is going to be horrifically slow and 
consume a ton of memory.  Fourteen of them is going to be particularly 
insane.  You've also got a number of wildcards with text prefixes, which 
will not be as bad as the bare wildcard, but can still chew up a lot of 
memory and time.


I suspect that entire "summary" part of your query generation needs to 
be reworked.


You also have wildcards in the part of the query on the "title" field.

The kind of query you do with wildcards can often be completely replaced 
with ngram or edgengram filtering on the analysis chain, usually with a 
big performance advantage.
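A sketch of such an analysis chain (the field type name and gram sizes are
illustrative); the index side produces prefixes so the query side can match
without wildcards:

    <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>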


I suspect that the large number of wildcards is a big part of why your 
example query took 83 seconds to execute.  There may have also been some 
nasty GC pauses during the query.


You still have not answered the questions asked early in this thread 
about memory.  Is the heap 40GB, or is that the total memory installed 
in the server?  What is the total size of all Solr heaps on the machine, 
how much total memory is in the server, and how much index data (both 
document count and disk space size) is being handled by all the Solr 
instances on that machine?


The portion of your GC log that you included is too short, and has also 
been completely mangled by being pasted into an email.  If you want it 
analyzed, we will need a full copy of the logfile, without any 
modification, which likely means you need to use a file sharing site to 
transport it.


What I *can* decipher from your GC log suggests that your heap size may 
actually be 48GB, not 40GB.  After the big GC event, there was a little 
over 17GB of heap memory allocations remaining.  So my first bit of 
advice is to try reducing the heap size.  Without a large GC log, my 
current thought is to make it half what it currently is -- 24GB.  With a 
more extensive GC log, I could make a more accurate recommendation.  My 
second bit of advice would be to eliminate as many wildcards from your 
query as you can.  If your queries are producing the correct results, 
then I will tell you that the "summary" part of your query example is 
quite possibly completely unnecessary, and is going to require a LOT of 
memory.




Additional advice, not really related to the main discussion:

Some of the query looks like it is a perfect candidate for extraction 
into filter queries.  Any portion of the query that is particularly 
static is probably going to benefit from being changed into a filter 
query.  Possible filters you could use based on what I see:


fq=isFolderActive:true
fq=isXref:false
fq=*:* -document_type_id:(3 7)

If your index activity is well-suited for heavy filterCache usage, 
filters like this can achieve incredible speedups.
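The cache itself is configured in solrconfig.xml; a typical entry looks like
this (sizes are illustrative and should be tuned to the actual filter variety):

    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>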


A lot of the other things in the query appear to be for ID values that 
are likely to change for every user.  Query clauses like that are not a 
good fit for filter queries.


Thanks,
Shawn


Re: SynonymGraphFilterFactory with WordDelimiterGraphFilterFactory usage

2018-02-07 Thread Steve Rowe
Thanks Webster,

I created https://issues.apache.org/jira/browse/SOLR-11955 to work on this.

--
Steve
www.lucidworks.com

> On Feb 6, 2018, at 2:47 PM, Webster Homer  wrote:
> 
> I noticed that in some of the current example schemas that are shipped with
> Solr, there is a fieldtype, text_en_splitting, that feeds the output
> of SynonymGraphFilterFactory into WordDelimiterGraphFilterFactory. So if
> this isn't supported, the example should probably be updated or removed.
> 
> On Mon, Feb 5, 2018 at 10:27 AM, Steve Rowe  wrote:
> 
>> Hi Александр,
>> 
>>> On Feb 5, 2018, at 11:19 AM, Shawn Heisey  wrote:
>>> 
>>> There should be no problem with using them together.
>> 
>> I believe Shawn is wrong.
>> 
>> From <.../org/apache/lucene/analysis/synonym/SynonymGraphFilter.html>:
>> 
>>> NOTE: this cannot consume an incoming graph; results will be undefined.
>> 
>> Unfortunately, the ref guide entry for Synonym Graph Filter
>> <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#synonym-graph-filter>
>> doesn’t include a warning about this, but it should, like the warning on
>> Word Delimiter Graph Filter
>> <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#word-delimiter-graph-filter>:
>> 
>>> Note: although this filter produces correct token graphs, it cannot
>> consume an input token graph correctly.
>> 
>> (I’ve just committed a change to the ref guide source to add this also on
>> the Synonym Graph Filter and Managed Synonym Graph Filter entries, to be
>> included in the ref guide for Solr 7.3.)
>> 
>> In short, the combination of the two filters is not supported, because
>> WDGF produces a token graph, which SGF cannot correctly interpret.
>> 
>> Other filters also have this issue, see e.g.
>> <https://issues.apache.org/jira/browse/LUCENE-3475> for ShingleFilter; this
>> issue has gotten some attention recently, and hopefully it will inspire
>> fixes elsewhere.
>> 
>> Patches welcome!
>> 
>> --
>> Steve
>> www.lucidworks.com
>> 
>> 
>>> On Feb 5, 2018, at 11:19 AM, Shawn Heisey  wrote:
>>> 
>>> On 2/5/2018 3:55 AM, Александр Шестак wrote:
 
 Hi, I have a misunderstanding about the usage of SynonymGraphFilterFactory
 and WordDelimiterGraphFilterFactory. Can they be used together?
 
>>> 
>>> There should be no problem with using them together.  But it is always
>>> possible that the behavior will surprise you, while working 100% as
>>> designed.
>>> 
 I have a solr type configured in the next way:

 <fieldType ... autoGeneratePhraseQueries="true">
   <analyzer type="index">
     ...
     <filter class="solr.WordDelimiterGraphFilterFactory"
         generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
         catenateWords="1" catenateNumbers="1" catenateAll="0"
         preserveOriginal="1" protected="protwords_en.txt"/>
     ...
   </analyzer>
   <analyzer type="query">
     ...
     <filter class="solr.WordDelimiterGraphFilterFactory"
         generateWordParts="1" generateNumberParts="1" splitOnNumerics="1"
         catenateWords="0" catenateNumbers="0" catenateAll="0"
         preserveOriginal="1" protected="protwords_en.txt"/>
     ...
     <filter class="solr.SynonymGraphFilterFactory"
         synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
   </analyzer>
 </fieldType>
 
 So at query time it uses SynonymGraphFilterFactory after
 WordDelimiterGraphFilterFactory.
 Synonyms are configured in the next way:
 b=>b,boron
 2=>ii,2

 In the Solr analysis tool, the terms after SGF are shown at positions 3 and 4.
 Is that correct? I thought that they should have had positions 1 and 2.
 
>>> 
>>> What matters is the *relative* positions.  The exact position number
>>> doesn't matter much.  Something new that the Graph implementations use
>>> is the position length.  That feature is necessary for multi-term
>>> synonyms to function correctly in phrase queries.
>>> 
>>> In your analysis screenshot, WDGF creates three tokens.  The two tokens
>>> created by splitting the input are at positions 1 and 2, which I think
>>> is 100% as expected.  It also sets the positionLength of the first term
>>> to 2, probably because it has split that term into 2 additional terms.
>>> 
>>> Then the SGF takes those last two terms and expands them.  Each of the
>>> synonyms is at the same position as the original term, and the relative
>>> positions of the two synonym pairs have not changed -- the second one is
>>> still one higher than the first.  I think the reason that SGF moves the
>>> positions two higher is because the positionLength on the "b2" term is
>>> 2, previously set by WDGF.  Someone with more knowledge about the Graph
>>> implementations may have to speak up as to whether this behavior is
>> correct.
>>> 
>>> Because the relative positions of the split terms don't change when SGF
>>> runs, I think this is probably working as designed.
>>> 
>>> Thanks,
>>> Shawn
>> 
>> 
> 
> -- 
> 
> 

Re: Multiple consecutive wildcards (**) causes Out-of-memory

2018-02-07 Thread Bjarke Buur Mortensen
Just to clarify:
I can only cause this to happen when using the complexphrase query parser.
Lucene/dismax/edismax parsers are not affected.

2018-02-07 13:09 GMT+01:00 Bjarke Buur Mortensen :

> Hello list,
>
> Whenever I make a query for ** (two consecutive wildcards) it causes my
> Solr to run out of memory.
>
> http://localhost:8983/solr/select?q=**
>
> Why is that?
>
> I realize that this is not a reasonable query to make, but the system
> supports input from users, and they might by accident input this query,
> causing Solr to crash.
>
> I should add that we use the complexphrase query parser as the default
> parser on a Solr 7.1.
>
> Can anyone repro this or explain what causes the problem?
>
> Thanks in advance,
> Bjarke Buur Mortensen
> Senior Software Engineer
> Eluence A/S
>


Re: Long GC Pauses

2018-02-07 Thread Ere Maijala

Hi Maulin,

I'll chime in by referring to my own findings when analyzing Solr 
performance: 
https://www.mail-archive.com/solr-user@lucene.apache.org/msg135857.html


Yonik has a good article about paging: 
http://yonik.com/solr/paging-and-deep-paging/. While it's about deep 
paging, the same issues affect searches that request too many rows. We 
have switched to cursorMark when we need a large set of records and 
found that it not only consumes less memory but also performs much better 
when fetching on the order of 100 000 records. We also limited paging in our 
UI to the first 100 000 hits, which avoids the intermittent issues we saw 
when a user tried to e.g. go to the last result page of a really large result 
set. Unfortunately there's no mechanism that would allow one to present 
a large result set completely while still allowing jumps to a random page.
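In case it is useful, cursor-based fetching looks roughly like this in SolrJ
(a minimal sketch; the client setup, core name, and "id" uniqueKey field are
assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkDemo {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(1000);
            // cursorMark requires a sort that includes the uniqueKey as tie-breaker
            q.setSort(SolrQuery.SortClause.asc("id"));
            String cursor = CursorMarkParams.CURSOR_MARK_START; // "*"
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = client.query(q);
                rsp.getResults().forEach(doc -> { /* process one page of docs */ });
                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) {
                    break; // cursor did not advance: all results consumed
                }
                cursor = next;
            }
        }
    }
}

Unlike start/rows paging, each iteration stays cheap no matter how deep into
the result set it is, which is exactly what helps with the memory pressure
described above.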


--Ere

Maulin Rathod wrote on 7.2.2018 at 14.20:

Hi Erick,



Thanks for your response. The pauses show up in the Solr GC logs (see the GC 
log excerpt below, which shows a 138.4138211 sec pause).



It seems some bad query was causing high memory allocation.

Analyzing the issue further, we found that asking for too many rows (e.g. 
rows=1000) can cause a full GC problem, as mentioned in the links below. We 
had some queries asking for too many rows, so for now we have reduced the 
number of rows. Since that change we no longer see any large GC pause (the 
max GC pause is 3-4 seconds). So the issue seems resolved for now, but please 
let me know if you have any other suggestions that could help us understand 
the root cause.


https://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/





Solr GC Log

==



2018-01-04T12:13:40.346+: 140501.570: 
[CMS-concurrent-abortable-preclean-start]

{Heap before GC invocations=8909 (full 1001):

par new generation   total 10922688K, used 4187699K [0x8000, 
0x0003a000, 0x0003a000)

   eden space 8738176K,  41% used [0x8000, 0x00015fb1f6d8, 
0x00029556)

   from space 2184512K,  23% used [0x00029556, 0x0002b53cd8b8, 
0x00031aab)

   to   space 2184512K,   0% used [0x00031aab, 0x00031aab, 
0x0003a000)

concurrent mark-sweep generation total 39321600K, used 38082444K 
[0x0003a000, 0x000d, 0x000d)

Metaspace   used 46676K, capacity 47372K, committed 50136K, reserved 51200K

2018-01-04T12:13:40.393+: 140501.618: [GC (Allocation Failure) 140501.618: 
[CMS2018-01-04T12:13:40.534+: 140501.766: 
[CMS-concurrent-abortable-preclean: 0.149/0.196 secs] [Times: user=0.41 
sys=0.00, real=0.19 secs]

  (concurrent mode failure): 38082444K->18127660K(39321600K), 138.3709254 secs] 
42270144K->18127660K(50244288K), [Metaspace: 46676K->46676K(51200K)], 138.3746435 
secs] [Times: user=12.75 sys=22.89, real=138.38 secs]

Heap after GC invocations=8910 (full 1002):

par new generation   total 10922688K, used 0K [0x8000, 
0x0003a000, 0x0003a000)

   eden space 8738176K,   0% used [0x8000, 0x8000, 
0x00029556)

   from space 2184512K,   0% used [0x00029556, 0x00029556, 
0x00031aab)

   to   space 2184512K,   0% used [0x00031aab, 0x00031aab, 
0x0003a000)

concurrent mark-sweep generation total 39321600K, used 18127660K 
[0x0003a000, 0x000d, 0x000d)

Metaspace   used 46676K, capacity 47372K, committed 50136K, reserved 51200K

}

2018-01-04T12:15:58.772+: 140639.993: Total time for which application 
threads were stopped: 138.4138211 seconds, Stopping threads took: 0.0369886 
seconds







Regards,



Maulin





-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 31 January 2018 22:47
To: solr-user 
Subject: Re: Long GC Pauses



Just to double check, when you say you're seeing 60-200 sec GC pauses, are you 
looking at the GC logs (or using some kind of monitor), or is that the time it 
takes the query to respond to the client? Because a single GC pause that long 
on 40G is unusual no matter what. Another take on Jason's question: for all 
the JVMs you're running, how much _total_ heap is allocated?

And how much physical memory is on the box? I generally start with _at least_ 
half the memory left to the OS.



These are fairly horrible. What generates such queries?

AND * AND *



Best,

Erick







On Wed, Jan 31, 2018 at 7:28 AM, Jason Gerlowski wrote:

> Hi Maulin,
>
> To clarify, when you said "...allocated 40 GB RAM to each shard."
> above, I'm going to assume you meant "to each node" instead.  If you
> actually did mean "to each shard" above, please correct me and anyone
> who chimes in afterward.
>
> Firstly, it's really hard to even take guesses about

RE: Long GC Pauses

2018-02-07 Thread Maulin Rathod
Hi Erick,



Thanks for your response. The pauses show up in the Solr GC logs (see the GC 
log excerpt below, which shows a 138.4138211 sec pause).



It seems some bad query was causing high memory allocation.

Analyzing the issue further, we found that asking for too many rows (e.g. 
rows=1000) can cause a full GC problem, as mentioned in the links below. We 
had some queries asking for too many rows, so for now we have reduced the 
number of rows. Since that change we no longer see any large GC pause (the 
max GC pause is 3-4 seconds). So the issue seems resolved for now, but please 
let me know if you have any other suggestions that could help us understand 
the root cause.


https://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/





Solr GC Log

==



2018-01-04T12:13:40.346+: 140501.570: 
[CMS-concurrent-abortable-preclean-start]

{Heap before GC invocations=8909 (full 1001):

par new generation   total 10922688K, used 4187699K [0x8000, 
0x0003a000, 0x0003a000)

  eden space 8738176K,  41% used [0x8000, 0x00015fb1f6d8, 
0x00029556)

  from space 2184512K,  23% used [0x00029556, 0x0002b53cd8b8, 
0x00031aab)

  to   space 2184512K,   0% used [0x00031aab, 0x00031aab, 
0x0003a000)

concurrent mark-sweep generation total 39321600K, used 38082444K 
[0x0003a000, 0x000d, 0x000d)

Metaspace   used 46676K, capacity 47372K, committed 50136K, reserved 51200K

2018-01-04T12:13:40.393+: 140501.618: [GC (Allocation Failure) 140501.618: 
[CMS2018-01-04T12:13:40.534+: 140501.766: 
[CMS-concurrent-abortable-preclean: 0.149/0.196 secs] [Times: user=0.41 
sys=0.00, real=0.19 secs]

 (concurrent mode failure): 38082444K->18127660K(39321600K), 138.3709254 secs] 
42270144K->18127660K(50244288K), [Metaspace: 46676K->46676K(51200K)], 
138.3746435 secs] [Times: user=12.75 sys=22.89, real=138.38 secs]

Heap after GC invocations=8910 (full 1002):

par new generation   total 10922688K, used 0K [0x8000, 
0x0003a000, 0x0003a000)

  eden space 8738176K,   0% used [0x8000, 0x8000, 
0x00029556)

  from space 2184512K,   0% used [0x00029556, 0x00029556, 
0x00031aab)

  to   space 2184512K,   0% used [0x00031aab, 0x00031aab, 
0x0003a000)

concurrent mark-sweep generation total 39321600K, used 18127660K 
[0x0003a000, 0x000d, 0x000d)

Metaspace   used 46676K, capacity 47372K, committed 50136K, reserved 51200K

}

2018-01-04T12:15:58.772+: 140639.993: Total time for which application 
threads were stopped: 138.4138211 seconds, Stopping threads took: 0.0369886 
seconds







Regards,



Maulin





-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 31 January 2018 22:47
To: solr-user 
Subject: Re: Long GC Pauses



Just to double check, when you say you're seeing 60-200 sec GC pauses, are you 
looking at the GC logs (or using some kind of monitor), or is that the time it 
takes the query to respond to the client? Because a single GC pause that long 
on 40G is unusual no matter what. Another take on Jason's question: for all 
the JVMs you're running, how much _total_ heap is allocated?

And how much physical memory is on the box? I generally start with _at least_ 
half the memory left to the OS.



These are fairly horrible. What generates such queries?

AND * AND *



Best,

Erick







On Wed, Jan 31, 2018 at 7:28 AM, Jason Gerlowski wrote:

> Hi Maulin,
>
> To clarify, when you said "...allocated 40 GB RAM to each shard."
> above, I'm going to assume you meant "to each node" instead.  If you
> actually did mean "to each shard" above, please correct me and anyone
> who chimes in afterward.
>
> Firstly, it's really hard to even take guesses about potential causes
> or remediations without more details about your load characteristics
> (average/peak QPS, index size, average document size, etc.).  If no
> one gives any satisfactory advice, please consider uploading
> additional details to help us help you.
>
> Secondly, I don't know anything about the load characteristics you're
> putting on your Solr cluster, but I'm curious whether you've
> experimented with lower RAM settings.  Generally speaking, the more
> RAM you have, the longer your GC pauses are likely to be (even with
> the tuning that various GC settings provide).  If you can get away
> with giving the Solr process less RAM, you should see your GC pauses
> shrink.  Was 40GB chosen after some trial-and-error experimentation, or is it
> something you could investigate?
>
> For a bit more overview on this, see this slightly outdated (but still
> useful) wiki page:
>

Multiple consecutive wildcards (**) causes Out-of-memory

2018-02-07 Thread Bjarke Buur Mortensen
Hello list,

Whenever I make a query for ** (two consecutive wildcards) it causes my
Solr to run out of memory.

http://localhost:8983/solr/select?q=**

Why is that?

I realize that this is not a reasonable query to make, but the system
supports input from users, and they might enter this query by accident,
causing Solr to crash.

I should add that we use the complexphrase query parser as the default
parser on a Solr 7.1.

Can anyone repro this or explain what causes the problem?

Thanks in advance,
Bjarke Buur Mortensen
Senior Software Engineer
Eluence A/S


Re: Bi Gram token generation with fuzzy searches

2018-02-07 Thread Emir Arnautović
Hi Sravan,
Edismax has a 'sow' (split-on-whitespace) parameter; with sow=false the whole 
query string is passed to field analysis, but I am not sure how it will work 
with fuzzy search. What you might do is use the _query_ syntax to separate the 
shingle and non-shingle queries, e.g.
q=_query_:"{!edismax sow=false qf=title_bigrams v=$qq}" OR _query_:"{!edismax qf=title v=$qq}"&qq=some movie title
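For illustration, the same request from SolrJ might look like this (a minimal
sketch; the core name, field names, and client setup are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShingleQueryDemo {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/movies").build()) {
            SolrQuery q = new SolrQuery();
            // One nested query keeps the string whole (sow=false) so the shingle
            // field sees it; the other is a plain edismax query. Both read the
            // user input from the $qq parameter.
            q.set("q", "_query_:\"{!edismax sow=false qf=title_bigrams v=$qq}\""
                     + " OR _query_:\"{!edismax qf=title v=$qq}\"");
            q.set("qq", "pursuit of happyness");
            QueryResponse rsp = client.query(q);
            System.out.println(rsp.getResults().getNumFound() + " hits");
        }
    }
}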

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Feb 2018, at 10:55, Sravan Kumar  wrote:
> 
> We have the following two fields for our movie title search
> - title without symbols
> a custom analyser with WordDelimiterFilterFactory, SynonymFilterFactory and
> other filters to retain only alpha numeric characters.
> - title with word bi grams
> a custom analyser with solr.ShingleFilterFactory to generate "bi gram" word
> tokens with '_' as separator.
> 
> A custom similarity class is used to make tf & idf values as 1.
> 
> Edismax query parser is used to perform all searches. Phrase boosting (pf)
> is also used.
> 
> There are a couple of issues while searching:
> 1>  The bigram field doesn't generate bi grams if the whitespace in the query
> is not escaped.
> - For example, if the query is "pursuit of happyness", then bi grams are
> not generated.  This is due to the fact that the edismax query parser
> tokenizes on whitespace before passing the string to the
> analyser (correct me if I am wrong).
> But in the case of "pursuit\ of\ happyness" they are, since the string
> passed to the analyser still contains the whitespace.
> 
> 2>  Fuzzy search doesn't work in whitespace-escaped queries.
> Ex: "pursuit~2\ of\ happiness~1"
> 
> 3> Edismax's phrase boosting doesn't work the way it should in
> non-whitespace-escaped fuzzy queries.
> 
> If the query is "pursuit~2 of happiness~1" (without escaping the whitespace),
> the fuzzy queries (title_name:pursuit~2) and (title_name:happiness~1)
> are generated in the parsed query.
> But edismax pf (phrase boost) generates a query like
> title_name:"pursuit (2 pursuit2) of happiness (1 happiness1)"
> This means the analyser received the original query, fuzzy operators
> included, for phrase boosting.
> 
> 
> 1> How should whitespace be handled for filters like
> solr.ShingleFilterFactory when generating bi grams?
> 2> If generating bi grams requires whitespace to be escaped and fuzzy
> searches require it unescaped, how do we accommodate both in a single Solr
> request and have them scored together?
> 
> 
> 
> -
> -- 
> Regards,
> Sravan



Bi Gram token generation with fuzzy searches

2018-02-07 Thread Sravan Kumar
We have the following two fields for our movie title search
- title without symbols
a custom analyser with WordDelimiterFilterFactory, SynonymFilterFactory and
other filters to retain only alpha numeric characters.
- title with word bi grams
a custom analyser with solr.ShingleFilterFactory to generate "bi gram" word
tokens with '_' as separator.

A custom similarity class is used to make tf & idf values as 1.

Edismax query parser is used to perform all searches. Phrase boosting (pf)
is also used.

There are a couple of issues while searching:
1>  The bigram field doesn't generate bi grams if the whitespace in the query
is not escaped.
- For example, if the query is "pursuit of happyness", then bi grams are
not generated.  This is due to the fact that the edismax query parser
tokenizes on whitespace before passing the string to the
analyser (correct me if I am wrong).
But in the case of "pursuit\ of\ happyness" they are, since the string
passed to the analyser still contains the whitespace.

2>  Fuzzy search doesn't work in whitespace-escaped queries.
Ex: "pursuit~2\ of\ happiness~1"

3> Edismax's phrase boosting doesn't work the way it should in
non-whitespace-escaped fuzzy queries.

If the query is "pursuit~2 of happiness~1" (without escaping the whitespace),
the fuzzy queries (title_name:pursuit~2) and (title_name:happiness~1)
are generated in the parsed query.
But edismax pf (phrase boost) generates a query like
title_name:"pursuit (2 pursuit2) of happiness (1 happiness1)"
This means the analyser received the original query, fuzzy operators
included, for phrase boosting.


1> How should whitespace be handled for filters like
solr.ShingleFilterFactory when generating bi grams?
2> If generating bi grams requires whitespace to be escaped and fuzzy
searches require it unescaped, how do we accommodate both in a single Solr
request and have them scored together?
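To make the whitespace dependency concrete, here is a minimal Lucene-level
sketch of a '_'-separated bigram chain (stock Lucene classes; the input
string is just an example). The shingle filter can only combine tokens that
actually reach it within one analyzed string:

import java.io.StringReader;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class BigramDemo {
    public static void main(String[] args) throws Exception {
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("pursuit of happyness"));
        ShingleFilter shingles = new ShingleFilter(tokenizer, 2, 2);
        shingles.setTokenSeparator("_");   // join adjacent words with '_'
        shingles.setOutputUnigrams(false); // emit bigrams only
        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
        shingles.reset();
        while (shingles.incrementToken()) {
            System.out.println(term); // pursuit_of, of_happyness
        }
        shingles.end();
        shingles.close();
    }
}

If the query parser pre-splits on whitespace, each word is analyzed
separately and the filter never sees two tokens in a row, so no shingles come
out; that is why the escaped and unescaped forms behave differently.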



-
-- 
Regards,
Sravan


Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-07 Thread mmb1234
> Maybe this is the issue:
https://github.com/eclipse/jetty.project/issues/2169

Looks like it is the issue. (I've redacted IP addresses below for security
reasons)

solr [ /opt/solr ]$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq
-c
   8425 CLOSE_WAIT -
 92 ESTABLISHED -
  1 FIN_WAIT2 -
  1 Foreign Address
  6 LISTEN -
712 TIME_WAIT -
  1 established)

solr [ /opt/solr ]$ echo "run -b
org.eclipse.jetty.server:context=HTTP/1.1@63e2203c,id=0,type=serverconnector
dump " | java
 -jar jmxterm-1.0.0-uber.jar -l localhost:18983 -v silent -n > jettyJmx.out

solr [ /opt/solr ]$ netstat -anp | grep CLOSE_WAIT | head -n1
tcp1  0 10.xxx.x.xx:898310.xxx.x.xxx:53873 
CLOSE_WAIT  1/java

solr [ /opt/solr ]$ grep "10.xxx.x.xxx:53873" jettyJmx.out
 |   |   +-
SelectionKey@5ee4ef9f{i=0}->SelectChannelEndPoint@69feb476{/10.xxx.x.xxx:53873<->8983,Open,in,out,-,-,1712/5000,HttpConnection@c93d7fa}{io=0/0,kio=0,kro=1}

solr [ /opt/solr ]$ cat jettyJmx.out | grep 8983 | grep
SelectChannelEndPoint | grep "{io=0/0,kio=0,kro=1}" | wc -l
8220





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html