Re: optimize boosting parameters

2020-12-07 Thread Radu Gheorghe
Hi Derek,

Ah, then my reply was completely off :)

I don’t really see a better way. Maybe other than changing termfreq to field, 
if the numeric field has docValues? That may be faster, but I don’t know for 
sure.
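
For example (just a sketch, untested; please double-check against the function
query docs for 7.7), the last bf could become something like:

bf=if(eq(field(P_SupplierRanking),3),0.3,if(eq(field(P_SupplierRanking),4),0.6,if(eq(field(P_SupplierRanking),5),0.9,if(eq(field(P_SupplierRanking),6),1.2,0))))

That assumes P_SupplierRanking is a single-valued numeric field with docValues
and that the eq()/field() comparison functions are available in your version.
Another option would be to split it into four map() calls on the field, like
you already do for the other fields:

bf=map(P_SupplierRanking,3,3,0.3,0)
bf=map(P_SupplierRanking,4,4,0.6,0)
bf=map(P_SupplierRanking,5,5,0.9,0)
bf=map(P_SupplierRanking,6,6,1.2,0)

Since only one of these can match for a given document, the summed boost should
come out the same.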

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

> On 8 Dec 2020, at 06:17, Derek Poh  wrote:
> 
> Hi Radu
> 
> Apologies for not making myself clear.
> 
> I would like to know if there is a more simple or efficient way to craft the 
> boosting parameters based on the requirements.
> 
> For example, I am using 'if', 'map' and 'termfreq' functions in the bf 
> parameters.
> 
> Is there a more efficient or simpler function that can be used instead? Or a 
> way to craft the 'formula' more efficiently?
> 
> On 7/12/2020 10:05 pm, Radu Gheorghe wrote:
>> Hi Derek,
>> 
>> It’s hard to tell whether your boosts can be made better without knowing 
>> your data and what users expect of it. Which is a problem in itself.
>> 
>> I would suggest gathering judgements, like if a user queries for X, what doc 
>> IDs do you expect to get back?
>> 
>> Once you have enough of these judgements, you can experiment with boosts and 
>> see how the query results change. There are measures such as nDCG (
>> https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG
>> ) that can help you measure that per query, and you can average this score 
>> across all your judgements to get an overall measure of how well you’re 
>> doing.
>> 
>> Or even better, you can have something like Quaerite play with boost values 
>> for you:
>> 
>> https://github.com/tballison/quaerite/blob/main/quaerite-examples/README.md#genetic-algorithms-ga-runga
>> 
>> 
>> Best regards,
>> Radu
>> --
>> Sematext Cloud - Full Stack Observability - 
>> https://sematext.com
>> 
>> Solr and Elasticsearch Consulting, Training and Production Support
>> 
>> 
>>> On 7 Dec 2020, at 10:51, Derek Poh 
>>>  wrote:
>>> 
>>> Hi
>>> 
>>> I have added the following boosting requirements to the search query of a 
>>> page. Feedback from the monitoring team is that the overall response time of 
>>> the page has increased since then.
>>> I am trying to find out if the added boosting parameters (below) could have 
>>> contributed to the increase.
>>> 
>>> The boosting is working as per requirements.
>>> 
>>> May I know if the implemented boosting parameters can be enhanced or 
>>> optimized further?
>>> Hopefully to improve on the response time of the query and the page.
>>> 
>>> Requirements:
>>> 1. If P_SupplierResponseRate is:
>>>a. 3, boost by 0.4
>>>b. 2, boost by 0.2
>>> 
>>> 2. If P_SupplierResponseTime is:
>>>a. 4, boost by 0.4
>>>b. 3, boost by 0.2
>>> 
>>> 3. If P_MWSScore is:
>>>a. between 80-100, boost by 1.6
>>>b. between 60-79, boost by 0.8
>>> 
>>> 4. If P_SupplierRanking is:
>>>a. 3, boost by 0.3
>>>b. 4, boost by 0.6
>>>c. 5, boost by 0.9
>>>b. 6, boost by 1.2
>>> 
>>> Boosting parameters implemented:
>>> bf=map(P_SupplierResponseRate,3,3,0.4,0)
>>> bf=map(P_SupplierResponseRate,2,2,0.2,0)
>>> 
>>> bf=map(P_SupplierResponseTime,4,4,0.4,0)
>>> bf=map(P_SupplierResponseTime,3,3,0.2,0)
>>> 
>>> bf=map(P_MWSScore,80,100,1.6,0)
>>> bf=map(P_MWSScore,60,79,0.8,0)
>>> 
>>> bf=if(termfreq(P_SupplierRanking,3),0.3,if(termfreq(P_SupplierRanking,4),0.6,if(termfreq(P_SupplierRanking,5),0.9,if(termfreq(P_SupplierRanking,6),1.2,0))))
>>> 
>>> 
>>> I am using Solr 7.7.2
>>> 
>>> --
>>> CONFIDENTIALITY NOTICE 
>>> This e-mail (including any attachments) may contain confidential and/or 
>>> privileged information. If you are not the intended recipient or have 
>>> received this e-mail in error, please inform the sender immediately and 
>>> delete this e-mail (including any attachments) from your computer, and you 
>>> must not use, disclose to anyone else or copy this e-mail (including any 
>>> attachments), whether in whole or in part. 
>>> This e-mail and any reply to it may be monitored for security, legal, 
>>> regulatory compliance and/or other appropriate reasons.
>>> 
>>> 
>> 
> 
> 
> 
> 
> 
> -- 
> CONFIDENTIALITY NOTICE 
> 
> This e-mail (including any attachments) may contain confidential and/or 
> privileged information. If you are not the intended recipient or have 
> received this e-mail in error, please inform the sender immediately and 
> delete this e-mail (including any attachments) from your computer, and you 
> must not use, disclose to anyone else or copy this e-mail (including any 
> attachments), whether in whole or in part. 
> 
> This e-mail and any reply to it may be monitored for security, legal, 
> regulatory compliance and/or other appropriate reasons.
> 
> 



Re: doc for REQUESTSTATUS

2020-12-07 Thread Radu Gheorghe
Hi Elisabeth,

This is the doc for REQUESTSTATUS, apparently only request ID is supported 
indeed: 
https://lucene.apache.org/solr/guide/8_6/coreadmin-api.html#coreadmin-requeststatus

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

> On 7 Dec 2020, at 12:07, elisabeth benoit  wrote:
> 
> Hello all,
> 
> I'm unloading a core with async param then sending query with request id
> 
> http://localhost:8983/solr/admin/cores?action=UNLOAD&core=expressions&async=1001
> http://localhost:8983/solr/admin/cores?action=REQUESTSTATUS&requestid=1001
> 
> 
> and would like to find a piece of doc with all possible values of
> REQUESTSTATUS. Could someone give me a pointer to the doc? I just can't find
> it using a search engine.
> 
> Best regards,
> Elisabeth



Re: optimize boosting parameters

2020-12-07 Thread Radu Gheorghe
Hi Derek,

It’s hard to tell whether your boosts can be made better without knowing your 
data and what users expect of it. Which is a problem in itself.

I would suggest gathering judgements, like if a user queries for X, what doc 
IDs do you expect to get back?

Once you have enough of these judgements, you can experiment with boosts and 
see how the query results change. There are measures such as nDCG 
(https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG) that 
can help you measure that per query, and you can average this score across all 
your judgements to get an overall measure of how well you’re doing.
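
A quick made-up example: if the top three results for a query have graded
relevance 2, 3, 0, while the ideal ordering would have been 3, 2, 0, then
DCG = 2/log2(2) + 3/log2(3) + 0/log2(4) ≈ 3.89, the ideal DCG ≈ 3/log2(2) +
2/log2(3) ≈ 4.26, and nDCG ≈ 3.89/4.26 ≈ 0.91.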

Or even better, you can have something like Quaerite play with boost values for 
you:
https://github.com/tballison/quaerite/blob/main/quaerite-examples/README.md#genetic-algorithms-ga-runga

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

> On 7 Dec 2020, at 10:51, Derek Poh  wrote:
> 
> Hi
> 
> I have added the following boosting requirements to the search query of a 
> page. Feedback from the monitoring team is that the overall response time of 
> the page has increased since then.
> I am trying to find out if the added boosting parameters (below) could have 
> contributed to the increase.
> 
> The boosting is working as per requirements.
> 
> May I know if the implemented boosting parameters can be enhanced or 
> optimized further?
> Hopefully to improve on the response time of the query and the page.
> 
> Requirements:
> 1. If P_SupplierResponseRate is:
>a. 3, boost by 0.4
>b. 2, boost by 0.2
> 
> 2. If P_SupplierResponseTime is:
>a. 4, boost by 0.4
>b. 3, boost by 0.2
> 
> 3. If P_MWSScore is:
>a. between 80-100, boost by 1.6
>b. between 60-79, boost by 0.8
> 
> 4. If P_SupplierRanking is:
>a. 3, boost by 0.3
>b. 4, boost by 0.6
>c. 5, boost by 0.9
>b. 6, boost by 1.2
> 
> Boosting parameters implemented:
> bf=map(P_SupplierResponseRate,3,3,0.4,0)
> bf=map(P_SupplierResponseRate,2,2,0.2,0)
> 
> bf=map(P_SupplierResponseTime,4,4,0.4,0)
> bf=map(P_SupplierResponseTime,3,3,0.2,0)
> 
> bf=map(P_MWSScore,80,100,1.6,0)
> bf=map(P_MWSScore,60,79,0.8,0)
> 
> bf=if(termfreq(P_SupplierRanking,3),0.3,if(termfreq(P_SupplierRanking,4),0.6,if(termfreq(P_SupplierRanking,5),0.9,if(termfreq(P_SupplierRanking,6),1.2,0))))
> 
> 
> I am using Solr 7.7.2
> 
> --
> CONFIDENTIALITY NOTICE 
> This e-mail (including any attachments) may contain confidential and/or 
> privileged information. If you are not the intended recipient or have 
> received this e-mail in error, please inform the sender immediately and 
> delete this e-mail (including any attachments) from your computer, and you 
> must not use, disclose to anyone else or copy this e-mail (including any 
> attachments), whether in whole or in part. 
> This e-mail and any reply to it may be monitored for security, legal, 
> regulatory compliance and/or other appropriate reasons.
> 



Re: Proximity Search with phrases

2020-12-03 Thread Radu Gheorghe
Hi Mark,

I don’t really get your use-case. Maybe you can provide another example?

In either case, maybe the surround query parser would help? 
https://lucene.apache.org/solr/guide/8_4/other-parsers.html#surround-query-parser

Or span queries in general via the XML query parser? 
https://lucene.apache.org/solr/guide/8_4/other-parsers.html#xml-query-parser
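
For the surround parser, something along these lines might work (an untested
sketch; adjust the field name and distances to your schema):

q={!surround df=my_text_field}10N(W(word1, word2), W(word3, word4))

Here W() builds an ordered span (so it behaves like a phrase) and 10N wraps the
two phrases in an unordered span with a maximum distance of 10. Note that the
surround parser doesn't analyze the query terms, so you'd typically lowercase
them yourself.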

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

> On 27 Nov 2020, at 14:25, Mark R  wrote:
> 
> Use Case: Is it possible to perform a proximity search using phrases for 
> example: "phrase 1" with 10 words of "phrase 2"
> 
> SOLR Version: 8.4.1
> 
> Query using: "(\"word1 word2\"(\"word3 word4\")"~10
> 
> While this returns results, it seems to be evaluating the words against each other.
> 
> Are stop words removed when querying? I assume yes.
> 
> Thanks in advance
> 
> Mark
> 
> 



Re: Shard Lock

2020-12-03 Thread Radu Gheorghe
Wild shot here: two Solr instances started on the same data directory?

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

> On 1 Dec 2020, at 06:25, sambasivarao giddaluri 
>  wrote:
> 
> When checking in /opt/solr/volumes/data/cores/, both 
> k04o95kz_shard2_replica_n10 and k04o95kz_shard3_replica_n16 replicas are not 
> present; no idea how they got deleted.
> 
> On Mon, Nov 30, 2020 at 4:13 PM sambasivarao giddaluri 
>  wrote:
> Hi All,
> We are getting below exception from Solr where 3 zk with 3 solr nodes and 3 
> replicas. It was working fine and we got this exception unexpectedly.
> 
>   • k04o95kz_shard2_replica_n10: 
> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
> Index dir 
> '/opt/solr/volumes/data/cores/k04o95kz_shard2_replica_n10/data/index.20201126040543992'
>  of core 'k04o95kz_shard2_replica_n10' is already locked. The most likely 
> cause is another Solr server (or another solr core in this server) also 
> configured to use this directory; other possible causes may be specific to 
> lockType: native
>   • k04o95kz_shard3_replica_n16: 
> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
> Index dir 
> '/opt/solr/volumes/data/cores/k04o95kz_shard3_replica_n16/data/index.20201126040544142'
>  of core 'k04o95kz_shard3_replica_n16' is already locked. The most likely 
> cause is another Solr server (or another solr core in this server) also 
> configured to use this directory; other possible causes may be specific to 
> lockType: native
> 
> 
> 
> 
> 
> 
> Any advice
> 
> Thanks 
> sam



Re: Facet to part of search results

2020-12-03 Thread Radu Gheorghe


> On 3 Dec 2020, at 20:18, Shawn Heisey  wrote:
> 
> On 12/3/2020 9:55 AM, Jae Joo wrote:
>> Is there any way to apply facet to the partial search result?
>> For ex, we have 10m return by "dog" and like to apply facet to first 10K.
>> Possible?
> 
> The point of facets is to provide accurate numbers.
> 
> What would it mean to only apply to the first 10K?  If there are 10 million 
> documents in the query results that contain "dog" then the facet should say 
> 10 million, not 10K.  I do not understand what you're trying to do.
> 

Maybe sampling? I’m not aware of a built-in way to do that. But you could index 
a random float between, say, 0 and 100 and then filter out a sample by filtering 
for numbers in a small sub-range of that field.
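
For example (a sketch; "sample_f" is a made-up field holding a random float
between 0 and 100, assigned at index time):

q=dog&fq=sample_f:[0 TO 1]&facet=true&facet.field=category

That would compute facets (and results) over roughly 1% of the documents
matching "dog", so the counts become estimates you'd have to scale up.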

Re: facet.method=smart

2020-12-03 Thread Radu Gheorghe
Hi Jae,

No, it’s not smarter than explicitly defining, for example enum for a 
low-cardinality field.

Think of “smart” as a default path, and explicit definitions as some “hints”. 
You can see that default path in this function: 
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/search/facet/FacetField.java#L74

Note that I’ve added a PR with a bit more explanations for the “hints” here: 
https://github.com/apache/lucene-solr/pull/2057 But if you’re missing some 
info, please feel free to comment (here or there), I could add some more info.

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

> On 30 Nov 2020, at 22:46, Jae Joo  wrote:
> 
> Is "smart" really smarter than one explicitly defined?
> 
> For "emun" type, would it be faster to define facet.method=enum than smart?
> 
> Jae



What do you usually look for in Solr logs?

2020-11-26 Thread Radu Gheorghe
Hello Solr users,

I've recently added a Solr logs integration to our logging SaaS and I wanted
to ask what would be useful that I may have missed.

First, there are some regexes to parse Solr Logs here:
https://github.com/sematext/logagent-js/blob/master/patterns.yml#L140-L379
BTW, Logagent is open-source, so you can use it not only with our SaaS, but
with your own Elasticsearch (at least until there's an output plugin for
Solr :D) or you can simply reuse the regexes.

I've tested parsing on a number of Solr versions and they should work on
7.x and later. But what about the extracted fields? Do you see something
interesting that's not already there?

On the frontend side, what do you find useful? Because we have some default
dashboards and we can always add/update them. Let me quickly tell you
what's already included.

The first one you'd see, called Overview, is quite self-explanatory:


The one I use most is called Queries. All widgets have a filter by (+path:*
-path:admin -path:update) and most of them are built around QTime and Hits
metrics:


I won't spam you with more screenshots, but there are other dashboards, too:
- *Errors*: quite similar to Overview, but with a (severity:error) filter
- *Zookeeper*: I wanted to catch ZK communication here, so the filter is
(thread:zk* OR class:zk* OR class.raw:o.a.z.*) and the "usual" breakdown
widgets e.g. by host or severity
- *Overseer*: similar to Zookeeper, but filtering on (thread:overseer* OR
class:ShardLeaderElection* OR class:SyncStrategy)
- *Audit*: this is for Solr Audit Logging, filtering on
(class.raw:"o.a.s.s.SolrLogAuditLoggerPlugin") Besides the usual
breakdowns, there are audit-specific widgets, like request types over time
- *Start & stop*: I'm trying to catch startup- and shutdown-specific logs
here. The best filters I found are (thread:main) for startup and
(thread:ShutdownMonitor OR thread:closeThreadPool* OR
thread:coreCloseExecutor*) for shutdown. Do you see better criteria?

Last but not least, what do you usually look for in Solr logs? Anything
that we don't cover in the above? Any feedback will be very much
appreciated!

Thanks and best regards,
Radu
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


Re: Is metrics api enabled by default in solr 8.2

2020-10-14 Thread Radu Gheorghe
Hi,

Yes, the API works by default on 8.2: 
https://lucene.apache.org/solr/guide/8_2/metrics-reporting.html

I don’t know of a way to disable it, but the configuration is described in the 
page above (i.e. on how to configure different reporters).
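
For example, something like this should return the query-related metrics for
all cores (the parameters are described on that page):

curl 'http://localhost:8983/solr/admin/metrics?group=core&prefix=QUERY'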

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

> On 14 Oct 2020, at 06:05, yaswanth kumar  wrote:
> 
> Can I get some info on where to disable or enable metrics api on solr 8.2 ?
> 
> I believe it's enabled by default on Solr 8.2; where can I check the
> configuration? And also, how can I disable it if I want to?
> 
> -- 
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com



Re: how to config split authentication methods -- BasicAuth for WebUI, & none (or SSL client) for client connections?

2020-10-14 Thread Radu Gheorghe
Hello,

If you enable authentication, this will work on your HTTP port. Solr won’t make 
a difference on whether the request comes from the Web UI or Dovecot.

I guess the workaround could be to put the web UI behind a proxy like NGINX and 
have authentication there?

But if anyone can have direct HTTP access to Solr, then it’s not really secure.
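
If you go the proxy route, a minimal NGINX sketch could look like this
(assuming Solr listens only on localhost and NGINX is the only way in; paths
and ports are just examples):

location /solr/ {
    auth_basic "Solr admin";
    auth_basic_user_file /etc/nginx/htpasswd;
    proxy_pass http://127.0.0.1:8983;
}

Dovecot could then either go through the same proxy with credentials, or
through a separate location block without auth_basic, restricted by IP
(allow/deny).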

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

> On 12 Oct 2020, at 05:11, PGNet Dev  wrote:
> 
>  I'm running,
> 
>   solr -version
>   8.6.3
> 
> on
> 
>   uname -rm
>   5.8.13-200.fc32.x86_64 x86_64
> 
>   grep _NAME /etc/os-release
>   PRETTY_NAME="Fedora 32 (Server Edition)"
>   CPE_NAME="cpe:/o:fedoraproject:fedora:32"
> 
> with
> 
>   java -version
>   openjdk version "15" 2020-09-15
>   OpenJDK Runtime Environment 20.9 (build 15+36)
>   OpenJDK 64-Bit Server VM 20.9 (build 15+36, mixed mode, sharing)
> 
> solr's configured for SSL usage.  both client search connections and WebUI 
> access work OK, with EC certs in use
> 
>   SOLR_SSL_KEY_STORE="/srv/ssl/solr.server.EC.pfx"
>   SOLR_SSL_TRUST_STORE="/srv/ssl/solr.server.EC.pfx"
> 
> If I enable BasicAuth, adding
> 
>   /security.json
>   {
>   "authentication":{
>   "blockUnknown": true,
>   "class":"solr.BasicAuthPlugin",
>   "credentials":{
>   "myuser":"jO... Fe..."
> 
>   },
>   "realm":"Solr REALM",
>   "forwardCredentials": false
>   },
>   "authorization":{
>   "class":"solr.RuleBasedAuthorizationPlugin",
>   "permissions":[{
>   "name":"security-edit",
>   "role":"admin"
>   }],
>   "user-role":{
>   "solr":"admin"
>   }
>   }
>   }
> 
> as expected, WebUI requires/accepts valid credentials for access.
> 
> BUT ... client connections, e.g. from a mail MUA using dovecot's fts solr 
> plugin, immediately fail, returning "401 Unauthorized".
> 
> How can solr authentication be configured to split method -- using BasicAuth 
> for WebUI access ONLY, and still allowing the client connections?
> 
> Eventually, I want those client connections to require solr-side SSL client 
> auth.
> Atm, I'd just like to get it working -- _with_ the BasicAuth WebUI protection 
> in place.
> 



Re: Solr Document Update issues

2020-10-14 Thread Radu Gheorghe
Hi,

I wouldn’t commit on every update. The general practice is to use autoCommit 
and autoSoftCommit, so this work is done in background depending on how quickly 
you want data persisted and available for search: 
https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-Commits
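
For example, in solrconfig.xml (the values are just a starting point, tune
them to your durability and visibility needs):

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>

That would persist data at most every minute and make it visible to searches
within about five seconds, without the client having to send commit=true on
every request.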

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

> On 13 Oct 2020, at 07:18, aparana bhatt  wrote:
> 
> Hi ,
> 
> I have been facing a lot of issues using the solr update functionality.
> A multitude of requests respond with
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> * org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://192.169.33.86/solr/cms
> : Expected mime type
> application/octet-stream but got text/html.  "-//IETF//DTD HTML 2.0//EN">502 Proxy
> ErrorProxy ErrorThe proxy server received
> an invalid^Mresponse from an upstream server.^MThe proxy server could
> not handle the request  href="/solr/cms/update">POST /solr/cms/update.Reason:
> Error reading from remote server*
> 
> Used solr version -> 6.5.0  Type -> master/Slave config
> Error in solr.log ->
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> *2020-10-07 05:43:50.639 WARN  (qtp142261320-27831) [   x:cms]
> o.a.s.c.SolrCore slow: [cms]  webapp=/solr path=/update
> params={waitSearcher=true&commit=true&softCommit=false&wt=javabin&version=2}
> status=0 QTime=443272020-10-07 05:43:50.640 WARN  (qtp142261320-27837) [
> x:cms] o.a.s.u.DefaultSolrCoreState WARNING - Dangerous
> interruptjava.lang.InterruptedExceptionat
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
>  at
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
>  at
> org.apache.solr.update.DefaultSolrCoreState.lock(DefaultSolrCoreState.java:167)
>  at
> org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:112)
>  at
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:618)
>  at
> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:93)
>  at
> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:68)
>  at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1895)
>  at
> org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1872)
>  at
> org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:68)
>  at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:72)
>  at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:2440)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
>  at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
>  at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:347)
>  at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:298)
>  at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
>  at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>  at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>  at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>  at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>  at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)*
> 
> 
> 
> The rate of update query on master solr is 6 request per min only .
> Solr also slows down and search becomes really slow .
> I don't understand where to look for an issue .
> I have tried to check various parameters in update request , if I do
> softcommit=true and commit =false then updates do not reflect , so i have
> set below options ->
> 
> UpdateRequest updateRequest = new UpdateRequest();
>  updateRequest.setAction( UpdateRequest.ACTION.COMMIT, true,
> true);
> waitsearcher=true ,
> waitflush=true .
> 
> I do not get what is causing the issue . Kindly suggest .
> Also I could not find much help from internet about given issues as well .
> 
> 
> -- 
> Regards
> 
> Aparana Bhatt



Re: Question regarding replica leader

2020-07-19 Thread Radu Gheorghe
Hi Vishal,

I think that’s true, yes. The cluster has a leader (overseer), but this 
particular shard doesn’t seem to have a leader (yet). Logs should give you some 
pointers about why this happens (it may be, for example, that each replica is 
waiting for the other to become a leader, because each missed some updates).
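
If it helps, I'd start by grepping for the election-related classes on the
nodes hosting that shard, something like:

grep -iE 'ShardLeaderElection|SyncStrategy' solr.log

(class names are from memory, adjust as needed).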

Best regards,
Radu
--
Sematext Cloud - Full Stack Observability - https://sematext.com
Solr and Elasticsearch Consulting, Training and Production Support

> On 20 Jul 2020, at 04:17, Vishal Vaibhav  wrote:
> 
> Hi any pointers on this ?
> 
> On Wed, 15 Jul 2020 at 11:13 AM, Vishal Vaibhav  wrote:
> 
>> Hi Solr folks,
>> 
>> I am using solr cloud 8.4.1 . I am using*
>> `/solr/admin/collections?action=CLUSTERSTATUS`*. Hitting this endpoint I
>> get a list of replicas in which one is active but neither of them is
>> leader. Something like this
>> 
>> "core_node72": {"core": "rules_shard1_replica_n71","base_url": "node3,"
>> node_name": "node3 base url","state": "active","type": "NRT","
>> force_set_state": "false"},"core_node74": {"core":
>> "rules_shard1_replica_n73","base_url": "node1","node_name":
>> "node1_base_url","state": "down","type": "NRT","force_set_state": "false"}
>> }}},"router": {"name": "compositeId"},"maxShardsPerNode": "1","
>> autoAddReplicas": "false","nrtReplicas": "1","tlogReplicas": "0","
>> znodeVersion": 276,"configName": "rules"}},"live_nodes": ["node1","node2",
>> "node3","node4"] And when i see overseer status
>> solr/admin/collections?action=OVERSEERSTATUS I get response like this which
>> shows node 3 as leaderresponseHeader": {"status": 0,"QTime": 66},"leader
>> ": "node 3","overseer_queue_size": 0,"overseer_work_queue_size": 0,"
>> overseer_collection_queue_size": 2,"overseer_operations": ["addreplica",
>> 
>> Does it mean the cluster is having a leader node but there is no leader
>> replica as of now? And why the leader election is not happening?
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 



Re: Log4J Logging to Http

2020-06-17 Thread Radu Gheorghe
Hi Florian,

I don’t know the answer to your specific question, but I would like to suggest 
a different approach. Excuse me in advance, I usually hate suggesting different 
approaches.

The reason why I suggest a different approach is because logging via HTTP can 
be blocking a thread e.g. until a timeout. I wrote a bit more here: 
https://sematext.com/blog/logging-libraries-vs-log-shippers/

In your particular case, I would let Solr log normally (to stdout) and have 
something pick the logs up from the Docker socket. I’m used to Logagent (see 
https://sematext.com/docs/logagent/installation-docker/) which can parse Solr 
logs out of the box (see 
https://github.com/sematext/logagent-js/blob/master/patterns.yml#L140). But 
there are other options, like Fluentd or Logstash.

Best regards,
Radu

> On 17 Jun 2020, at 10:33, Krönert Florian  wrote:
> 
> Hello everyone,
>  
> We want to log our queries to a HTTP endpoint and tried configuring our log4j 
> settings accordingly.
> We are using Solr inside Docker with the official Solr image (version 
> solr:8.3.1).
>  
> As soon as we add a http appender, we receive errors on startup and solr 
> fails to start completely:
>  
> 2020-06-17T07:06:54.976390509Z DEBUG StatusLogger 
> JsonLayout$Builder(propertiesAsList="null", objectMessageAsJsonObject="null", 
> ={}, eventEol="null", compact="null", complete="null", locationInfo="null", 
> properties="true", includeStacktrace="null", stacktraceAsString="null", 
> includeNullDelimiter="null", ={}, charset="null", footerSerializer=null, 
> headerSerializer=null, Configuration(/var/solr/log4j2.xml), footer="null", 
> header="null")
> 2020-06-17T07:06:55.121825039Z 2020-06-17 
> 07:06:55.104:WARN:oejw.WebAppContext:main: Failed startup of context 
> o.e.j.w.WebAppContext@611df6e3{/solr,file:///opt/solr-8.3.1/server/solr-webapp/webapp/,UNAVAILABLE}{/opt/solr-8.3.1/server/solr-webapp/webapp}
> 2020-06-17T07:06:55.121856339Z java.lang.NoClassDefFoundError: Failed to 
> initialize Apache Solr: Could not find necessary SLF4j logging jars. If using 
> Jetty, the SLF4j logging jars need to go in the jetty lib/ext directory. For 
> other containers, the corresponding directory should be used. For more 
> information, see: http://wiki.apache.org/solr/SolrLogging
>  
> It seems that only when using the http appender these jars are needed, 
> without this appender everything works.
> Can you point me in the right direction, where I need to place the needed 
> jars? Seems to be a little special since I only access the /var/solr mount 
> directly, the rest is running in docker.
>  
> Kind Regards,
>  
> Florian Krönert 
> Senior Software Developer
> 
> 
> 
> ORBIS AG | Planckstraße 10 | D-88677 Markdorf
> Phone: +49 7544 50398 21 | Mobile: +49 162 3065972 | E-Mail: 
> florian.kroen...@orbis.de 
> www.orbis.de
> 
> 
> 
> 
> Registered Seat: Saarbrücken
> Commercial Register Court: Amtsgericht Saarbrücken, HRB 12022
> Board of Management: Thomas Gard (Chairman), Michael Jung, Stefan Mailänder, 
> Frank Schmelzer 
> Chairman of the Supervisory Board: Ulrich Holzer
>
>  
> 
>  
> 
> 



Re: How to determine why solr stops running?

2020-06-08 Thread Radu Gheorghe
I assumed it does, based on your description. If you installed it as a service 
(systemd), then systemd can start the service again if it fails. (something 
like Restart=always in your [Service] definition).
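
For reference, that auto-restart behaviour would come from something like this
in the unit file (just an example, check your own service definition):

[Service]
Restart=always
RestartSec=5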

But if it doesn’t restart automatically now, I think it’s easier to 
troubleshoot: just check the last logs after it crashed.

Best regards,
Radu

https://sematext.com

> On 8 Jun 2020, at 16:28, Ryan W  wrote:
> 
> "If Solr auto-restarts"
> 
> It doesn't auto-restart.  Is there some auto-restart functionality?  I'm
> not aware of that.
> 
> On Mon, Jun 8, 2020 at 7:10 AM Radu Gheorghe 
> wrote:
> 
>> Hi Ryan,
>> 
>> If Solr auto-restarts, I suppose it's systemd doing that. When it restarts
>> the Solr service, systemd should log this (maybe somethibg like: journalctl
>> --no-pager | grep -i solr).
>> 
>> Then you can go in your Solr logs and check what happened right before that
>> time. Also, check system logs for what happened before Solr was restarted.
>> 
>> Best regards,
>> Radu
>> 
>> https://sematext.com/
>> 
>> joi, 4 iun. 2020, 19:24 Ryan W  a scris:
>> 
>>> Happened again today. Solr stopped running. Apache hasn't stopped in 10
>>> days, so this is not due to a server reboot.
>>> 
>>> Solr is not being run with the oom-killer.  And when I grep for ERROR in
>>> the logs, there is nothing from today.
>>> 
>>> On Mon, May 18, 2020 at 3:15 PM James Greene <
>> ja...@jamesaustingreene.com>
>>> wrote:
>>> 
>>>> I usually do a combination of grepping for ERROR in solr logs and
>>> checking
>>>> journalctl to see if an external program may have killed the process.
>>>> 
>>>> 
>>>> Cheers,
>>>> 
>>>> /
>>>> *   James Austin Greene
>>>> *  www.jamesaustingreene.com
>>>> *  336-lol-nerd
>>>> /
>>>> 
>>>> 
>>>> On Mon, May 18, 2020 at 1:39 PM Erick Erickson <
>> erickerick...@gmail.com>
>>>> wrote:
>>>> 
>>>>> ps aux | grep solr
>>>>> 
>>>>> on a *.nix system will show you all the runtime parameters.
>>>>> 
>>>>>> On May 18, 2020, at 12:46 PM, Ryan W  wrote:
>>>>>> 
>>>>>> Is there a config file containing the start params?  I run solr
>>> like...
>>>>>> 
>>>>>> bin/solr start
>>>>>> 
>>>>>> I have not seen anything in the logs that seems informative. When I
>>>> grep
>>>>> in
>>>>>> the logs directory for 'memory', I see nothing besides a couple
>>> entries
>>>>>> like...
>>>>>> 
>>>>>> 2020-05-14 13:05:56.155 INFO  (main) [   ]
>>>>> o.a.s.h.a.MetricsHistoryHandler
>>>>>> No .system collection, keeping metrics history in memory.
>>>>>> 
>>>>>> I don't know what that entry means, though the date does roughly
>>>> coincide
>>>>>> with the last time solr stopped running.
>>>>>> 
>>>>>> Thank you.
>>>>>> 
>>>>>> 
>>>>>> On Mon, May 18, 2020 at 12:00 PM Erick Erickson <
>>>> erickerick...@gmail.com
>>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Probably, but check that you are running with the oom-killer,
>> it'll
>>> be
>>>>> in
>>>>>>> your start params.
>>>>>>> 
>>>>>>> But absent that, something external will be the culprit, Solr
>>> doesn't
>>>>> stop
>>>>>>> by itself. Do look at the Solr log once things stop, it should
>> show
>>> if
>>>>>>> someone or something stopped it.
>>>>>>> 
>>>>>>> On Mon, May 18, 2020, 10:43 Ryan W  wrote:
>>>>>>> 
>>>>>>>> I don't see any log file with "oom" in the file name.  Does that
>>> mean
>>>>>>> there
>>>>>>>> hasn't been an out-of-memory issue?  Thanks.
>>>>>>>> 
>>>>>>>> On Thu, May 14, 2020 at 10:05 AM James Greene <
>>>>>>> ja...@jamesaustingreene.com
>>>>>>>>> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Check the log for for an OOM crash.  Fatal exceptions will be in
>>> the
>>>>>>> main
>>>>>>>>> solr log and out of memory errors will be in their own -oom log.
>>>>>>>>> 
>>>>>>>>> I've encountered quite a few solr crashes and usually it's when
>>>>>>> there's a
>>>>>>>>> threshold of concurrent users and/or indexing happening.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, May 14, 2020, 9:23 AM Ryan W  wrote:
>>>>>>>>> 
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> I manage a site where solr has stopped running a couple times
>> in
>>>> the
>>>>>>>> past
>>>>>>>>>> week. The server hasn't been rebooted, so that's not the
>> reason.
>>>>>>> What
>>>>>>>>> else
>>>>>>>>>> causes solr to stop running?  How can I investigate why this is
>>>>>>>>> happening?
>>>>>>>>>> 
>>>>>>>>>> Thank you,
>>>>>>>>>> Ryan
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 



Re: Getting to grips with auto-scaling

2020-06-08 Thread Radu Gheorghe
Hi Tom,

To your last two questions, I'd like to vent an alternative design: have
dedicated "hot" and "warm" nodes. That is, 2020+lists will go to the hot
tier, and 2019, 2018,2017+lists go to the warm tier.

Then you can scale the hot tier based on your query load. For the warm
tier, I assume there will be less need for scaling, and if there is, I guess
it's less important for shards of each index to be perfectly balanced (so a
simple "make sure cores are evenly distributed" should be enough).

Granted, this design isn't as flexible as the one you suggested, but it's
simpler. So simple that I've seen it done without autoscaling (just a few
scripts from when you add nodes in each tier).

Best regards,
Radu

https://sematext.com

vin., 5 iun. 2020, 21:59 Tom Evans  a
scris:

> Hi
>
> I'm trying to get a handle on the newer auto-scaling features in Solr.
> We're in the process of upgrading an older SolrCloud cluster from 5.5
> to 8.5, and re-architecture it slightly to improve performance and
> automate operations.
>
> If I boil it down slightly, currently we have two collections, "items"
> and "lists". Both collections have just one shard. We publish new data
> to "items" once each day, and our users search and do analysis on
> them, whilst "lists" contains NRT user-specified collections of ids
> from items, which we join to from "items" in order to allow them to
> restrict their searches/analysis to just docs in their curated lists.
>
> Most of our searches have specific date ranges in them, usually only
> from the last 3 years or so, but sometimes we need to do searches
> across all the data. With the new setup, we want to:
>
> * shard by date (year) to make the hottest data available in smaller shards
> * have more nodes with these shards than we do of the older data.
> * be able to add/remove nodes predictably based upon our clients
> (predictable) query load
> * use TLOG for "items" and NRT for "lists", to avoid unnecessary
> indexing load for "items" and have NRT for "lists".
> * spread cores across two AZ
>
> With that in mind, I came up with a bunch of simplified rules for
> testing, with just 4 shards for "items":
>
> * "lists" collection has one NRT replica on each node
> * "items" collection shard 2020 has one TLOG replica on each node
> * "items" collection shard 2019 has one TLOG replica on 75% of nodes
> * "items" collection shards 2018 and 2017 each have one TLOG replica
> on 50% of nodes
> * all shards have at least 2 replicas if number of nodes > 1
> * no node should have 2 replicas of the same shard
> * number of cores should be balanced across nodes
>
> Eg, with 1 node, I want to see this topology:
> A: items: 2020, 2019, 2018, 2017 + lists
>
> with 2 nodes:
> A: items: 2020, 2019, 2018, 2017 + lists
> B: items: 2020, 2019, 2018, 2017 + lists
>
> and if I add two more nodes:
> A: items: 2020, 2019, 2018 + lists
> B: items: 2020, 2019, 2017 + lists
> C: items: 2020, 2019, 2017 + lists
> D: items: 2020, 2018 + lists
>
> To the questions:
>
> * The type of replica created when nodeAdded is triggered can't be set
> per collection. Either everything gets NRT or everything gets TLOG.
> Even if I specify nrtReplicas=0 when creating a collection, nodeAdded
> will add NRT replicas if configured that way.
> * I'm having difficulty expressing these rules in terms of a policy -
> I can't seem to figure out a way to specify the number of replicas for
> a shard based upon the total number of nodes.
> * Is this beyond the current scope of autoscaling triggers/policies?
> Should I instead use the trigger with a custom plugin action (or to
> trigger a web hook) to be a bit more intelligent?
> * Am I wasting my time trying to ensure there are more replicas of the
> hotter shards than the colder shards? It seems to add a lot of
> complexity - should I just instead think that they aren't getting
> queried much, so won't be using up cache space that the hot shards
> will be using. Disk space is pretty cheap after all (total size for
> "items" + "lists" is under 60GB).
>
> Cheers
>
> Tom
>


Re: How to determine why solr stops running?

2020-06-08 Thread Radu Gheorghe
Hi Ryan,

If Solr auto-restarts, I suppose it's systemd doing that. When it restarts
the Solr service, systemd should log this (maybe somethibg like: journalctl
--no-pager | grep -i solr).

Then you can go in your Solr logs and check what happened right before that
time. Also, check system logs for what happened before Solr was restarted.

Best regards,
Radu

https://sematext.com/

joi, 4 iun. 2020, 19:24 Ryan W  a scris:

> Happened again today. Solr stopped running. Apache hasn't stopped in 10
> days, so this is not due to a server reboot.
>
> Solr is not being run with the oom-killer.  And when I grep for ERROR in
> the logs, there is nothing from today.
>
> On Mon, May 18, 2020 at 3:15 PM James Greene 
> wrote:
>
> > I usually do a combination of grepping for ERROR in solr logs and
> checking
> > journalctl to see if an external program may have killed the process.
> >
> >
> > Cheers,
> >
> > /
> > *   James Austin Greene
> > *  www.jamesaustingreene.com
> > *  336-lol-nerd
> > /
> >
> >
> > On Mon, May 18, 2020 at 1:39 PM Erick Erickson 
> > wrote:
> >
> > > ps aux | grep solr
> > >
> > > on a *.nix system will show you all the runtime parameters.
> > >
> > > > On May 18, 2020, at 12:46 PM, Ryan W  wrote:
> > > >
> > > > Is there a config file containing the start params?  I run solr
> like...
> > > >
> > > > bin/solr start
> > > >
> > > > I have not seen anything in the logs that seems informative. When I
> > grep
> > > in
> > > > the logs directory for 'memory', I see nothing besides a couple
> entries
> > > > like...
> > > >
> > > > 2020-05-14 13:05:56.155 INFO  (main) [   ]
> > > o.a.s.h.a.MetricsHistoryHandler
> > > > No .system collection, keeping metrics history in memory.
> > > >
> > > > I don't know what that entry means, though the date does roughly
> > coincide
> > > > with the last time solr stopped running.
> > > >
> > > > Thank you.
> > > >
> > > >
> > > > On Mon, May 18, 2020 at 12:00 PM Erick Erickson <
> > erickerick...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > >> Probably, but check that you are running with the oom-killer, it'll
> be
> > > in
> > > >> your start params.
> > > >>
> > > >> But absent that, something external will be the culprit, Solr
> doesn't
> > > stop
> > > >> by itself. Do look at the Solr log once things stop, it should show
> if
> > > >> someone or something stopped it.
> > > >>
> > > >> On Mon, May 18, 2020, 10:43 Ryan W  wrote:
> > > >>
> > > >>> I don't see any log file with "oom" in the file name.  Does that
> mean
> > > >> there
> > > >>> hasn't been an out-of-memory issue?  Thanks.
> > > >>>
> > > >>> On Thu, May 14, 2020 at 10:05 AM James Greene <
> > > >> ja...@jamesaustingreene.com
> > > 
> > > >>> wrote:
> > > >>>
> > >  Check the log for for an OOM crash.  Fatal exceptions will be in
> the
> > > >> main
> > >  solr log and out of memory errors will be in their own -oom log.
> > > 
> > >  I've encountered quite a few solr crashes and usually it's when
> > > >> there's a
> > >  threshold of concurrent users and/or indexing happening.
> > > 
> > > 
> > > 
> > >  On Thu, May 14, 2020, 9:23 AM Ryan W  wrote:
> > > 
> > > > Hi all,
> > > >
> > > > I manage a site where solr has stopped running a couple times in
> > the
> > > >>> past
> > > > week. The server hasn't been rebooted, so that's not the reason.
> > > >> What
> > >  else
> > > > causes solr to stop running?  How can I investigate why this is
> > >  happening?
> > > >
> > > > Thank you,
> > > > Ryan
> > > >
> > > 
> > > >>>
> > > >>
> > >
> > >
> >
>


Re: Shingles behavior

2020-05-21 Thread Radu Gheorghe
Turns out, it’s down to setting enableGraphQueries=false in the field 
definition. I completely missed that :(
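
For reference, that's an attribute on the field type (a sketch; the analyzer
details are omitted and the type name is made up):

<fieldType name="text_shingles" class="solr.TextField" enableGraphQueries="false">
  ...
</fieldType>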

> On 21 May 2020, at 07:49, Radu Gheorghe  wrote:
> 
> Hi Alex, long time no see :)
> 
> I tried with sow, and that basically invalidates query-time shingles (it only 
> matches mona OR lisa OR smile).
> 
> I'm using shingles at both index and query time as a substitute for pf2 and 
> pf3: the more shingles I match, the more relevant the document. Also, higher 
> order shingles naturally get lower frequencies, meaning they get a "natural" 
> boost.
> 
> Best regards,
> Radu
> 
> joi, 21 mai 2020, 00:28 Alexandre Rafalovitch  a scris:
> Did you try it with 'sow' parameter both ways? I am not sure I fully
> understand the question, especially with shingling on both passes
> rather than just indexing one. But at least it is something to try and
> is one of the difference areas between Solr and ES.
> 
> Regards,
>Alex.
> 
> On Tue, 19 May 2020 at 05:59, Radu Gheorghe  
> wrote:
> >
> > Hello Solr users,
> >
> > I’m quite puzzled about how shingles work. The way tokens are analysed 
> > looks fine to me, but the query seems too restrictive.
> >
> > Here’s the sample use-case. I have three documents:
> >
> > mona lisa smile
> > mona lisa
> > mona
> >
> > I have a shingle filter set up like this (both index- and query-time):
> >
> > > <filter class="solr.ShingleFilterFactory" … maxShingleSize="4"/>
> >
> > When I query for “Mona Lisa smile” (no quotes), I expect to get all three 
> > documents back, in that order. Because the first document matches all the 
> > terms:
> >
> > mona
> > mona lisa
> > mona lisa smile
> > lisa
> > lisa smile
> > smile
> >
> > And the second one matches only some, and the third document only matches 
> > one.
> >
> > Instead, I only get the first document back. That’s because the query 
> > expects all the “words” to match:
> >
> > > "parsedquery":"+DisjunctionMaxQuery+shingle_field:mona 
> > > +usage_query_view_tags:lisa +shingle_field:smile) (+shingle_field:mona 
> > > +shingle_field:lisa smile) (+shingle_field:mona lisa 
> > > +shingle_field:smile) shingle_field:mona lisa smile)))”,
> >
> > The query above is generated by the Edismax query parser, when I’m using 
> > “shingle_field” as “df”.
> >
> > Is there a way to get “any of the words” to match? I’ve tried all the 
> > options I can think of:
> > - different query parsers
> > - q.OP=OR
> > - mm=0 (or 1 or 0% or 10% or…)
> >
> > Nothing seems to change the parsed query from the above.
> >
> > I’ve compared this to the behaviour of Elasticsearch. There, I get “OR” by 
> > default, and minimum_should_match works as expected. The only difference I 
> > see between the two, on the analysis side, is that tokens start at 0 in 
> > Elasticsearch and at 1 in Solr. I doubt that’s the problem, because I see 
> > that the default “text_en”, for example, also starts at position 1.
> >
> > Is it just a bug that mm doesn’t work in the context of shingles? Or is 
> > there a workaround?
> >
> > Thanks and best regards,
> > Radu



Re: Shingles behavior

2020-05-20 Thread Radu Gheorghe
Hi Alex, long time no see :)

I tried with sow, and that basically invalidates query-time shingles (it
only matches mona OR lisa OR smile).

I'm using shingles at both index and query time as a substitute for pf2 and
pf3: the more shingles I match, the more relevant the document. Also,
higher order shingles naturally get lower frequencies, meaning they get a
"natural" boost.

Best regards,
Radu

joi, 21 mai 2020, 00:28 Alexandre Rafalovitch  a scris:

> Did you try it with 'sow' parameter both ways? I am not sure I fully
> understand the question, especially with shingling on both passes
> rather than just indexing one. But at least it is something to try and
> is one of the difference areas between Solr and ES.
>
> Regards,
>    Alex.
>
> On Tue, 19 May 2020 at 05:59, Radu Gheorghe 
> wrote:
> >
> > Hello Solr users,
> >
> > I’m quite puzzled about how shingles work. The way tokens are analysed
> looks fine to me, but the query seems too restrictive.
> >
> > Here’s the sample use-case. I have three documents:
> >
> > mona lisa smile
> > mona lisa
> > mona
> >
> > I have a shingle filter set up like this (both index- and query-time):
> >
> > <filter class="solr.ShingleFilterFactory" … maxShingleSize="4"/>
> >
> > When I query for “Mona Lisa smile” (no quotes), I expect to get all
> three documents back, in that order. Because the first document matches all
> the terms:
> >
> > mona
> > mona lisa
> > mona lisa smile
> > lisa
> > lisa smile
> > smile
> >
> > And the second one matches only some, and the third document only
> matches one.
> >
> > Instead, I only get the first document back. That’s because the query
> expects all the “words” to match:
> >
> > > "parsedquery":"+DisjunctionMaxQuery+shingle_field:mona
> +usage_query_view_tags:lisa +shingle_field:smile) (+shingle_field:mona
> +shingle_field:lisa smile) (+shingle_field:mona lisa +shingle_field:smile)
> shingle_field:mona lisa smile)))”,
> >
> > The query above is generated by the Edismax query parser, when I’m using
> “shingle_field” as “df”.
> >
> > Is there a way to get “any of the words” to match? I’ve tried all the
> options I can think of:
> > - different query parsers
> > - q.OP=OR
> > - mm=0 (or 1 or 0% or 10% or…)
> >
> > Nothing seems to change the parsed query from the above.
> >
> > I’ve compared this to the behaviour of Elasticsearch. There, I get “OR”
> by default, and minimum_should_match works as expected. The only difference
> I see between the two, on the analysis side, is that tokens start at 0 in
> Elasticsearch and at 1 in Solr. I doubt that’s the problem, because I see
> that the default “text_en”, for example, also starts at position 1.
> >
> > Is it just a bug that mm doesn’t work in the context of shingles? Or is
> there a workaround?
> >
> > Thanks and best regards,
> > Radu
>


Shingles behavior

2020-05-19 Thread Radu Gheorghe
Hello Solr users,

I’m quite puzzled about how shingles work. The way tokens are analysed looks 
fine to me, but the query seems too restrictive.

Here’s the sample use-case. I have three documents:

mona lisa smile
mona lisa
mona

I have a shingle filter set up like this (both index- and query-time):

> <filter class="solr.ShingleFilterFactory" … maxShingleSize="4"/>

When I query for “Mona Lisa smile” (no quotes), I expect to get all three 
documents back, in that order. Because the first document matches all the terms:

mona
mona lisa
mona lisa smile
lisa
lisa smile
smile

And the second one matches only some, and the third document only matches one.

Instead, I only get the first document back. That’s because the query expects 
all the “words” to match:

> "parsedquery":"+DisjunctionMaxQuery+shingle_field:mona 
> +usage_query_view_tags:lisa +shingle_field:smile) (+shingle_field:mona 
> +shingle_field:lisa smile) (+shingle_field:mona lisa +shingle_field:smile) 
> shingle_field:mona lisa smile)))”,

The query above is generated by the Edismax query parser, when I’m using 
“shingle_field” as “df”.

Is there a way to get “any of the words” to match? I’ve tried all the options I 
can think of:
- different query parsers
- q.OP=OR
- mm=0 (or 1 or 0% or 10% or…)

Nothing seems to change the parsed query from the above.

I’ve compared this to the behaviour of Elasticsearch. There, I get “OR” by 
default, and minimum_should_match works as expected. The only difference I see 
between the two, on the analysis side, is that tokens start at 0 in 
Elasticsearch and at 1 in Solr. I doubt that’s the problem, because I see that 
the default “text_en”, for example, also starts at position 1.

Is it just a bug that mm doesn’t work in the context of shingles? Or is there a 
workaround?

Thanks and best regards,
Radu

Re: Which Solr metrics do you find important?

2020-04-29 Thread Radu Gheorghe
Thanks Matthew and Walter. OK, so you both use the clusterstatus output in
your regular monitoring. This seems to be missing from what we have now (we
collect everything else you mentioned, like response time percentiles, disk
IO, etc). So I guess clusterstatus deserves a priority bump :)

Best regards,
Radu
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Tue, Apr 28, 2020 at 6:47 PM Walter Underwood 
wrote:

> I also have some Python that pull stuff from clusterstatus and sends it to
> InfluxDB.
>
> We wrote a servlet filter that intercepts requests to Solr and sends
> performance data
> to monitoring. That gives us per-request handler traffic and response time
> percentiles.
>
> Telegraf for CPU, run queue, disk IO, etc.
>
> CloudWatch for load balancer traffic, errors, and healthy host count.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Apr 28, 2020, at 8:00 AM, matthew sporleder 
> wrote:
> >
> > I think clusterstatus is how you find some of that stuff.
> >
> > I wrote this when I was using datadog to supplement what they offered:
> > https://github.com/msporleder/dd-solrcloud/blob/master/solrcloud.py
> > (sorry for crappy python) and it got me most of the monitoring I
> > needed for my particular situation.
> >
> >
> >
> >
> > On Tue, Apr 28, 2020 at 10:52 AM Radu Gheorghe
> >  wrote:
> >>
> >> Thanks a lot, Matthew! OK, so you do care about the size of tlogs. As
> well
> >> as Collections API stuff (clusterstatus, overseerstatus).
> >>
> >> And DIH, I didn't think that these stats would be interesting, but
> surely
> >> they are for people who use DIH :)
> >>
> >> Best regards,
> >> Radu
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >> On Tue, Apr 28, 2020 at 4:17 PM matthew sporleder  >
> >> wrote:
> >>
> >>> size-on-disk of cores, size of tlogs, DIH stats over time, last
> >>> modified date of cores
> >>>
> >>> The most important alert-type things are -- collections in recovery or
> >>> down state, solrcloud election events, various error rates
> >>>
> >>> It's also important to be able to tie these back to aliases so you are
> >>> only monitoring cores you care about, even if their backing collection
> >>> name changes every so often
> >>>
> >>>
> >>>
> >>> On Tue, Apr 28, 2020 at 7:57 AM Radu Gheorghe
> >>>  wrote:
> >>>>
> >>>> Hi fellow Solr users,
> >>>>
> >>>> I'm looking into improving our Solr monitoring
> >>>> <https://sematext.com/docs/integration/solr/> and I was curious on
> which
> >>>> metrics you consider relevant.
> >>>>
> >>>> From what we currently have, I'm only really missing fieldCache.
> Which we
> >>>> collect, but not show in the UI yet (unless you add a custom chart -
> >>> we'll
> >>>> add it to default soon).
> >>>>
> >>>> You can click on a demo account <https://apps.sematext.com/demo>
> >>> (there's a
> >>>> Solr app there called PH.Prod.Solr7) to see what we already collect,
> but
> >>>> I'll write it here in short:
> >>>> - query rate and latency (you can group per handler, per core, per
> >>>> collection if it's SolrCloud)
> >>>> - index size (number of segments, files...)
> >>>> - indexing: added/deleted docs, commits
> >>>> - caches (size, hit ratio, warmup...)
> >>>> - OS- and JVM-level metrics (from CPU iowait to GC latency and
> everything
> >>>> in between)
> >>>>
> >>>> Anything that we should add?
> >>>>
> >>>> I went through the Metrics API output, and the only significant thing
> I
> >>> can
> >>>> think of is the transaction log. But to be honest I never checked
> those
> >>>> metrics in practice.
> >>>>
> >>>> Or maybe there's something outside the Metrics API that would be
> useful?
> >>> I
> >>>> thought about the breakdown of shards that are up/down/recovering...
> as
> >>>> well as replica types. We plan on adding those, but there's a
> challenge
> >>> in
> >>>> de-duplicating metrics. Because one would install one agent per node,
> and
> >>>> I'm not aware of a way to show only local shards in the Collections
> API
> >>> ->
> >>>> CLUSTERSTATUS.
> >>>>
> >>>> Thanks in advance for any feedback that you may have!
> >>>> Radu
> >>>> --
> >>>> Monitoring - Log Management - Alerting - Anomaly Detection
> >>>> Solr & Elasticsearch Consulting Support Training -
> http://sematext.com/
> >>>
>
>


Re: Which Solr metrics do you find important?

2020-04-28 Thread Radu Gheorghe
Thanks a lot, Matthew! OK, so you do care about the size of tlogs. As well
as Collections API stuff (clusterstatus, overseerstatus).

And DIH, I didn't think that these stats would be interesting, but surely
they are for people who use DIH :)

Best regards,
Radu
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Tue, Apr 28, 2020 at 4:17 PM matthew sporleder 
wrote:

> size-on-disk of cores, size of tlogs, DIH stats over time, last
> modified date of cores
>
> The most important alert-type things are -- collections in recovery or
> down state, solrcloud election events, various error rates
>
> It's also important to be able to tie these back to aliases so you are
> only monitoring cores you care about, even if their backing collection
> name changes every so often
>
>
>
> On Tue, Apr 28, 2020 at 7:57 AM Radu Gheorghe
>  wrote:
> >
> > Hi fellow Solr users,
> >
> > I'm looking into improving our Solr monitoring
> > <https://sematext.com/docs/integration/solr/> and I was curious on which
> > metrics you consider relevant.
> >
> > From what we currently have, I'm only really missing fieldCache. Which we
> > collect, but not show in the UI yet (unless you add a custom chart -
> we'll
> > add it to default soon).
> >
> > You can click on a demo account <https://apps.sematext.com/demo>
> (there's a
> > Solr app there called PH.Prod.Solr7) to see what we already collect, but
> > I'll write it here in short:
> > - query rate and latency (you can group per handler, per core, per
> > collection if it's SolrCloud)
> > - index size (number of segments, files...)
> > - indexing: added/deleted docs, commits
> > - caches (size, hit ratio, warmup...)
> > - OS- and JVM-level metrics (from CPU iowait to GC latency and everything
> > in between)
> >
> > Anything that we should add?
> >
> > I went through the Metrics API output, and the only significant thing I
> can
> > think of is the transaction log. But to be honest I never checked those
> > metrics in practice.
> >
> > Or maybe there's something outside the Metrics API that would be useful?
> I
> > thought about the breakdown of shards that are up/down/recovering... as
> > well as replica types. We plan on adding those, but there's a challenge
> in
> > de-duplicating metrics. Because one would install one agent per node, and
> > I'm not aware of a way to show only local shards in the Collections API
> ->
> > CLUSTERSTATUS.
> >
> > Thanks in advance for any feedback that you may have!
> > Radu
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>


Which Solr metrics do you find important?

2020-04-28 Thread Radu Gheorghe
Hi fellow Solr users,

I'm looking into improving our Solr monitoring
 and I was curious on which
metrics you consider relevant.

From what we currently have, I'm only really missing fieldCache. Which we
collect, but not show in the UI yet (unless you add a custom chart - we'll
add it to default soon).

You can click on a demo account  (there's a
Solr app there called PH.Prod.Solr7) to see what we already collect, but
I'll write it here in short:
- query rate and latency (you can group per handler, per core, per
collection if it's SolrCloud)
- index size (number of segments, files...)
- indexing: added/deleted docs, commits
- caches (size, hit ratio, warmup...)
- OS- and JVM-level metrics (from CPU iowait to GC latency and everything
in between)

Anything that we should add?

I went through the Metrics API output, and the only significant thing I can
think of is the transaction log. But to be honest I never checked those
metrics in practice.
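
For reference, this is the call whose output I went through (default port
assumed); if I remember correctly, the transaction log stats show up under
TLOG.*:

curl "http://localhost:8983/solr/admin/metrics?group=core"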

Or maybe there's something outside the Metrics API that would be useful? I
thought about the breakdown of shards that are up/down/recovering... as
well as replica types. We plan on adding those, but there's a challenge in
de-duplicating metrics. Because one would install one agent per node, and
I'm not aware of a way to show only local shards in the Collections API ->
CLUSTERSTATUS.

Thanks in advance for any feedback that you may have!
Radu
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


Re: Filtered join in Solr?

2020-02-05 Thread Radu Gheorghe
Hi Edward,

Thanks a lot for your reply!

Subquery is what I had in mind, too, for designs 1) and 3) to bring
back the other side of the relationship. Except that I always queried
movies and subqueried users.

If I do it the other way around, like you did, then I'm able to
filter. I can't quite filter by date (didn't manage to get parameter
substitution AND date parsing to work), but if I index dates as UNIX
timestamps (in a plong field), then it works with something like:

q=family:Smith&fl=*,movies:[subquery]&movies.q={!terms f=id
v=$row.watched_movies}&movies.fq={!frange l=$row.born}release_date

The only trouble is, now I can't facet on movie metadata and I can't
get the total number of movies. Or maybe I can and I don't know how?
Besides going through all results (e.g. via a streaming expression).

Still, for getting top N results, this should work. Thanks again for the idea!

Sharding is doable, in a few ways that I can see:
- if data is "normalized", then the "subquery" side has to be a single
shard replicated to all nodes. This won't work if the subquery looks
at movies, because there are tons of movies
- but I'm thinking I could use the new XCJF query parser and have
everything sharded: https://issues.apache.org/jira/browse/SOLR-13749
- or, if I denormalize like in 3) then I could throw everything in one
collection and route by movie ID. Or keep movies and users in separate
collections and still route by movie ID. But this means more expensive
updates/indexing :(

Best regards,
Radu

On Wed, Feb 5, 2020 at 1:00 AM Edward Ribeiro  wrote:
>
> Just for the sake of an imagined scenario, you could use the [subquery] doc
> transformer. A query like the one below:
>
> /select?q=family: Smith&fq=watched_movies:[* TO *]&fl=*,
> movies:[subquery]&movies.q={!terms f=id v=$row.watched_movies}
>
> Would bring back the results below:
>
> { "responseHeader":{
> "status":0,
> "QTime":0,
> "params":{
>   "movies.q":"{!terms f=id v=$row.watched_movies}",
>   "q":"family: Smith",
>   "fl":"*, movies:[subquery]",
>   "fq":"watched_movies:[* TO *]"}},
>   "response":{"numFound":2,"start":0,"docs":[
>   {
> "id":"user_1",
> "name":["Jane"],
> "family":["Smith"],
> "born":["1990-01-01T00:00:00Z"],
> "watched_movies":["1",
>   "3"],
> "_version_":1657646162820202496,
> "movies":{"numFound":2,"start":0,"docs":[
> {
>   "id":"1",
>   "title":["Rambo 1"],
>   "release_date":["1978-01-01T00:00:00Z"],
>   "_version_":1657646123722997760},
> {
>   "id":"3",
>   "title":["300 Spartaans"],
>   "release_date":["2005-01-01T00:00:00Z"],
>   "_version_":1657646123726143488}]
> }},
>   {
> "id":"user_2",
> "title":["Joe"],
> "family":["Smith"],
> "born":["1970-01-01T00:00:00Z"],
> "watched_movies":["2"],
> "_version_":1657646162827542528,
> "movies":{"numFound":1,"start":0,"docs":[
> {
>   "id":"2",
>   "title":["Rambo 5"],
>   "release_date":["1998-01-01T00:00:00Z"],
>   "_version_":1657646123725094912}]
> }}]
>   }}
>
> But I wasn't able to filter on date (I could filter a specific date
> using movies.fq={!term f=release_date v=2005-01-01T00:00:00Z} but not on
> range) nor could I perform facets in the children of the above example. It
> probably only works on a single node too. Finally, there are a couple of
> parameters that can be important but that I omitted for the sake of brevity
> and clarity: movies.limit=100 and movies.sort=release_date DESC
>
>
> Best,
> Edward
>
>
>
> On Tue, Feb 4, 2020 at 11:17 AM Radu Gheorghe 
> wrote:
> >
> > Hello Solr users,
> >
> > How would you design a filtered join scenario?
> >
> > Say I have a bunch of movies (excuse any inaccuracies, this is an
>

Filtered join in Solr?

2020-02-04 Thread Radu Gheorghe
Hello Solr users,

How would you design a filtered join scenario?

Say I have a bunch of movies (excuse any inaccuracies, this is an
imagined scenario):

curl -XPOST -H 'Content-Type: application/json'
'localhost:8983/solr/test/update?commitWithin=1000' --data-binary '
[{
"id": "1",
"title": "Rambo 1",
"release_date": "1978-01-01"
},
{
"id": "2",
"title": "Rambo 5",
"release_date": "1998-01-01"
},
{
"id": "3",
"title": "300 Spartaans",
"release_date": "2005-01-01"
}]'

And a bunch of users of certain families who watched those movies:

curl -XPOST -H 'Content-Type: application/json'
'localhost:8983/solr/test/update?commitWithin=1000' --data-binary '
[{
"id": "user_1",
"name": "Jane",
"family": "Smith",
"born": "1990-01-01",
"watched_movies": ["1", "3"]
},
{
"id": "user_2",
"title": "Joe",
"family": "Smith",
"born": "1970-01-01",
"watched_movies": ["2"]
},
{
"id": "user_3",
"title": "Radu",
"family": "Gheorghe,
"born": "1985-01-01",
"watched_movies": ["1", "2", "3"]
}]'

They don't have to be in the same collection. The important question
is how to get:
- movies watched by user of family Smith
- after they were born
- including the matching users
- I'd like to be able to facet on movie metadata, but I don't need to
facet on user metadata, just to be able to retrieve those fields

The above query should bring back Rambo 5 and 300, with Joe and Jane
respectively. I wouldn't get Rambo 1, because although Jane watched
it, the movie was released before she was born.

Here are some options that I have in mind:
1) using the join query parser (or the newer XCJF) to do the join itself
(see the sketch after this list). Then have some sort of plugin pull the
"born" value of each corresponding user (via some subquery) and filter
movies afterwards. Normalized, but likely painfully slow

2) similar approach with 1), in a streaming expression. Again,
normalized, but slow (we're talking billions of movies, millions of
users). And limited support for facets.

3) have some sort of denormalization. For example, pre-compute
matching users for every movie, then just use join/XCJF to do the
actual join. This makes indexing/updates expensive and potentially
complicated

4) normalization with nested documents. This is best for searches, but
pretty much a no-go for indexing/updates. In this imaginary use-case,
there are binge-watchers who might watch a billion movies in a week,
making us reindex everything
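
For reference, here is a minimal sketch of the plain join from option 1),
assuming movies and users live in collections called "movies" and "users"
(the per-user "born" filter is exactly the part that would need the extra
plugin/subquery):

curl "http://localhost:8983/solr/movies/select" \
  --data-urlencode 'q={!join fromIndex=users from=watched_movies to=id}family:Smith' \
  --data-urlencode 'fl=id,title,release_date'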

Do you see better ways?

Thanks in advance and best regards,
Radu


solr-diagnostics: utility for collecting info from the Solr installation

2020-01-16 Thread Radu Gheorghe
Hello Solr users :)

We just published a small tool that collects diagnostics information:
configs, logs, metrics API output, etc as well as system info (dmesg,
netstat, top...). I thought others might find it interesting, so
here's a short blog post that describes it:
https://sematext.com/blog/solr-diagnostics/

Oh, and by "just published", I mean about two years ago :) It needs
more love, for example it doesn't work on Windows yet (contributions
welcome!), but we've already used it on N clusters and found it
useful.

Please let me know if you have any questions or feedback. Or even
better, please open an issue or submit a PR :)

Thanks and best regards,
Radu
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


Re: Partial results from streaming expressions (i.e. making them "stream")

2018-01-17 Thread Radu Gheorghe
Hi Joel, thanks for your follow-up!

Indeed, that's my experience as well - that the export handler streams
data fast enough. Though now that you mention batches, I'm curious if
that batch size is configurable or something like that.

The higher level issue is that I need to show results to the user as
quickly as possible. For example, imagine a rollup on a relatively
high cardinality field, but with lots of documents per user as well. I
want to show counters as soon as they come up, instead of when I have
all of them.
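
To make this more concrete, here is roughly the kind of expression I have in
mind (the collection and field names are just placeholders):

curl --data-urlencode 'expr=rollup(search(test,
                                          q="*:*",
                                          fl="user_id_s",
                                          sort="user_id_s asc",
                                          qt="/export"),
                                   over="user_id_s",
                                   count(*))' \
    http://localhost:8983/solr/test/stream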

To have results coming in as quickly as possible, I need data to be
streamed quickly (latency-wise) between source and decorators, as well
as from the Solr node receiving the initial request to the client
(UI).

The first part seem to be already happening in my tests (though I've
heard complaints that it doesn't - I'll come back to it if I
misunderstood something), but I can't get partial results to the HTTP
client issuing the original requests.

Does this clarify my issue?

Thanks again and best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Jan 17, 2018 at 8:59 PM, Joel Bernstein  wrote:
> I'm not sure I understand the issue fully. From a streaming standpoint, you
> get real streamed data from the /export handler. When you use the export
> handler the bitset for the search results is materialized in memory, but
> all results are sorted/streamed in batches. This allows the export handler
> to export result sets of any size.
>
> The underlying buffer sizes are really abstracted away and not meant to be
> dealt with.
>
> What's the higher level issue you are concerned with?
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Jan 17, 2018 at 8:54 AM, Radu Gheorghe 
> wrote:
>
>> Hello,
>>
>> I have some updates on this, but it's still not very clear for me how
>> to move forward.
>>
>> The good news is that, between sources and decorators, data seems to
>> be really streamed. I hope I tested this the right way, by simply
>> adding a log message to ReducerStream saying "hey, I got this tuple".
>> Now I have two nodes, nodeA with data and nodeB with a dummy
>> collection. If I hit nodeB's /stream endpoint and ask it for, say, a
>> unique() to wrap the previously mentioned expression (with 1s sleeps),
>> I see a log from ReducerStream every second. Good.
>>
>> Now, the final result (to the client, via curl), only gets to me after
>> N seconds (where N is the number of results I get). I did some more
>> digging on this front, too. Let's assume we have chunked encoding
>> re-enabled (that's a must) and no other change (if I flush() the
>> FastWriter, say, after every tuple, then I get every tuple as it's
>> computed, but I'm trying to explore the buffers). I've noticed the
>> following:
>> - the first response comes after ~64K, then I get chunks of 32K each
>> - at this point, if I set response.setBufferSize() in
>> HttpSolrCall.writeResponse() to a small size (say, 128), I get the
>> first reply after 32K and then 8K chunks
>> - I thought that maybe in this context I could lower BUFSIZE in
>> FastWriter, but that didn't seem to make any change :(
>>
>> That said, I'm not sure it's worth looking into these buffers any
>> deeper, because shrinking them might negatively affect other results
>> (e.g. regular searches or facets). It sounds like the way forward
>> would be that manual flushing, with chunked encoding enabled. I could
>> imagine adding some parameters along the lines of "flush every N tuples
>> or M milliseconds", that could be configured per-request, or at least
>> globally to the /stream handler.
>>
>> What do you think? Would such a patch be welcome, to add these
>> parameters? But it still requires chunked encoding - would reverting
>> SOLR-8669 be a problem? Or maybe there's a more elegant way to enable
>> chunked encoding, maybe only for streams?
>>
>> Best regards,
>> Radu
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>> On Mon, Jan 15, 2018 at 10:58 AM, Radu Gheorghe
>>  wrote:
>> > Hello fellow solr-users!
>> >
>> > Currently, if I do an HTTP request to receive some data via streaming
>> > expressions, like:
>> >
>> > curl --data-urlencode 'expr=search(test,
>> >q="foo_s:*",
>> >f

Re: Partial results from streaming expressions (i.e. making them "stream")

2018-01-17 Thread Radu Gheorghe
Hello,

I have some updates on this, but it's still not very clear for me how
to move forward.

The good news is that, between sources and decorators, data seems to
be really streamed. I hope I tested this the right way, by simply
adding a log message to ReducerStream saying "hey, I got this tuple".
Now I have two nodes, nodeA with data and nodeB with a dummy
collection. If I hit nodeB's /stream endpoint and ask it for, say, a
unique() to wrap the previously mentioned expression (with 1s sleeps),
I see a log from ReducerStream every second. Good.

Now, the final result (to the client, via curl), only gets to me after
N seconds (where N is the number of results I get). I did some more
digging on this front, too. Let's assume we have chunked encoding
re-enabled (that's a must) and no other change (if I flush() the
FastWriter, say, after every tuple, then I get every tuple as it's
computed, but I'm trying to explore the buffers). I've noticed the
following:
- the first response comes after ~64K, then I get chunks of 32K each
- at this point, if I set response.setBufferSize() in
HttpSolrCall.writeResponse() to a small size (say, 128), I get the
first reply after 32K and then 8K chunks
- I thought that maybe in this context I could lower BUFSIZE in
FastWriter, but that didn't seem to make any change :(

That said, I'm not sure it's worth looking into these buffers any
deeper, because shrinking them might negatively affect other results
(e.g. regular searches or facets). It sounds like the way forward
would be that manual flushing, with chunked encoding enabled. I could
imagine adding some parameters along the lines of "flush every N tuples
or M milliseconds", that could be configured per-request, or at least
globally to the /stream handler.
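
To illustrate, a request could end up looking something like this (note that
the two flush parameters are made up, they don't exist today):

curl --data-urlencode 'expr=search(test,
                                   q="foo_s:*",
                                   fl="foo_s",
                                   sort="foo_s asc",
                                   qt="/export")' \
    "http://localhost:8983/solr/test/stream?flushTuples=100&flushMillis=500"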

What do you think? Would such a patch be welcome, to add these
parameters? But it still requires chunked encoding - would reverting
SOLR-8669 be a problem? Or maybe there's a more elegant way to enable
chunked encoding, maybe only for streams?

Best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Jan 15, 2018 at 10:58 AM, Radu Gheorghe
 wrote:
> Hello fellow solr-users!
>
> Currently, if I do an HTTP request to receive some data via streaming
> expressions, like:
>
> curl --data-urlencode 'expr=search(test,
>q="foo_s:*",
>fl="foo_s",
>sort="foo_s asc",
>qt="/export")'
> http://localhost:8983/solr/test/stream
>
> I get all results at once. This is more obvious if I simply introduce
> a one-second sleep in CloudSolrStream: with three documents, the
> request takes about three seconds, and I get all three docs after
> three seconds.
>
> Instead, I would like to get documents in a more "streaming" way. For
> example, after X seconds give me what you already have. Or if a
> Y-sized buffer fills up, give me all the tuples you have, then resume.
>
> Any ideas/opinions in terms of how I could achieve this? With or
> without changing Solr's code?
>
> Here's what I have so far:
> - this is normal with non-chunked HTTP/1.1. You get all results at
> once. If I revert this patch[1] and get Solr to use chunked encoding,
> I get partial results every... what seems to be a certain size between
> 16KB and 32KB
> - I couldn't find a way to manually change this... what I assume is a
> buffer size, but failed so far. I've tried changing Jetty's
> response.setBufferSize() in HttpSolrCall (maybe the wrong place to do
> it?) and also tried changing the default 8KB buffer in FastWriter
> - manually flushing the writer (in JSONResponseWriter) gives the
> expected results (in combination with chunking)
>
> The thing is, even if I manage to change the buffer size, I assume
> that will apply to all requests (not just streaming expressions). I
> assume that ideally it would be configurable per request. As for
> manual flushing, that would require changes to the streaming
> expressions themselves. Would that be the way to go? What do you
> think?
>
> [1] https://issues.apache.org/jira/secure/attachment/12787283/SOLR-8669.patch
>
> Best regards,
> Radu
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/


Partial results from streaming expressions (i.e. making them "stream")

2018-01-15 Thread Radu Gheorghe
Hello fellow solr-users!

Currently, if I do an HTTP request to receive some data via streaming
expressions, like:

curl --data-urlencode 'expr=search(test,
   q="foo_s:*",
   fl="foo_s",
   sort="foo_s asc",
   qt="/export")'
http://localhost:8983/solr/test/stream

I get all results at once. This is more obvious if I simply introduce
a one-second sleep in CloudSolrStream: with three documents, the
request takes about three seconds, and I get all three docs after
three seconds.

Instead, I would like to get documents in a more "streaming" way. For
example, after X seconds give me what you already have. Or if a
Y-sized buffer fills up, give me all the tuples you have, then resume.

Any ideas/opinions in terms of how I could achieve this? With or
without changing Solr's code?

Here's what I have so far:
- this is normal with non-chunked HTTP/1.1. You get all results at
once. If I revert this patch[1] and get Solr to use chunked encoding,
I get partial results every... what seems to be a certain size between
16KB and 32KB
- I couldn't find a way to manually change this... what I assume is a
buffer size, but failed so far. I've tried changing Jetty's
response.setBufferSize() in HttpSolrCall (maybe the wrong place to do
it?) and also tried changing the default 8KB buffer in FastWriter
- manually flushing the writer (in JSONResponseWriter) gives the
expected results (in combination with chunking)

The thing is, even if I manage to change the buffer size, I assume
that will apply to all requests (not just streaming expressions). I
assume that ideally it would be configurable per request. As for
manual flushing, that would require changes to the streaming
expressions themselves. Would that be the way to go? What do you
think?

[1] https://issues.apache.org/jira/secure/attachment/12787283/SOLR-8669.patch

Best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


autoAddReplicas doesn't respect replicationFactor?

2017-10-03 Thread Radu Gheorghe
Hello,

I'm trying to figure out if this is an issue or I'm doing something
wrong. Basically, with Solr 6.6.1 running on HDFS (Hadoop 2.7.4), I
see that if I create a one-shard collection with replicationFactor=1
(on one node), then add a second node, it creates a new replica on
this new node. I thought this wasn't expected, and that Solr would only
try to keep the number of replicas according to replicationFactor.

Here are some reproducing steps:
- have HDFS set up according to
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
- have Solr 6.6.1 downloaded and extracted
- I also have a Zookeeper set up separately, if it matters (just
extracted and started)
- then, start Solr like

bin/solr start -c -Dsolr.directoryFactory=HdfsDirectoryFactory
-Dsolr.lock.type=hdfs -Dsolr.hdfs.home=hdfs://localhost:9000/solr

- upload a config and create a collection:

bin/solr zk upconfig -n hdfs -d
./server/solr/configsets/data_driven_schema_configs/conf -z
localhost:2181

curl 
"http://localhost:8983/solr/admin/collections?action=CREATE&name=hdfs1&numShards=1&replicationFactor=1&autoAddReplicas=true&collection.configName=hdfs";

- add a document (again, it shouldn't matter)
- start a second node (I copied the whole extracted Solr, just in case):

bin/solr start -c -p 8984 -z localhost:2181
-Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs
-Dsolr.hdfs.home=hdfs://localhost:9000/solr

At this point I have two replicas of my shard (one on each node).

Am I missing something or is this a bug? Maybe replicationFactor=1 is
an edge case?

Best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


Re: Multiple rollups/facets in one streaming aggregation?

2016-08-16 Thread Radu Gheorghe
Thanks a lot, Joel, for your very fast and informative reply!

We'll chew on this and add a Jira if we go down this route.
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Aug 16, 2016 at 8:29 PM, Joel Bernstein  wrote:
> For the initial implementation we could skip the merge piece if that helps
> get things done faster. In this scenario the metrics could be gathered
> after some parallel operation, then there would be no need for a merge.
> Sample syntax:
>
> metrics(parallel(join()))
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Aug 16, 2016 at 1:25 PM, Joel Bernstein  wrote:
>
>> The concept of a MetricStream was in the early designs but hasn't yet been
>> implemented. Now might be a good time to work on the implementation.
>>
>> The MetricStream wraps a stream and gathers metrics in memory, continuing
>> to emit the tuples from the underlying stream. This allows multiple
>> MetricStreams to operate over the same stream without transforming the
>> stream. Pseudo code for a metric expression syntax is below:
>>
>> metrics(metrics(search()))
>>
>> The MetricStream delivers its metrics through the EOF Tuple. So the
>> MetricStream simply adds the finished aggregations to the EOF Tuple and
>> returns it. If we're going to support parallel metric gathering then we'll
>> also need to support the merging of the metrics. Something like this:
>>
>> metrics(parallel(metrics(join())))
>>
>> Where the metrics wrapping the parallel function would need to collect the
>> EOF tuples from each worker, then merge the metrics and emit the
>> merged metrics in an EOF Tuple.
>>
>> If you think this meets your needs, feel free to create a Jira and
>> begin a patch and I can help get it committed.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Tue, Aug 16, 2016 at 11:52 AM, Radu Gheorghe <
>> radu.gheor...@sematext.com> wrote:
>>
>>> Hello Solr users :)
>>>
>>> Right now it seems that if I want to rollup on two different fields
>>> with streaming expressions, I would need to do two separate requests.
>>> This is too slow for our use-case, when we need to do joins before
>>> sorting and rolling up (because we'd have to re-do the joins).
>>>
>>> Since in our case we are actually looking for some not-necessarily
>>> accurate facets (top N), the best solution we could come up with was
>>> to implement a new stream decorator that implements an algorithm like
>>> Count-min sketch[1] which would run on the tuples provided by the
>>> stream function it wraps. This would have two big wins for us:
>>> 1) it would do the facet without needing to sort on the facet field,
>>> so we'll potentially save lots of memory
>>> 2) because sorting isn't needed, we could do multiple facets in one go
>>>
>>> That said, I have two (broad) questions:
>>> A) is there a better way of doing this? Let's reduce the problem to
>>> streaming aggregations, where the assumption is that we have multiple
>>> collections where data needs to be joined, and then facet on fields
>>> from all collections. But maybe there's a better algorithm, something
>>> out of the box or closer to what is offered out of the box?
>>> B) whatever the best way is, could we do it in a way that can be
>>> contributed back to Solr? Any hints on how to do that? Just another
>>> decorator?
>>>
>>> Thanks and best regards,
>>> Radu
>>> --
>>> Performance Monitoring * Log Analytics * Search Analytics
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>> [1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
>>>
>>
>>


Multiple rollups/facets in one streaming aggregation?

2016-08-16 Thread Radu Gheorghe
Hello Solr users :)

Right now it seems that if I want to rollup on two different fields
with streaming expressions, I would need to do two separate requests.
This is too slow for our use-case, when we need to do joins before
sorting and rolling up (because we'd have to re-do the joins).
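
For example, to get counts over two placeholder fields (author_s and genre_s)
in a collection called test, today we would have to run two requests like
these (and repeat any joins in each of them):

curl --data-urlencode 'expr=rollup(search(test, q="*:*", fl="author_s",
                                          sort="author_s asc", qt="/export"),
                                   over="author_s", count(*))' \
    http://localhost:8983/solr/test/stream

curl --data-urlencode 'expr=rollup(search(test, q="*:*", fl="genre_s",
                                          sort="genre_s asc", qt="/export"),
                                   over="genre_s", count(*))' \
    http://localhost:8983/solr/test/stream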

Since in our case we are actually looking for some not-necessarily
accurate facets (top N), the best solution we could come up with was
to implement a new stream decorator that implements an algorithm like
Count-min sketch[1] which would run on the tuples provided by the
stream function it wraps. This would have two big wins for us:
1) it would do the facet without needing to sort on the facet field,
so we'll potentially save lots of memory
2) because sorting isn't needed, we could do multiple facets in one go

That said, I have two (broad) questions:
A) is there a better way of doing this? Let's reduce the problem to
streaming aggregations, where the assumption is that we have multiple
collections where data needs to be joined, and then facet on fields
from all collections. But maybe there's a better algorithm, something
out of the box or closer to what is offered out of the box?
B) whatever the best way is, could we do it in a way that can be
contributed back to Solr? Any hints on how to do that? Just another
decorator?

Thanks and best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

[1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch