Re: How to block expensive solr queries

2019-10-10 Thread Wei
On Wed, Oct 9, 2019 at 9:59 AM Wei  wrote:

> Thanks all. I debugged a bit and see that timeAllowed does not limit the
> stats call. Also I think it would be useful for Solr to support a white
> list or black list of operations as Toke suggested. Will create a JIRA
> for it. Currently it seems the only option to explore is adding a filter
> to Solr's embedded Jetty.  Does anyone have experience doing that? Do I
> also need to change SolrDispatchFilter?
>
> On Tue, Oct 8, 2019 at 3:50 AM Toke Eskildsen  wrote:
>
>> On Mon, 2019-10-07 at 10:18 -0700, Wei wrote:
>> > /solr/mycollection/select?stats=true&stats.field=unique_ids&stats.calcdistinct=true
>> ...
>> > Is there a way to block certain solr queries based on url pattern?
>> > i.e. ignore the stats.calcdistinct request in this case.
>>
>> It sounds like it is possible for users to issue arbitrary queries
>> against your Solr installation. As you have noticed, it makes it easy
>> to perform a Denial Of Service (intentional or not). Filtering out
>> stats.calcdistinct won't help with the next request for
>> group.ngroups=true, facet.field=unique_id&facet.limit=-1,
>> rows=1 or some other expensive request.
>>
>> I recommend you flip your logic and only allow specific types of
>> requests and put limits on those. To my knowledge that is not a
>> built-in feature of Solr.
>>
>> - Toke Eskildsen, Royal Danish Library
>>
>>
>>
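For anyone exploring the embedded-Jetty filter route Wei mentions: the core of such a filter is just a check of the raw query string against a list of disallowed parameter names, which would then be wired into a servlet Filter registered ahead of SolrDispatchFilter. A minimal sketch of that check (the class name and blocklist contents are illustrative assumptions, not Solr APIs):

```java
import java.util.Arrays;
import java.util.Set;

public class ParamBlocklist {

    // Illustrative blocklist of expensive parameters; extend as needed.
    private static final Set<String> BLOCKED =
            Set.of("stats.calcdistinct", "group.ngroups");

    /** Returns true when the raw query string names a blocked parameter. */
    public static boolean isBlocked(String queryString) {
        if (queryString == null) {
            return false;
        }
        return Arrays.stream(queryString.split("&"))
                .map(pair -> pair.split("=", 2)[0])
                .anyMatch(BLOCKED::contains);
    }

    public static void main(String[] args) {
        System.out.println(isBlocked("stats=true&stats.calcdistinct=true")); // true
        System.out.println(isBlocked("q=*:*&rows=10"));                      // false
    }
}
```

Inside a real javax.servlet.Filter, a true result would translate into something like response.sendError(403) before the chain reaches SolrDispatchFilter. As Toke notes, an allowlist of known-safe request shapes is the more robust inversion of this logic.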


Solr-8.2.0 Cannot create collection on CentOS 7.7

2019-10-10 Thread Peter Davie
I have just installed Solr 8.2.0 on CentOS 7.7.1908.   Java version is 
as follows:


openjdk version "11.0.4" 2019-07-16 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.4+11-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.4+11-LTS, mixed mode, sharing)

I am using the following command to create a collection "test" on SolrCloud:

solr create_collection -c test

The output from the command follows:

WARNING: Using _default configset with data driven schema functionality. 
NOT RECOMMENDED for production use.
 To turn off: bin/solr config -c test -p 8983 -action 
set-user-property -property update.autoCreateFields -value false


ERROR: Failed to create collection 'test' due to: Underlying core 
creation failed while creating collection: test


The problem seems to be caused by the following error:

Caused by: java.time.format.DateTimeParseException: Text 
'2019-10-11T04:46:03.971Z' could not be parsed: null
    at 
java.base/java.time.format.DateTimeFormatter.createError(DateTimeFormatter.java:2017)
    at 
java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1920)
    at 
org.apache.solr.update.processor.ParseDateFieldUpdateProcessorFactory.parseInstant(ParseDateFieldUpdateProcessorFactory.java:230)
    at 
org.apache.solr.update.processor.ParseDateFieldUpdateProcessorFactory.validateFormatter(ParseDateFieldUpdateProcessorFactory.java:214)
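One quick sanity check: the timestamp in the exception is itself valid ISO-8601, which suggests the failure lies in how the configured formatter interacts with that particular JVM build rather than in the text being parsed. A standalone check using only the JDK (no Solr involved):

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;

public class DateParseCheck {
    public static void main(String[] args) {
        // The exact text from the DateTimeParseException above.
        String ts = "2019-10-11T04:46:03.971Z";

        // ISO_INSTANT is the canonical formatter for Solr date fields.
        TemporalAccessor parsed = DateTimeFormatter.ISO_INSTANT.parse(ts);
        System.out.println(Instant.from(parsed)); // 2019-10-11T04:46:03.971Z
    }
}
```

If this parses cleanly on the CentOS box's JDK 11.0.4 but collection creation still fails, that points at the date patterns validated by ParseDateFieldUpdateProcessorFactory during core creation (per the stack trace) in combination with that JDK build; comparing behaviour against the working 11.0.2 install would be a reasonable next step.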


Note that I have tested this and it is working on Windows 10 with Solr 
8.2.0 using the following Java version:


openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)

The full detail from the solr.log file follows:

2019-10-11 04:45:58.361 INFO  (qtp195801026-19) [   ] 
o.a.s.h.a.CollectionsHandler Invoked Collection Action :create with 
params 
replicationFactor=1&maxShardsPerNode=-1&collection.configName=test&name=test&action=CREATE&numShards=1&wt=json 
and sendToOCPQueue=true
2019-10-11 04:45:58.445 INFO 
(OverseerThreadFactory-9-thread-1-processing-n:192.168.1.33:8983_solr) 
[   ] o.a.s.c.a.c.CreateCollectionCmd Create collection test
2019-10-11 04:45:58.735 INFO 
(OverseerStateUpdate-72057977101680640-192.168.1.33:8983_solr-n_00) 
[   ] o.a.s.c.o.SliceMutator createReplica() {

  "operation":"ADDREPLICA",
  "collection":"test",
  "shard":"shard1",
  "core":"test_shard1_replica_n1",
  "state":"down",
  "base_url":"http://192.168.1.33:8983/solr",
  "type":"NRT",
  "waitForFinalState":"false"}
2019-10-11 04:45:59.114 INFO  (qtp195801026-21) [ 
x:test_shard1_replica_n1] o.a.s.h.a.CoreAdminOperation core create 
command 
qt=/admin/cores&coreNodeName=core_node2&collection.configName=test&newCollection=true&name=test_shard1_replica_n1&action=CREATE&numShards=1&collection=test&shard=shard1&wt=javabin&version=2&replicaType=NRT
2019-10-11 04:45:59.119 INFO  (qtp195801026-21) [ 
x:test_shard1_replica_n1] o.a.s.c.TransientSolrCoreCacheDefault 
Allocating transient cache for 2147483647 transient cores
2019-10-11 04:46:00.389 INFO  (qtp195801026-21) [c:test s:shard1 
r:core_node2 x:test_shard1_replica_n1] o.a.s.c.RequestParams conf 
resource params.json loaded . version : 0
2019-10-11 04:46:00.390 INFO  (qtp195801026-21) [c:test s:shard1 
r:core_node2 x:test_shard1_replica_n1] o.a.s.c.RequestParams request 
params refreshed to version 0
2019-10-11 04:46:00.424 INFO  (qtp195801026-21) [c:test s:shard1 
r:core_node2 x:test_shard1_replica_n1] o.a.s.c.SolrResourceLoader 
[test_shard1_replica_n1] Added 61 libs to classloader, from paths: 
[/opt/solr-8.2.0/contrib/clustering/lib, 
/opt/solr-8.2.0/contrib/extraction/lib, 
/opt/solr-8.2.0/contrib/langid/lib, 
/opt/solr-8.2.0/contrib/velocity/lib, /opt/solr-8.2.0/dist]
2019-10-11 04:46:00.814 INFO  (qtp195801026-21) [c:test s:shard1 
r:core_node2 x:test_shard1_replica_n1] o.a.s.c.SolrConfig Using Lucene 
MatchVersion: 8.2.0
2019-10-11 04:46:01.323 INFO  (qtp195801026-21) [c:test s:shard1 
r:core_node2 x:test_shard1_replica_n1] o.a.s.s.IndexSchema 
[test_shard1_replica_n1] Schema name=default-config
2019-10-11 04:46:03.017 INFO  (qtp195801026-21) [c:test s:shard1 
r:core_node2 x:test_shard1_replica_n1] o.a.s.s.IndexSchema Loaded schema 
default-config/1.6 with uniqueid field id
2019-10-11 04:46:03.205 INFO  (qtp195801026-21) [c:test s:shard1 
r:core_node2 x:test_shard1_replica_n1] o.a.s.c.CoreContainer Creating 
SolrCore 'test_shard1_replica_n1' using configuration from collection 
test, trusted=true
2019-10-11 04:46:03.212 INFO  (qtp195801026-21) [c:test s:shard1 
r:core_node2 x:test_shard1_replica_n1] o.a.s.m.r.SolrJmxReporter JMX 
monitoring for 'solr.core.test.shard1.replica_n1' (registry 
'solr.core.test.shard1.replica_n1') enabled at server: 
com.sun.jmx.mbeanserver.JmxMBeanServer@606d8acf
2019-10-11 04:46:03.258 INFO  (qtp195801026-21) [c:test s:shard1 
r:core_node2 x:test_shard1_replica_n1] o.a.s.c.SolrCore 
[[test_shard1_replica_n1] ] Opening new SolrCore at 
[/opt/solr-8.2.0/server/solr/test_shard1_replica_n1], 
dataDir=[/opt/solr-8.2.0/server/solr/test_shard1_replica_n1/data/]
2019-10-11 04:46:03.496 INFO  (qtp195801026-21) [c:test s:shard1 

igain query parser generating invalid output

2019-10-10 Thread Peter Davie

Hi,

I apologise in advance for the length of this email, but I want to share 
my discovery steps to make sure that I haven't missed anything during my 
investigation...


I am working on a classification project and will be using the 
classify(model()) stream function to classify documents.  I have noticed 
that models generated include many noise terms from the (lexically) 
early part of the term list.  To test, I have used the BBC articles 
fulltext and category dataset from Kaggle 
(https://www.kaggle.com/yufengdev/bbc-fulltext-and-category). I have 
indexed the data into a Solr collection (news_categories) and am 
performing the following operation to generate a model for documents 
categorised as "BUSINESS" (only keeping the 100th iteration):


having(
    train(
        news_categories,
        features(
        news_categories,
        zkHost="localhost:9983",
        q="*:*",
        fq="role:train",
        fq="category:BUSINESS",
        featureSet="business",
        field="body",
        outcome="positive",
        numTerms=500
        ),
        fq="role:train",
        fq="category:BUSINESS",
        zkHost="localhost:9983",
        name="business_model",
        field="body",
        outcome="positive",
        maxIterations=100
    ),
    eq(iteration_i, 100)
)

The output generated includes "noise" terms, such as the following 
"1,011.15", "10.3m", "01", "02", "03", "10.50", "04", "05", "06", "07", 
"09", and these terms all have the same value for idfs_ds ("-Infinity").


Investigating the "features()" output, it seems that the issue is that 
the noise terms are being returned with NaN for the score_f field:


    "docs": [
  {
    "featureSet_s": "business",
    "score_f": "NaN",
    "term_s": "1,011.15",
    "idf_d": "-Infinity",
    "index_i": 1,
    "id": "business_1"
  },
  {
    "featureSet_s": "business",
    "score_f": "NaN",
    "term_s": "10.3m",
    "idf_d": "-Infinity",
    "index_i": 2,
    "id": "business_2"
  },
  {
    "featureSet_s": "business",
    "score_f": "NaN",
    "term_s": "01",
    "idf_d": "-Infinity",
    "index_i": 3,
    "id": "business_3"
  },
  {
    "featureSet_s": "business",
    "score_f": "NaN",
    "term_s": "02",
    "idf_d": "-Infinity",
    "index_i": 4,
    "id": "business_4"
  },...

I have examined the code within 
org/apache/solr/client/solrj/io/stream/FeaturesSelectionStream.java and 
see that the scores being returned by {!igain} include NaN values, as 
follows:


{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":20,
    "params":{
  "q":"*:*",
  "distrib":"false",
  "positiveLabel":"1",
  "field":"body",
  "numTerms":"300",
  "fq":["category:BUSINESS",
    "role:train",
    "{!igain}"],
  "version":"2",
  "wt":"json",
  "outcome":"positive",
  "_":"1569982496170"}},
  "featuredTerms":[
    "0","NaN",
    "0.0051","NaN",
    "0.01","NaN",
    "0.02","NaN",
    "0.03","NaN",

Looking into org/apache/solr/search/IGainTermsQParserPlugin.java, it 
seems that when a term is not included in the positive or negative 
documents, the docFreq calculation (docFreq = xc + nc) is 0, which means 
that subsequent calculations result in NaN (division by 0), which 
generates these meaningless values for the computed score.
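The arithmetic can be confirmed in isolation: in Java floating point, 0.0/0 yields NaN rather than throwing, so a term with docFreq == 0 silently poisons its score. A simplified illustration of this failure mode plus a docFreq == 0 guard (the score formula here is a stand-in, not the real information-gain computation):

```java
public class IGainGuard {

    // Stand-in score: fraction of positive docs among docs containing
    // the term. The real IGain computation differs but fails the same way.
    static double score(int xc, int nc) {
        int docFreq = xc + nc;
        return (double) xc / docFreq;
    }

    public static void main(String[] args) {
        System.out.println(score(3, 1)); // 0.75
        System.out.println(score(0, 0)); // NaN -- 0/0 in floating point

        // Guarding amounts to skipping such terms before emitting them:
        int xc = 0, nc = 0;
        if (xc + nc > 0) {
            System.out.println(score(xc, nc));
        } else {
            System.out.println("term skipped (docFreq == 0)");
        }
    }
}
```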


I have patched a local version of Solr to skip terms for which docFreq 
is 0 in the finish() method of IGainTermsQParserPlugin and this is now 
the result:


{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":260,
    "params":{
  "q":"*:*",
  "distrib":"false",
  "positiveLabel":"1",
  "field":"body",
  "numTerms":"300",
  "fq":["category:BUSINESS",
    "role:train",
    "{!igain}"],
  "version":"2",
  "wt":"json",
  "outcome":"positive",
  "_":"1569983546342"}},
  "featuredTerms":[
    "3",-0.0173133558644304,
    "authority",-0.0173133558644304,
    "brand",-0.0173133558644304,
    "commission",-0.0173133558644304,
    "compared",-0.0173133558644304,
    "condition",-0.0173133558644304,
    "continuing",-0.0173133558644304,
    "deficit",-0.0173133558644304,
    "expectation",-0.0173133558644304,

To my (admittedly inexpert) eye, it seems like this is producing more 
reasonable results.


With this change in place, train() now produces:

    "idfs_ds": [
  0.6212826193303013,
  0.6434237452075148,
  0.7169578292536639,
  0.741349282377823,
  0.86843471069652,
  1.0140549006400466,
  1.0639267306802198,
  1.0753554265038423,...

"terms_ss": [ "â", "company", "market", "firm", "month", "analyst", 
"chief", "time", ...]

I am not sure if I have missed anything, but this seems like it's 
producing better outcomes. I would appreciate any input on whether I 
have missed anything.

Re: Zk Status Error

2019-10-10 Thread Shawn Heisey

On 10/10/2019 9:00 AM, mdsholund wrote:

I am also getting this error using ZK 3.5.5 and Solr 7.7.2.  I have
whitelisted mntr but still get a similar exception

2019-10-10 14:59:01.799 ERROR (qtp591391158-152) [   ] o.a.s.s.HttpSolrCall
null:java.lang.ArrayIndexOutOfBoundsException: 1
 at
org.apache.solr.handler.admin.ZookeeperStatusHandler.monitorZookeeper(ZookeeperStatusHandler.java:189)

As far as I know my ensemble is working fine.  The output of the mntr
command looks like it is all at least two values.  Is there a way that I can
see what it is choking on?


You may be running into this problem:

https://issues.apache.org/jira/browse/SOLR-13672

ZK 3.5 changed the output of the "conf" 4lw command in a way that is 
incompatible with Solr code.  We consider the problem to be a ZK bug, 
but worked around it in Solr because ZK's typical release schedule is 
very slow.  A new Solr release will come out long before a new ZK release.


Even Solr 8.2.0, which was updated to the ZK 3.5.5 client, has this 
problem.  It will be fixed in 8.3.0 when that version is released.
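For what it's worth, an `ArrayIndexOutOfBoundsException: 1` is the classic signature of splitting a line on a separator and indexing element [1] without checking that the split produced two parts; ZK 3.5's four-letter-word output contains lines that no longer fit the old key=value shape. A standalone illustration of that failure mode and the defensive parse (not the actual ZookeeperStatusHandler code):

```java
import java.util.HashMap;
import java.util.Map;

public class FourLetterWordParse {

    /** Parses key=value lines, skipping any line without a '='. */
    static Map<String, String> parse(String output) {
        Map<String, String> result = new HashMap<>();
        for (String line : output.split("\n")) {
            String[] parts = line.split("=", 2);
            // A naive parts[1] here throws AIOOBE on separator-less lines.
            if (parts.length == 2) {
                result.put(parts[0].trim(), parts[1].trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // ZK 3.5-style output mixing key=value lines with bare section lines.
        String conf = "clientPort=2181\nmembership: \nversion=3.5.5";
        System.out.println(parse(conf)); // only the two key=value lines survive
    }
}
```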


Thanks,
Shawn


AutoAddReplicas doesn't work with TLOG and PULL replicas

2019-10-10 Thread David Kovář
We would like to use Solr in a "master-slave" configuration (3 TLOG 
replicas as "masters" and several PULL replicas as "slaves" for read 
queries). The autoAddReplicas option is turned on. Here is an example of 
the initialization query:

http://10.0.48.200:9092/solr/admin/collections?action=CREATE&autoAddReplicas=true&name=search_cz&maxShardsPerNode=10&collection.configName=search_index&numShards=2&tlogReplicas=2&pullReplicas=2&router.field=routing_key&router.name=compositeId

Picture_1 is a screenshot of the live configuration from the Admin UI.

Then I restart one server which hosts the "slave" nodes. About 2 
minutes after the new server starts, the autoAddReplicas process in 
Solr creates new replicas on the new server, but it does not respect 
the replica type: it always starts an NRT replica, which is wrong.


See picture_2, taken after the server restart.

Do you have any solution for automatically surviving a single-server 
crash (auto-creating replicas of the correct type on the new server and 
migrating data) when using TLOG and PULL replicas?


Thank you for any answers.
David Kovar



RE: Zk Status Error

2019-10-10 Thread mdsholund
I am also getting this error using ZK 3.5.5 and Solr 7.7.2.  I have
whitelisted mntr but still get a similar exception

2019-10-10 14:59:01.799 ERROR (qtp591391158-152) [   ] o.a.s.s.HttpSolrCall
null:java.lang.ArrayIndexOutOfBoundsException: 1
at
org.apache.solr.handler.admin.ZookeeperStatusHandler.monitorZookeeper(ZookeeperStatusHandler.java:189)


As far as I know my ensemble is working fine.  The output of the mntr
command looks like it is all at least two values.  Is there a way that I can
see what it is choking on?



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html