Re: Range faceting on timestamp field

2020-12-24 Thread Erick Erickson
Then you need to form your “start” value relative to your timezone.

What I’d actually recommend is that if you need to bucket by day,
you index the day in a separate field. Of course, if you have to
bucket by day in arbitrary timezones that won’t work…..
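If the offset is fixed and known (GMT+2 in this case), another option is to
shift the range boundaries by that offset so each 86400-second bucket starts
at local midnight instead of UTC midnight. A sketch with illustrative epoch
values (1338501600 is 2012-05-31 22:00 UTC, i.e. 2012-06-01 00:00 in GMT+2):

"type": "range",
"field": "timestamp_s",
"start": 1338501600,
"end": 1339279200,
"gap": 86400,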

Best,
Erick

> On Dec 24, 2020, at 4:42 PM, ufuk yılmaz  wrote:
> 
> Hello all,
> 
> I have a plong field in my schema representing a Unix timestamp
> 
> 
> 
> I’m doing a range facet over this field to find which event occurred on which 
> day. I’m setting “start” to some date at 00:00, “end” to another, and 
> setting “gap” to 86400 (the total seconds in a day)
> ...
> "type": "range",
> "field": "timestamp_s",
> "start": 1338498000,
> "end": 1339275600,
> "gap": 86400,
> ...
> 
> Let’s say that an event occurred at 19:00 GMT+00. This facet puts it in the 
> bucket of that day, which starts at 00:00. I live in the GMT+2 timezone, so 
> my clock read 21:00 and that event occurred on the same day as me, which is all 
> good and correct.
> 
> Another event occurred at 23:00 GMT+00 on Day 2. At that time, it was 01:00 on Day 
> 3 here. Faceting puts the event in Day 2’s 00:00 bucket, which, when converted to my 
> timezone, puts the event on Day 2. But it was Day 3 here when the event 
> happened...
> 
> I hope I didn’t bore the hell out of you. Do you have any suggestion to solve 
> this problem? Unfortunately my timestamp field is not a date field, and I need 
> to show the results from my perspective, not from universal time.
> 
> Have a nice day!
> 
> Sent from Mail for Windows 10
> 



Re: Data Import Handler (DIH) - Installing and running

2020-12-23 Thread Erick Erickson
Have you done what the message says and looked at your Solr log? If so,
what information is there?

> On Dec 23, 2020, at 5:13 AM, DINSD | SPAutores 
>  wrote:
> 
> Hi,
> 
> I'm trying to install the package "data-import-handler", since it was 
> discontinued from the core Solr distro.
> 
> https://github.com/rohitbemax/dataimporthandler
> 
> However, as soon as the first command is carried out
> 
> solr -c -Denable.packages=true
> 
> I get this screen in the web interface:
> 
> 
> 
> Has anyone been through this, or have any idea why it's happening ?
> 
> Thanks for any help
> Rui Pimentel
> 
> 
> 
> DINSD - Departamento de Informática / SPA Digital
> Av. Duque de Loulé, 31 - 1069-153 Lisboa  PORTUGAL
> T (+ 351) 21 359 44 36 / (+ 351) 21 359 44 00  F (+ 351) 21 353 02 57
>  informat...@spautores.pt
>  www.SPAutores.pt
> 



Re: Solr cloud facet query returning incorrect results

2020-12-21 Thread Erick Erickson
This should work as you expect, so the first thing I’d do 
is add &debug=query and see the parsed query in both cases.
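Something along these lines (host and collection are placeholders):

curl 'http://localhost:8983/solr/mycollection/select?q=organization:abc&facet=true&facet.field=doc_id&facet.mincount=2&facet.limit=10&rows=0&debug=query'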

If that doesn’t show anything, please post the 
full debug response in both cases.

Best,
Erick

> On Dec 21, 2020, at 4:31 AM, Alok Bhandari  wrote:
> 
> Hello All ,
> 
> we are using Solr 6.2. In the schema that we use, we have an integer field. For
> a given query we want to know how many documents have a duplicate value for
> that field; for example, how many documents have the same doc_id=10.
> 
> So to find this information we fire a query to SolrCloud with the following
> parameters:
> 
> "q":"organization:abc",
>  "facet.limit":"10",
>  "facet.field":"doc_id",
>  "indent":"on",
>  "fl":"archive_id",
>  "facet.mincount":"2",
>"facet":"true",
> 
> 
> But in the response we get that there are no documents having a duplicate
> doc_id, as in the facet query response we are not getting any facet_counts.
> But if I change the query to "q":"organization:abc AND doc_id:10"
> then in the response I can see that there are 3 docs with doc_id=10.
> 
> This behavior seems contrary to how facets behave, so I wanted to know
> if there is any possible reason for this type of behavior.



Re: solrCloud client socketTimeout initiates retries

2020-12-18 Thread Erick Erickson
Right, there are several alternatives. Try going here:
http://jirasearch.mikemccandless.com/search.py?index=jira

and search for “circuit breaker” and you’ll find a bunch
of JIRAs. Unfortunately, some are in 8.8..

That said, some of the circuit breakers are in much earlier
releases. Would it suffice until you can upgrade to set
the circuit breakers?

One problem with your solution is that the query keeps
on running, admittedly on only one replica of each shard.
With circuit breakers, the query itself is stopped, thus freeing
up resources.

Additionally, if you see a pattern (for instance, certain
wildcard patterns) you could intercept that before sending.
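As a very rough sketch of that kind of client-side guard (the class, method
and pattern below are purely illustrative, not an existing Solr API):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import java.util.regex.Pattern;

public class GuardedQuery {
  // hypothetical guard: reject leading-wildcard terms before they reach the cluster
  private static final Pattern LEADING_WILDCARD = Pattern.compile("(^|[\\s(:])[*?]");

  public static QueryResponse query(SolrClient client, String collection, String q)
      throws Exception {
    if (LEADING_WILDCARD.matcher(q).find()) {
      throw new IllegalArgumentException("Rejected potentially expensive wildcard query: " + q);
    }
    return client.query(collection, new SolrQuery(q));
  }
}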

Best,
Erick

> On Dec 18, 2020, at 8:52 AM, kshitij tyagi  wrote:
> 
> Hi Erick,
> 
> I agree, but in a huge cluster the retries keep on happening; can't we have
> this feature implemented in the client?
> I was referring to this JIRA:
> https://issues.apache.org/jira/browse/SOLR-10479
> We have seen that some malicious queries come to the system and take
> significant time, and these queries, propagating to other Solr servers, choke
> the entire cluster.
> 
> Regards,
> kshitij
> 
> 
> 
> 
> 
> On Fri, Dec 18, 2020 at 7:12 PM Erick Erickson 
> wrote:
> 
>> Why do you want to do this? This sounds like an XY problem, you
>> think you’re going to solve some problem X by doing Y. Y in this case
>> is setting the numServersToTry, but you haven’t explained what X,
>> the problem you’re trying to solve is.
>> 
>> Offhand, this seems like a terrible idea. If your requests are timing
>> out, what purpose is served by _not_ trying the next one on the
>> list? With, of course, a much longer timeout interval…
>> 
>> The code is structured that way on the theory that you want the request
>> to succeed and the system needs to be tolerant of momentary
>> glitches due to network congestion, reading indexes into memory, etc.
>> Bypassing that assumption needs some justification….
>> 
>> Best,
>> Erick
>> 
>>> On Dec 18, 2020, at 6:23 AM, kshitij tyagi 
>> wrote:
>>> 
>>> Hi,
>>> 
>>> We have a Solrcloud setup and are using CloudSolrClient, What we are
>> seeing
>>> is if socketTimeoutOccurs then the same request is sent to other solr
>>> server.
>>> 
>>> So if I set socketTimeout to a very low value say 100ms and my query
>> takes
>>> around 200ms then client tries to query second server, then next and so
>>> on(basically all available servers with same query).
>>> 
>>> I see that we have *numServersToTry* in LBSolrClient class but not able
>> to
>>> set this using CloudSolrClient. Using this we can restrict the above
>>> feature.
>>> 
>>> Should a jira be created to support numServersToTry by CloudSolrClient?
>> Or
>>> is there any other way to control the request to other solr servers?.
>>> 
>>> Regards,
>>> kshitij
>> 
>> 



Re: Data Import Blocker - Solr

2020-12-18 Thread Erick Erickson
Have you tried escaping that character?
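Since the query lives inside an XML attribute in the DIH data-config, the '<'
has to be written as an XML entity. Roughly (the surrounding entity definition
is just a sketch, not your actual config):

<entity name="myView"
        query="SELECT ..., CASE WHEN (`v`.`live_type_id` &lt;&gt; 1) THEN 100 ... END FROM my_view">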

> On Dec 18, 2020, at 2:03 AM, basel altameme  
> wrote:
> 
> Dear,
> While trying to import and index data from a MySQL DB custom view I am facing the 
> error below:
> Data Config problem: The value of attribute "query" associated with an 
> element type "entity" must not contain the '<' character.
> Please note that in my SQL statements I am using '<>' as an operator for 
> comparing only.
> Sample line:
> when (`v`.`live_type_id` <> 1) then 100
> 
> Kindly advise.
> Regards, Basel
> 



Re: solrCloud client socketTimeout initiates retries

2020-12-18 Thread Erick Erickson
Why do you want to do this? This sounds like an XY problem, you
think you’re going to solve some problem X by doing Y. Y in this case
is setting the numServersToTry, but you haven’t explained what X,
the problem you’re trying to solve is.

Offhand, this seems like a terrible idea. If your requests are timing
out, what purpose is served by _not_ trying the next one on the
list? With, of course, a much longer timeout interval…

The code is structured that way on the theory that you want the request
to succeed and the system needs to be tolerant of momentary
glitches due to network congestion, reading indexes into memory, etc.
Bypassing that assumption needs some justification….

Best,
Erick

> On Dec 18, 2020, at 6:23 AM, kshitij tyagi  wrote:
> 
> Hi,
> 
> We have a SolrCloud setup and are using CloudSolrClient. What we are seeing
> is that if a socket timeout occurs, the same request is sent to another Solr
> server.
> 
> So if I set socketTimeout to a very low value, say 100ms, and my query takes
> around 200ms, then the client tries to query a second server, then the next, and so
> on (basically all available servers with the same query).
> 
> I see that we have *numServersToTry* in the LBSolrClient class but I am not able to
> set it using CloudSolrClient. Using it we could restrict the above
> behavior.
> 
> Should a jira be created to support numServersToTry by CloudSolrClient? Or
> is there any other way to control the request to other solr servers?.
> 
> Regards,
> kshitij



Re: Best example solrconfig.xml?

2020-12-15 Thread Erick Erickson
I’d start with that config set, making sure that “schemaless” is disabled.

Do be aware that some of the defaults have changed, although the big change for 
docValues was there in 6.0.

One thing you might want to do is set uninvertible=false in your schema. 
That’ll cause Solr to barf if you, say, sort, facet, group on a field that does 
_not_ have docValues=true. I suspect this will cause no surprises for you, but 
it’s kind of a nice backstop to keep from having surprises in terms of heap 
size…
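For example, on a field you sort or facet on (the field name is just an example):

<field name="category" type="string" indexed="true" stored="true"
       docValues="true" uninvertible="false"/>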

Best,
Erick

> On Dec 15, 2020, at 6:56 PM, Walter Underwood  wrote:
> 
> We’re moving from 6.6 to 8.7 and I’m thinking of starting with an 8.7 
> solrconfig.xml and porting our changes into it.
> 
> Is this the best one to start with?
> 
> solr/server/solr/configsets/_default/conf/solrconfig.xml
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 



Re: Solr Collection Reload

2020-12-15 Thread Erick Erickson
Well, there’s no information here to help. 

The first thing I’d check is what the Solr
logs are saying. Especially if you’ve
changed any of your configuration files.

If that doesn’t show anything, I'd take a thread
dump and look at that, perhaps there’s some
deadlock.
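The admin UI has a Thread Dump screen, or from a shell something like this
(assuming a single Solr JVM on the box):

jstack $(pgrep -f start.jar) > /tmp/solr-threads.txt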

But that said, a reload shouldn’t take more time
than a startup…

Best,
Erick

> On Dec 14, 2020, at 5:44 PM, Moulay Hicham  wrote:
> 
> Hi,
> 
> I have an issue with the collection reload API. The reload seems to be
> hanging. It's been in the running state for many days.
> 
> Can you please suggest any documentation which explains the reload
> task under the hood steps?
> 
> FYI. I am using solr 8.1
> 
> Thanks,
> 
> Moulay



Re: No numShards attribute exists in 'core.properties' with the newly added replica

2020-12-08 Thread Erick Erickson
I raised this JIRA: https://issues.apache.org/jira/browse/SOLR-15035

What’s not clear to me is whether numShards should even be in core.properties 
at all, even on the create command. In the state.json file it’s a 
collection-level property and not reflected in the individual replica’s 
information.

However, we should be consistent.

Best,
Erick

> On Dec 8, 2020, at 4:34 AM, Dawn  wrote:
> 
> Hi
> 
>   Solr8.7.0
> 
>   No numShards attribute exists in 'core.properties' for a newly added 
> replica. This causes numShards to be null when using CloudDescriptor.
> 
>   Since the ADDREPLICA command does not get the numShards property, the 
> coreProps will not save numShards in the constructor that creates the 
> CoreDescriptor, so the 'core.properties' file is generated without 
> numShards.
> 
>   Can the numShards attribute be added to the process of adding a 
> replica so that the replica's 'core.properties' file contains the numShards 
> attribute?



Re: optimize boosting parameters

2020-12-08 Thread Erick Erickson
Before worrying about it too much, exactly _how_ much has
the performance changed?

I’ve just been in too many situations where there’s
no objective measure of performance before and after, just
someone saying “it seems slower” and had those performance
changes disappear when a rigorous test is done. Then spent
a lot of time figuring out that the person reporting the 
problem hadn’t had coffee yet. Or the network was slow.
Or….

If it does turn out to be the boosting (and IIRC the
map function can be expensive), can you pre-compute some
number of the boosts? Your requirements look
like they can be computed at index time, then boost
by just the value of the pre-computed field. BTW, boosts < 1.0
_reduce_ the score. I mention that just in case that’s a surprise ;)
Of course that means that to change the boosting you need
to re-index.
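A sketch of what that could look like (the field name is hypothetical):
compute the combined boost per document at index time, store it in a numeric
docValues field, and the whole bf chain collapses to a single function:

bf=field(precomputed_boost)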

You use termfreq, which changes of course, but:

1> if your corpus is updated often enough, the termfreqs will be relatively 
stable; in that case you can pre-compute them too.

2> your problem statement has nothing to do with termfreq, so why are you
using it in the first place?

Best,
Erick

> On Dec 8, 2020, at 12:46 AM, Radu Gheorghe  wrote:
> 
> Hi Derek,
> 
> Ah, then my reply was completely off :)
> 
> I don’t really see a better way. Maybe other than changing termfreq to field, 
> if the numeric field has docValues? That may be faster, but I don’t know for 
> sure.
> 
> Best regards,
> Radu
> --
> Sematext Cloud - Full Stack Observability - https://sematext.com
> Solr and Elasticsearch Consulting, Training and Production Support
> 
>> On 8 Dec 2020, at 06:17, Derek Poh  wrote:
>> 
>> Hi Radu
>> 
>> Apologies for not making myself clear.
>> 
>> I would like to know if there is a more simple or efficient way to craft the 
>> boosting parameters based on the requirements.
>> 
>> For example, I am using 'if', 'map' and 'termfreq' functions in the bf 
>> parameters.
>> 
>> Is there a more efficient or simpler function that can be used instead? Or 
>> a way to craft the 'formula' more efficiently?
>> 
>> On 7/12/2020 10:05 pm, Radu Gheorghe wrote:
>>> Hi Derek,
>>> 
>>> It’s hard to tell whether your boosts can be made better without knowing 
>>> your data and what users expect of it. Which is a problem in itself.
>>> 
>>> I would suggest gathering judgements, like if a user queries for X, what 
>>> doc IDs do you expect to get back?
>>> 
>>> Once you have enough of these judgements, you can experiment with boosts 
>>> and see how the query results change. There are measures such as nDCG (
>>> https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG
>>> ) that can help you measure that per query, and you can average this score 
>>> across all your judgements to get an overall measure of how well you’re 
>>> doing.
>>> 
>>> Or even better, you can have something like Quaerite play with boost values 
>>> for you:
>>> 
>>> https://github.com/tballison/quaerite/blob/main/quaerite-examples/README.md#genetic-algorithms-ga-runga
>>> 
>>> 
>>> Best regards,
>>> Radu
>>> --
>>> Sematext Cloud - Full Stack Observability - 
>>> https://sematext.com
>>> 
>>> Solr and Elasticsearch Consulting, Training and Production Support
>>> 
>>> 
 On 7 Dec 2020, at 10:51, Derek Poh 
 wrote:
 
 Hi
 
 I have added the following boosting requirements to the search query of a 
 page. Feedback from monitoring team is that the overall response of the 
 page has increased since then.
 I am trying to find out if the added boosting parameters (below) could 
 have contributed to the increase.
 
 The boosting is working as per requirements.
 
 May I know if the implemented boosting parameters can be enhanced or 
 optimized further?
 Hopefully to improve on the response time of the query and the page.
 
 Requirements:
 1. If P_SupplierResponseRate is:
   a. 3, boost by 0.4
   b. 2, boost by 0.2
 
 2. If P_SupplierResponseTime is:
   a. 4, boost by 0.4
   b. 3, boost by 0.2
 
 3. If P_MWSScore is:
   a. between 80-100, boost by 1.6
   b. between 60-79, boost by 0.8
 
 4. If P_SupplierRanking is:
   a. 3, boost by 0.3
   b. 4, boost by 0.6
   c. 5, boost by 0.9
   b. 6, boost by 1.2
 
 Boosting parameters implemented:
 bf=map(P_SupplierResponseRate,3,3,0.4,0)
 bf=map(P_SupplierResponseRate,2,2,0.2,0)
 
 bf=map(P_SupplierResponseTime,4,4,0.4,0)
 bf=map(P_SupplierResponseTime,3,3,0.2,0)
 
 bf=map(P_MWSScore,80,100,1.6,0)
 bf=map(P_MWSScore,60,79,0.8,0)
 
 bf=if(termfreq(P_SupplierRanking,3),0.3,if(termfreq(P_SupplierRanking,4),0.6,if(termfreq(P_SupplierRanking,5),0.9,if(termfreq(P_SupplierRanking,6),1.2,0
 
 
 I am using Solr 7.7.2
 

Re: Is there a way to search for "..." (three dots)?

2020-12-08 Thread Erick Erickson
Yes, but…

Odds are your analysis configuration for the field is removing the dots.

Go to the admin/analysis page, pick your field type and put examples in
the “index” and “query” boxes and you’ll see what I mean.

You need something like WhitespaceTokenizer, as your tokenizer,
and avoid things like WordDelimiter(Graph)FilterFactory.

You’ll find this is tricky though. For instance, if you index
“…something is here”, WhitespaceTokenizer will split this into
“…something”, “is”, “here” and you won’t be able to search for 
“something” since the _token_ is “…something”.

You could use one of the other tokenizers or use one of the
regular expression tokenizers.
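For example, a minimal field type along these lines (names are examples) keeps
punctuation-only tokens like “...” searchable:

<fieldType name="text_keep_punct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>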

Best,
Erick

> On Dec 8, 2020, at 5:56 AM, nettadalet  wrote:
> 
> Hi,
> I need to be able to search for "..." (three dots), meaning the query should
> be "..." and the search should return results that have "..." in their
> names.
> Is there a way to do it?
> Thanks in advance.
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: is there a way to trigger a notification when a document is deleted in solr

2020-12-07 Thread Erick Erickson
No, it’s marked “unresolved”….

> On Dec 7, 2020, at 9:22 AM, Pushkar Mishra  wrote:
> 
> Hi All
> https://issues.apache.org/jira/browse/SOLR-13609, was this ever fixed?
> 
> Regards
> 
> On Mon, Dec 7, 2020 at 6:32 PM Pushkar Mishra  wrote:
> 
>> Hi All,
>> 
>> Is there a way to trigger a notification when a document is deleted in
>> Solr? Or maybe when the auto-purge of deleted documents completes in Solr?
>> 
>> Thanks
>> 
>> --
>> Pushkar Kumar Mishra
>> "Reactions are always instinctive whereas responses are always well
>> thought of... So start responding rather than reacting in life"
>> 
>> 
> 
> -- 
> Pushkar Kumar Mishra
> "Reactions are always instinctive whereas responses are always well thought
> of... So start responding rather than reacting in life"



Re: Collection deleted still in zookeeper

2020-12-07 Thread Erick Erickson
What should happen when you delete a collection and _only_ that
collection references the configset has been discussed several
times, and… whatever is chosen is wrong ;)

1> if we delete the configset, then if you want to delete a collection
to ensure that you’re starting all over for whatever reason, your
configset is gone and you need to find it again.

2> If we _don’t_ delete the configset, then you can wind up with
obsolete configsets polluting Zookeeper…

3> If we make a copy of the configset every time we make a collection,
then there can be a bazillion of them in a large installation.

Best,
Erick

> On Dec 7, 2020, at 6:52 AM, Marisol Redondo 
>  wrote:
> 
> Thanks Erick for the answer, you gave me the clue to find the issue.
> 
> The real problem is that when I removed the collection using the solr API
> (http://solrintance:port/solr/admin/collections?action=DELETE&name=collectionname)
> the config files are not deleted. I don't know if this is the normal
> behavior in every version of solr (I'm using version 6), but I think when
> deleting the collection, the config files for this collection should be
> removed.
> 
> Anyway, I found that the configs were still in the UI/cloud/tree/configs
> and they can be removed using solr zk rm -r configs/myconfig, and this
> solved the issue.
> 
> Thanks
> 
> 
> 
> 
> 
> 
> On Fri, 4 Dec 2020 at 15:46, Erick Erickson  wrote:
> 
>> This is almost always a result of one of two things:
>> 
>> 1> you didn’t upload the config to the correct place or the ZK that Solr
>> uses.
>> or
>> 2> you still have a syntax problem in the config.
>> 
>> The solr.log file on the node that’s failing may have a more useful
>> error message about what’s wrong. Also, you can try validating the XML
>> with one of the online tools.
>> 
>> Are you totally and absolutely sure that, for instance, you’re uploading
>> to the correct Zookeeper? You should be able to look at the admin UI
>> screen and see the ZK address. I’ve seen this happen when people
>> inadvertently use the embedded ZK for one operation but not for the
>> other. Or have the ZK_HOST environment variable pointing to some
>> ZK ensemble that’s used when you start Solr but not when you upload
>> files. Or…
>> 
>> Use the admin UI>>cloud>>tree>>configs>>your_config_name
>> to see if the solrconfig has the correct changes. I’ll often add some
>> bogus comment in the early part of the file that I can use to make
>> sure I’ve uploaded the correct file to the correct place.
>> 
>> I use the "bin/solr zk upconfig” command to move files back and forth FWIW;
>> that avoids, say, putting an individual file in the wrong directory...
>> 
>> Best,
>> Erick
>> 
>>> On Dec 4, 2020, at 9:18 AM, Marisol Redondo <
>> marisol.redondo.gar...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> When trying to modify the config.xml file for a collection I made a
>> mistake
>>> and the config was wrong. So I removed the collection to create it again
>>> from a backend.
>>> But, although I'm sure I'm using a correct config.xml, solr is still
>>> complaining about the error in the older solrconfig.xml
>>> 
>>> I have tried to removed the collection more than once, I have stopped
>> solr
>>> and zookeeper and still having the same error. It's like zookeeper is
>> still
>>> storing the older solrconfig.xml and don't upload the configuration file
>>> from the new collection.
>>> 
>>> I have tried to
>>> - upload the files
>>> - remove the collection and create it again, but empty
>>> - restore the collection from the backup
>>> And I get always the same error:
>>>  collection_name_shard1_replica1:
>>> 
>> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
>>> Could not load conf for core collection_name_shard1_replica1: Error
>> loading
>>> solr config from solrconfig.xml
>>> 
>>> Thanks for your help
>> 
>> 



Re: Migrate Legacy Solr Cores to SolrCloud

2020-12-05 Thread Erick Erickson
First thing I’d do is run one of the examples to insure you have Zookeeper set 
up etc. You can create a collection that uses the default configset.

Once that’s done, start with ‘SOLR_HOME/solr/bin/solr zk upconfig’. There’s 
extensive help if you just type “bin/solr zk -help”. You give it the path to an 
existing config directory and a name for the configset in Zookeeper.

Once that’s done, you can create the collection, the admin UI drop-down will 
allow you to choose the configset. Now you have a collection.

To put data in that collection, it would be best to index the data again. If 
you can’t do that, you MUST have created the collection with exactly one shard, 
replicationFactor=1 (leader-only). Shut down Solr and copy your core’s data 
directory (the parent of the index directory) to “the right place”. You’ll 
overwrite an existing data directory with a name like 
collection1_shard1_replica_n1/data. Do _not_ copy the entire core directory up, 
_just_ recursively copy the “data” dir.

Now power Solr back up and you should be good. You can use the collections API 
ADDREPLICA command to build out your collection for HA/DR.
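Pulling those steps together, a rough sketch (names, paths and the ZK address
are examples, not your values):

bin/solr zk upconfig -n oldcore_conf -d /path/to/oldcore/conf -z zk1:2181,zk2:2181,zk3:2181
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=1&collection.configName=oldcore_conf'
# stop Solr, then copy just the old core's data directory over the new replica's
cp -r /path/to/oldcore/data/. /var/solr/data/mycollection_shard1_replica_n1/data/
# start Solr again and verify, then ADDREPLICA as needed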

NOTE: if by “existing” you mean an index created with Solr X-2 (i.e. Solr 6 or 
earlier and assuming you’re migrating to Solr 8) this will not work and you’ll 
have to re-index your data. This is not particular to SolrCloud, Lucene will 
refuse to open the index if it was created with any version of Solr earlier 
than the immediately prior major Solr release, i.e. if the index was _created_ 
with Solr 7, you can do the above if you’re moving to Solr 8. If you’re 
migrating to Solr 7, then if the old index was created with Solr 6 you’ll be 
ok….

Best,
Erick

> On Dec 4, 2020, at 3:07 PM, Jay Mandal  
> wrote:
> 
> Hello All,
> Can someone from the Solr/Lucene community please provide me the steps for 
> migrating an existing legacy Solr core (data and conf: managed-schema, 
> solrconfig.xml) to a SolrCloud configuration with collections and 
> shards, and tell me where to copy the existing files to reuse the data in the 
> Solr knowledge base?
> Thanks,
> Jayanta.
> 
> Regards,
> 
> Jayanta Mandal
> 4500 S Lakeshore Dr #620, Tempe, AZ 85282
> +1.602-900-1791 ext. 10134|Direct: +1.718-316-0384
> www.anjusoftware.com
> 
> 
> 
> 
> 



Re: What's the most efficient way to check if there are any matches for a query?

2020-12-05 Thread Erick Erickson
Have you looked at the Term Query Parser (_not_ the TermS Query Parser)
or Raw Query Parser? 

https://lucene.apache.org/solr/guide/8_4/other-parsers.html

NOTE: these perform _no_ analysis, so you have to give them the exact term...

These are pretty low level, and if they’re “fast enough” you won’t have to do
any work. You could do some Lucene-level coding I suspect to improve that,
depends on whether you think those are fast enough…
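For instance, an exact, unanalyzed term lookup that returns no rows (names are
placeholders; the -g keeps curl from mangling the braces):

curl -g 'http://localhost:8983/solr/mycollection/select?q={!term+f=myfield}somevalue&rows=0'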

Best,
Erick


> On Dec 5, 2020, at 5:04 AM, Colvin Cowie  wrote:
> 
> Hello,
> 
> I was just wondering. If I don't care about the number of matches for a
> query, let alone what the matches are, just that there is *at least 1*
> match for a query, what's the most efficient way to execute that query (on
> the /select handler)? (Using Solr 8.7)
> 
> As a general approach for a query is "rows=0&sort=id asc" the best I can
> do? Is there a more aggressive short circuit that will stop a searcher as
> soon as it finds a match?
> 
> For a specific case where the query is for a single exact term in an
> indexed field (with or without doc values) is there a different answer?
> 
> Thanks for any suggestions



Re: Collection deleted still in zookeeper

2020-12-04 Thread Erick Erickson
This is almost always a result of one of two things:

1> you didn’t upload the config to the correct place or the ZK that Solr uses.
or
2> you still have a syntax problem in the config.

The solr.log file on the node that’s failing may have a more useful
error message about what’s wrong. Also, you can try validating the XML
with one of the online tools.

Are you totally and absolutely sure that, for instance, you’re uploading
to the correct Zookeeper? You should be able to look at the admin UI
screen and see the ZK address. I’ve seen this happen when people 
inadvertently use the embedded ZK for one operation but not for the
other. Or have the ZK_HOST environment variable pointing to some
ZK ensemble that’s used when you start Solr but not when you upload
files. Or…

Use the admin UI>>cloud>>tree>>configs>>your_config_name
to see if the solrconfig has the correct changes. I’ll often add some
bogus comment in the early part of the file that I can use to make
sure I’ve uploaded the correct file to the correct place.

I use the "bin/solr zk upconfig” command to move files back and forth FWIW; that
avoids, say, putting an individual file in the wrong directory...

Best,
Erick

> On Dec 4, 2020, at 9:18 AM, Marisol Redondo 
>  wrote:
> 
> Hi,
> 
> When trying to modify the config.xml file for a collection I made a mistake
> and the config was wrong. So I removed the collection to create it again
> from a backup.
> But, although I'm sure I'm using a correct config.xml, Solr is still
> complaining about the error in the older solrconfig.xml.
> 
> I have tried to remove the collection more than once, and I have stopped Solr
> and ZooKeeper and still get the same error. It's like ZooKeeper is still
> storing the older solrconfig.xml and doesn't upload the configuration file
> from the new collection.
> 
> I have tried to
> - upload the files
> - remove the collection and create it again, but empty
> - restore the collection from the backup
> And I get always the same error:
>   collection_name_shard1_replica1:
> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
> Could not load conf for core collection_name_shard1_replica1: Error loading
> solr config from solrconfig.xml
> 
> Thanks for your help



Re: Solrj supporting term vector component ?

2020-12-04 Thread Erick Erickson
To expand on Shawn’s comment. There are a lot of built-in helper methods in 
SolrJ, but
they all amount to setting a value in the underlying map of params, which you 
can
do yourself for any parameter you could specify on a URL or cURL command.

For instance, SolrQuery.setStart(start) is just:

this.set(CommonParams.START, start);

and this.set just puts CommonParams.START, start into the underlying parameter 
map.

I’m simplifying some here since the helper methods do some safety checking and
the like, but the take-away is “anything you can set on a URL or specify
in a cURL command can be specified in SolrJ by setting the parameter 
explicitly”.
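For the term vector case, a minimal SolrJ sketch (the handler path, collection
and field names are assumptions based on the stock configs):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

public class TermVectorExample {
  @SuppressWarnings("unchecked")
  public static NamedList<Object> fetchTermVectors(SolrClient client) throws Exception {
    SolrQuery q = new SolrQuery("text:solr");
    q.setRequestHandler("/tvrh");   // a handler that includes the TermVectorComponent
    q.set("tv", "true");            // turn the component on
    q.set("tv.tf", "true");         // ask for term frequencies
    q.set("fl", "id");
    QueryResponse rsp = client.query("mycollection", q);
    // no dedicated TV objects in SolrJ, so rip apart the NamedList yourself
    return (NamedList<Object>) rsp.getResponse().get("termVectors");
  }
}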

Best,
Erick

> On Dec 3, 2020, at 1:24 PM, Shawn Heisey  wrote:
> 
> On 12/3/2020 10:20 AM, Deepu wrote:
>> I am planning to use Term vector component for one of the use cases, as per
>> below solr documentation link solrj not supporting Term Vector Component,
>> do you have any other suggestions to use TVC in java application?
>> https://lucene.apache.org/solr/guide/8_4/the-term-vector-component.html#solrj-and-the-term-vector-component
> 
> SolrJ will support just about any query you might care to send, you just have 
> to give it all the required parameters when building the request. All the 
> results will be available, though you'll almost certainly have to provide 
> code yourself that rips apart the NamedList into usable info.
> 
> What is being said in the documentation is that there are not any special 
> objects or methods for doing term vector queries.  It's not saying that it 
> can't be done.
> 
> Thanks,
> Shawn



Re: Solr8.7 - How to optmize my index ?

2020-12-03 Thread Erick Erickson
Dave:

Yeah, every time there’s generic advice, there are some situations where it’s not 
the best choice ;).

In your situation, you’re trading off some space savings for moving up to 450G 
all at once. Which sounds like it is worthwhile to you, although I’d check perf 
numbers sometime.

You may want to check out expungeDeletes. That will deal only with segments 
with more than 10% deleted docs, and may get you most all of the benefits of 
optimize without the problems. Specifically, let’s say you have a segment right 
at the limit (5G by default) that has exactly one deleted doc. Optimize will 
rewrite that, expungeDeletes will not. It’s an open question whether there’s 
any practical difference, ‘cause if all the segments in your index have > 10% 
deleted documents, they all get rewritten in either case….

And the mechanism for optimize changed pretty significantly in Solr 7.5, the 
short form is that before that the result was a single massive segment, whereas 
after that the default max segment size of 5G is respected by default (although 
you can force to one segment if you take explicit actions).

Here are two articles that explain it all:
Pre Solr 7.4: 
https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
Post Solr 7.4: 
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

Best,
Erick

> On Dec 2, 2020, at 11:05 PM, Dave  wrote:
> 
> I’m going to go against the advice SLIGHTLY; it really depends on how your 
> Solr server hosting is set up. If you’re 
> searching off the same Solr server you’re indexing to, yeah, don’t ever 
> optimize; it will take care of itself. People much smarter than us, like 
> Erick/Walter/Yonik, have spent time on this, and if they say don’t do it, don’t 
> do it.
> 
> In my particular use case I do see a measured improvement from optimizing 
> every three or four months. In my case a large portion, over 75%, of the 
> documents, which each measure around 500KB to 3MB, get reindexed every month, 
> as the fields in the documents change every month, while documents are added 
> daily as well. So when I can go from a 650GB index to a 450GB one once in 
> a while, it makes a difference if I only have 500GB of memory to work with on 
> the searchers and can fit all the segments straight into memory. Also, I use the 
> old setup of master/slave, so my indexing server, when it’s optimizing, has 
> no impact on the searching servers. Once the optimized index gets warmed 
> back up in the searcher I do notice improvement in my qtimes (I like to 
> think); however, I’ve been using the same integration process of occasional hard 
> optimizations since 1.4, and it might just be that I like to watch the index 
> inflate to three times its size and then shrivel up. Old habits die hard.
> 
>> On Dec 2, 2020, at 10:28 PM, Matheo Software  
>> wrote:
>> 
>> Hi Erick,
>> Hi Walter,
>> 
>> Thanks for these information,
>> 
>> I will study the Solr article you gave me seriously.
>> I thought it was important to always delete and optimize the collection.
>> 
>> More information concerning my collection:
>> Index size is about 390GB for 130M docs (3-5KB / doc), around 25 fields 
>> (indexed, stored).
>> Every Tuesday I update around 1M docs and every Thursday I add new 
>> docs (around 50,000).
>> 
>> Many thanks !
>> 
>> Regards,
>> Bruno
>> 
>> -Message d'origine-
>> De : Erick Erickson [mailto:erickerick...@gmail.com] 
>> Envoyé : mercredi 2 décembre 2020 14:07
>> À : solr-user@lucene.apache.org
>> Objet : Re: Solr8.7 - How to optmize my index ?
>> 
>> expungeDeletes is unnecessary, optimize is a superset of expungeDeletes.
>> The key difference is commit=true. I suspect if you’d waited until your 
>> indexing process added another doc and committed, you’d have seen the index 
>> size drop.
>> 
>> Just to check, you send the command to my_core but talk about collections.
>> Specifying the collection is sufficient, but I’ll assume that’s a typo and 
>> you’re really saying my_collection.
>> 
>> I agree with Walter like I always do, you shouldn’t be running optimize 
>> without some proof that it’s helping. About the only time I think it’s 
>> reasonable is when you have a static index, unless you can demonstrate 
>> improved performance. The optimize button was removed precisely because it 
>> was so tempting. In much earlier versions of Lucene, it made a demonstrable 
>> difference so was put front and center. In more recent versions of Solr 
>> optimize doesn’t help nearly as much so it was removed.
>> 
>> You say you have 38M deleted documents. How many docum

Re: ConcurrentUpdateSolrClient stall prevention bug in Solr 8.4+

2020-12-03 Thread Erick Erickson
Exactly _how_ are you indexing? In particular, how often are commits happening?

If you’re committing too often, Solr can block until some of the background 
merges are complete. This can happen particularly when you are doing hard 
commits in rapid succession, either through, say, committing from the client 
(which I recommend against in almost all cases) or having your <autoCommit> 
intervals set too short.

Your autocommit settings should be as long as your application can tolerate,
committing is expensive.
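For reference, the sort of thing I mean in solrconfig.xml (values are only
examples; make them as long as your application can stand):

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>300000</maxTime>
</autoSoftCommit>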

Here’s some background:
https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

The other possibility is if you have very long GC pauses, so I’d also
monitor the GC activity and see if you have stop-the-world GC
pauses exceeding 20 seconds coincident with this problem.

Best,
Erick

> On Dec 3, 2020, at 6:12 AM, Sebastian Lutter  
> wrote:
> 
> Hi!
> 
> I run a three-node Solr 8.5.1 cluster and experienced a bug when updating 
> the index (adding a document):
> 
> {
>   "responseHeader":{
> "rf":3,
> "status":500,
> "QTime":22938},
>   "error":{
> "msg":"Task queue processing has stalled for 20205 ms with 0 remaining 
> elements to process.",
> "trace":"java.io.IOException: Task queue processing has stalled for 20205 
> ms with 0 remaining elements to process.\n\tat 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.blockUntilFinished(ConcurrentUpdateHttp2SolrClient.java:501)\n\tat
>  
> org.apache.solr.update.StreamingSolrClients.blockUntilFinished(StreamingSolrClients.java:87)\n\tat
>  
> org.apache.solr.update.SolrCmdDistributor.blockAndDoRetries(SolrCmdDistributor.java:265)\n\tat
>  
> org.apache.solr.update.SolrCmdDistributor.distribCommit(SolrCmdDistributor.java:251)\n\tat
>  
> org.apache.solr.update.processor.DistributedZkUpdateProcessor.processCommit(DistributedZkUpdateProcessor.java:201)\n\tat
>  
> org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)\n\tat
>  
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:72)\n\tat
>  
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211)\n\tat
>  org.apache.solr.core.SolrCore.execute(SolrCore.java:2596)\n\tat 
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:802)\n\tat 
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:579)\n\tat 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:420)\n\tat
>  
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:352)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1596)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
>  
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:590)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1607)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1297)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)\n\tat
>  
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485)\n\tat
>  
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1577)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1212)\n\tat
>  
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
>  
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)\n\tat
>  
> org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
>  
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)\n\tat
>  
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
>  org.eclipse.jetty.server.Server.handle(Server.java:500)\n\tat 
> org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)\n\tat
>  org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)\n\tat 
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)\n\tat 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:270)\n\tat
>  
> org.eclipse.jetty.i

Re: chaining charFilter

2020-12-02 Thread Erick Erickson
Images are stripped by the mail server, so we can’t see the result.

I looked at master and the admin UI has problems, I just
raised a JIRA, see:
https://issues.apache.org/jira/browse/SOLR-15024

The _functionality_ is fine. If you go to the analysis page
and enter values, you’ll see the transformations work. Although
that screen doesn’t show the CharFilter transformations correctly,
the tokens at the end are chained.

Best,
Erick

> On Dec 2, 2020, at 9:18 AM, Arturas Mazeika  wrote:
> 
> Hi Solr-Team,
> 
> The manual of charfilters says that one can chain them: (from 
> https://lucene.apache.org/solr/guide/6_6/charfilterfactories.html#CharFilterFactories-solr.MappingCharFilterFactory):
> 
> CharFilters can be chained like Token Filters and placed in front of a 
> Tokenizer. CharFilters can add, change, or remove characters while preserving 
> the original character offsets to support features like highlighting.
> 
> I am trying to filter out some of the chars from some fields, so I can do 
> efficient and effective faceting later. I tried to chain charfilters for 
> that purpose:
> 
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(.*[/\\])([^/\\]+)$"   replacement="$2"/>
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([0-9\-]+)T([0-9\-]+)" replacement="$1 $2"/>
>     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="..." replacement=" "/>
>     <tokenizer class="..."/>
>   </analyzer>
> </fieldType>
> 
> <field name="..." type="..." indexed="true" stored="true"/>
> 
> but in schema definition I see only the last charfilter 
> 
> 
> Any clues why? 
> 
> Cheers,
> Arturas



Re: Solr8.7 - How to optmize my index ?

2020-12-02 Thread Erick Erickson
expungeDeletes is unnecessary, optimize is a superset of expungeDeletes.
The key difference is commit=true. I suspect if you’d waited until your
indexing process added another doc and committed, you’d have seen
the index size drop.

Just to check, you send the command to my_core but talk about collections.
Specifying the collection is sufficient, but I’ll assume that’s a typo and
you’re really saying my_collection.

I agree with Walter like I always do, you shouldn’t be running 
optimize without some proof that it’s helping. About the only time
I think it’s reasonable is when you have a static index, unless you can
demonstrate improved performance. The optimize button was
removed precisely because it was so tempting. In much earlier
versions of Lucene, it made a demonstrable difference so was put
front and center. In more recent versions of Solr optimize doesn’t
help nearly as much so it was removed.

You say you have 38M deleted documents. How many documents total? If this is
50% of your index, that’s one thing. If it’s 5%, it’s certainly not worth
the effort. You’re rewriting 466G of index, if you’re not seeing demonstrable
performance improvements, that’s a lot of wasted effort…

See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
and the linked article for what happens in pre 7.5 solr versions.

Best,
Erick

> On Dec 1, 2020, at 2:31 PM, Info MatheoSoftware  
> wrote:
> 
> Hi All,
> 
> 
> 
> I found the solution, I must do :
> 
> curl 'http://xxx:8983/solr/my_core/update?commit=true&expungeDeletes=true'
> 
> 
> 
> It works fine
> 
> 
> 
> Thanks,
> 
> Bruno
> 
> 
> 
> 
> 
> 
> 
> De : Matheo Software [mailto:i...@matheo-software.com]
> Envoyé : mardi 1 décembre 2020 13:28
> À : solr-user@lucene.apache.org
> Objet : Solr8.7 - How to optmize my index ?
> 
> 
> 
> Hi All,
> 
> 
> 
> With Solr5.4, I used the UI button but in Solr8.7 UI this button is missing.
> 
> 
> 
> So I decide to use the command line:
> 
> curl http://xxx:8983/solr/my_core/update?optimize=true
> 
> 
> 
> My collection my_core exists of course.
> 
> 
> 
> The answer of the command line is:
> 
> {
> 
>  "responseHeader":{
> 
>"status":0,
> 
>"QTime":18}
> 
> }
> 
> 
> 
> But nothing change.
> 
> I always have 38M deleted docs in my collection and the directory size does not 
> change as it did with Solr 5.4.
> 
> The size of the collection stay always at : 466.33Go
> 
> 
> 
> Could you tell me how can I purge deleted docs ?
> 
> 
> 
> Cordialement, Best Regards
> 
> Bruno Mannina
> 
>  www.matheo-software.com
> 
>  www.patent-pulse.com
> 
> Tél. +33 0 970 738 743
> 
> Mob. +33 0 634 421 817
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 



Re: Need help to configure automated deletion of shard in solr

2020-12-02 Thread Erick Erickson
You can certainly use the TTL logic. Note the TimeRoutedAlias, but
the DocExpirationUpdateFactory. DocExpirationUpdateFactory
operates on each document individually so you can mix-n-match
if you want.

As for knowing when a shard is empty, I suggested a method for that
in one of the earlier e-mails.

If you have a collection per customer, and assuming that a customer
has the same retention policy for all docs, then TimeRoutedAlias would
work.

Best,
Erick

> On Dec 2, 2020, at 12:19 AM, Pushkar Mishra  wrote:
> 
> Hi Erick,
> It is implicit.
> The TTL thing I have explored, but due to some complications we can't use it.
> Let me explain the actual use case.
> 
> We have limited space; we can't keep storing documents for an infinite
> time. So based on the customer's retention policy, I need to delete the
> documents. And in this process, if any shard gets empty, we need to delete
> the shard as well.
> 
> So let's say, is there a way to know when Solr completes the purging of
> deleted documents? Then based on that flag we can configure shard deletion.
> 
> Thanks
> Pushkar
> 
> On Tue, Dec 1, 2020 at 9:02 PM Erick Erickson 
> wrote:
> 
>> This is still confusing. You haven’t told us what router you are using,
>> compositeId or implicit?
>> 
>> If you’re using compositeId (the default), you will never have empty shards
>> because docs get assigned to shards via a hashing algorithm that
>> distributes
>> them very evenly across all available shards. You cannot delete any
>> shard when using compositeId as your routing method.
>> 
>> If you don’t know which router you’re using, then you’re using compositeId.
>> 
>> NOTE: for the rest, “documents” means non-deleted documents. Solr will
>> take care of purging the deleted documents automatically.
>> 
>> I think you’re making this much more difficult than you need to. Assuming
>> that the total number of documents remains relatively constant, you can
>> just
>> let Solr take care of it all and not bother with trying to individually
>> manage
>> shards by using the default compositeID routing.
>> 
>> If the number of docs increases you might need to use splitshard. But it
>> sounds like the total number of “live” documents isn’t going to increase.
>> 
>> For TTL, if you have a _fixed_ TTL, i.e. the docs should always expire
>> after,
>> say, 30 dayswhich it doesn’t sound like you do, you can use
>> the “Time Routed Alias” option, see:
>> https://lucene.apache.org/solr/guide/7_5/time-routed-aliases.html
>> 
>> Assuming your TTL isn’t a fixed-interval, you can configure
>> DocExpirationUpdateProcessorFactory to deal with TTL automatically.
>> 
>> And if you still think you need to handle this, you need to explain exactly
>> what problem you’re trying to solve because so far it appears that
>> you’re simply taking on way more work than you need to.
>> 
>> Best,
>> Erick
>> 
>>> On Dec 1, 2020, at 9:46 AM, Pushkar Mishra 
>> wrote:
>>> 
>>> Hi Team,
>>> As I explained the use case , can someone help me out to find out the
>>> configuration way to delete the shard here ?
>>> A quick response  will be greatly appreciated.
>>> 
>>> Regards
>>> Pushkar
>>> 
>>> 
>>> On Mon, Nov 30, 2020 at 11:32 PM Pushkar Mishra 
>>> wrote:
>>> 
>>>> 
>>>> 
>>>> On Mon, Nov 30, 2020, 9:15 PM Pushkar Mishra 
>>>> wrote:
>>>> 
>>>>> Hi Erick,
>>>>> First of all thanks for your response . I will check the possibility  .
>>>>> Let me explain my problem  in detail :
>>>>> 
>>>>> 1. We have other use cases where we are making use of listener on
>>>>> postCommit to delete/shift/split the shards . So we have capability to
>>>>> delete the shards .
>>>>> 2. The current use case is , where we have to delete the documents from
>>>>> the shard , and during deletion process(it will be scheduled process,
>> may
>>>>> be hourly or daily, which will delete the documents) , if shards  gets
>>>>> empty (or may be lets  say nominal documents are left ) , then delete
>> the
>>>>> shard.  And I am exploring to do this using configuration .
>>>>> 
>>>> 3. Also it will not be in live shard for sure as only those documents
>> are
>>>> deleted which have TTL got over . TTL could be a month or year.
>>>> 
>>>> Please assist if you have any config based i

Re: Can solr index replacement character

2020-12-01 Thread Erick Erickson
Solr handles UTF-8, so it should be able to. The problem you’ll have is
getting the UTF-8 characters to get through all the various transport
encodings, i.e. if you try to search from a browser, you need to encode
it so the browser passes it through. If you search through SolrJ, it needs
to be encoded at that level. If you use cURL, it needs another….
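For example, with cURL the character has to be percent-encoded as its UTF-8
bytes (field and collection names are placeholders):

curl 'http://localhost:8983/solr/mycollection/select?q=myfield:%EF%BF%BD'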

> On Dec 1, 2020, at 12:30 AM, Eran Buchnick  wrote:
> 
> Hi community,
> During integration tests with a new data source I have noticed a weird scenario
> where the replacement character can't be searched, though it seems to be stored.
> I mean, honestly, I don't want that irrelevant data stored in my index, but
> I wondered whether Solr can index the replacement character (U+FFFD �) as a string,
> and if so, how to search for it?
> And in general, is there any built-in char filtration?
> 
> Thanks



Re: Need help to configure automated deletion of shard in solr

2020-12-01 Thread Erick Erickson
This is still confusing. You haven’t told us what router you are using, 
compositeId or implicit?

If you’re using compositeId (the default), you will never have empty shards
because docs get assigned to shards via a hashing algorithm that distributes
them very evenly across all available shards. You cannot delete any
shard when using compositeId as your routing method.

If you don’t know which router you’re using, then you’re using compositeId.

NOTE: for the rest, “documents” means non-deleted documents. Solr will
take care of purging the deleted documents automatically.

I think you’re making this much more difficult than you need to. Assuming
that the total number of documents remains relatively constant, you can just
let Solr take care of it all and not bother with trying to individually manage
shards by using the default compositeID routing.

If the number of docs increases you might need to use splitshard. But it
sounds like the total number of “live” documents isn’t going to increase.

For TTL, if you have a _fixed_ TTL, i.e. the docs should always expire after,
say, 30 dayswhich it doesn’t sound like you do, you can use
the “Time Routed Alias” option, see:
https://lucene.apache.org/solr/guide/7_5/time-routed-aliases.html

Assuming your TTL isn’t a fixed-interval, you can configure
DocExpirationUpdateProcessorFactory to deal with TTL automatically.
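A sketch of that configuration (field names and the period are examples; check
the ref guide for the full set of options):

<updateRequestProcessorChain name="expire-by-ttl" default="true">
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">86400</int>
    <str name="ttlFieldName">ttl_s</str>
    <str name="expirationFieldName">expire_at_dt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>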

And if you still think you need to handle this, you need to explain exactly
what problem you’re trying to solve because so far it appears that 
you’re simply taking on way more work than you need to.

Best,
Erick

> On Dec 1, 2020, at 9:46 AM, Pushkar Mishra  wrote:
> 
> Hi Team,
> As I explained the use case , can someone help me out to find out the
> configuration way to delete the shard here ?
> A quick response  will be greatly appreciated.
> 
> Regards
> Pushkar
> 
> 
> On Mon, Nov 30, 2020 at 11:32 PM Pushkar Mishra 
> wrote:
> 
>> 
>> 
>> On Mon, Nov 30, 2020, 9:15 PM Pushkar Mishra 
>> wrote:
>> 
>>> Hi Erick,
>>> First of all thanks for your response . I will check the possibility  .
>>> Let me explain my problem  in detail :
>>> 
>>> 1. We have other use cases where we are making use of listener on
>>> postCommit to delete/shift/split the shards . So we have capability to
>>> delete the shards .
>>> 2. The current use case is , where we have to delete the documents from
>>> the shard , and during deletion process(it will be scheduled process, may
>>> be hourly or daily, which will delete the documents) , if shards  gets
>>> empty (or may be lets  say nominal documents are left ) , then delete the
>>> shard.  And I am exploring to do this using configuration .
>>> 
>> 3. Also it will not be in live shard for sure as only those documents are
>> deleted which have TTL got over . TTL could be a month or year.
>> 
>> Please assist if you have any config based idea on this
>> 
>>> Regards
>>> Pushkar
>>> 
>>> On Mon, Nov 30, 2020, 8:48 PM Erick Erickson 
>>> wrote:
>>> 
>>>> Are you using the implicit router? Otherwise you cannot delete a shard.
>>>> And you won’t have any shards that have zero documents anyway.
>>>> 
>>>> It’d be a little convoluted, but you could use the collections COLSTATUS
>>>> Api to
>>>> find the names of all your replicas. Then query _one_ replica of each
>>>> shard with something like
>>>> solr/collection1_shard1_replica_n1/q=*:*&distrib=false
>>>> 
>>>> that’ll return the number of live docs (i.e. non-deleted docs) and if
>>>> it’s zero
>>>> you can delete the shard.
>>>> 
>>>> But the implicit router requires you take complete control of where
>>>> documents
>>>> go, i.e. which shard they land on.
>>>> 
>>>> This really sounds like an XY problem. What’s the use  case you’re trying
>>>> to support where you expect a shard’s number of live docs to drop to
>>>> zero?
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Nov 30, 2020, at 4:57 AM, Pushkar Mishra 
>>>> wrote:
>>>>> 
>>>>> Hi Solr team,
>>>>> 
>>>>> I am using solr cloud.(version 8.5.x). I have a need to find out a
>>>>> configuration where I can delete a shard , when number of documents
>>>> reaches
>>>>> to zero in the shard , can some one help me out to achieve that ?
>>>>> 
>>>>> 
>>>>> It is urgent , so a quick response will be highly appreciated .
>>>>> 
>>>>> Thanks
>>>>> Pushkar
>>>>> 
>>>>> --
>>>>> Pushkar Kumar Mishra
>>>>> "Reactions are always instinctive whereas responses are always well
>>>> thought
>>>>> of... So start responding rather than reacting in life"
>>>> 
>>>> 
> 
> -- 
> Pushkar Kumar Mishra
> "Reactions are always instinctive whereas responses are always well thought
> of... So start responding rather than reacting in life"



Re: Need help to configure automated deletion of shard in solr

2020-11-30 Thread Erick Erickson
Are you using the implicit router? Otherwise you cannot delete a shard.
And you won’t have any shards that have zero documents anyway.

It’d be a little convoluted, but you could use the collections COLSTATUS Api to
find the names of all your replicas. Then query _one_ replica of each
shard with something like
solr/collection1_shard1_replica_n1/q=*:*&distrib=false

that’ll return the number of live docs (i.e. non-deleted docs) and if it’s zero
you can delete the shard.
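i.e. something like (replica name and host are examples):

curl 'http://host:8983/solr/collection1_shard1_replica_n1/select?q=*:*&rows=0&distrib=false'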

But the implicit router requires you take complete control of where documents
go, i.e. which shard they land on.

This really sounds like an XY problem. What’s the use  case you’re trying
to support where you expect a shard’s number of live docs to drop to zero?

Best,
Erick

> On Nov 30, 2020, at 4:57 AM, Pushkar Mishra  wrote:
> 
> Hi Solr team,
> 
> I am using solr cloud.(version 8.5.x). I have a need to find out a
> configuration where I can delete a shard , when number of documents reaches
> to zero in the shard , can some one help me out to achieve that ?
> 
> 
> It is urgent , so a quick response will be highly appreciated .
> 
> Thanks
> Pushkar
> 
> -- 
> Pushkar Kumar Mishra
> "Reactions are always instinctive whereas responses are always well thought
> of... So start responding rather than reacting in life"



Re: write.lock file after unloading core

2020-11-30 Thread Erick Erickson
I’m a little confused here. Are you unloading/copying/creating the core on 
master?
I’ll assume so since I can’t really think of how doing this on one of the other
cores would make sense…..

I’m having a hard time wrapping my head around the use-case. You’re 
“delivering a new index”, which I take to mean you’re building a completely new
index somewhere else.

But you’re also updating the target index. What’s the relationship between the
index you’re “delivering” and the update sent while the core is unloaded? Are
the updates _already_ in the index you’re delivering or would you expect them
to be in the new index? Or are they just lost? Or does the indexing program
resend them after the core is created?

The unloaded core should not have any open index writers though. What I’m 
guessing is that updates are coming in before the unload is complete. Instead
of a sleep, have you tried specifying the async parameter and waiting until
REQUESTSTATUS tells you the unload is complete?
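
Something like this, with a made-up core name and request id:

curl "http://localhost:8983/solr/admin/cores?action=UNLOAD&core=mycore&deleteDataDir=true&async=unload-1"

# poll until it reports the unload has completed
curl "http://localhost:8983/solr/admin/cores?action=REQUESTSTATUS&requestid=unload-1"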

Best,
Erick

> On Nov 30, 2020, at 7:41 AM, elisabeth benoit  
> wrote:
> 
> Hello all,
> 
> We are using solr 7.3.1, with master and slave config.
> 
> When we deliver a new index we unload the core, with option delete data dir
> = true, then recreate the data folder and copy the new index files into
> that folder before sending solr a command to recreate the core (with the
> same name).
> 
> But we have, at the same time, some batches indexing non stop the core we
> just unloaded, and it happens quite frequently that we have an error at
> this point, the copy cannot be done, and I guess it is because of a
> write.lock file created by a solr index writer in the index directory.
> 
> Is it possible, when unloading the core, to stop / kill index writer? I've
> tried including a sleep after the unload and before recreation of the index
> folder, it seems to work but I was wondering if a better solution exists.
> 
> Best regards,
> Elisabeth



Re: data import handler deprecated?

2020-11-29 Thread Erick Erickson
If you like Java instead of Python, here’s a skeletal program:

https://lucidworks.com/post/indexing-with-solrj/

It’s simple and single-threaded, but could serve as a basis for
something along the lines that Walter suggests.

And I absolutely agree with Walter that the DB is often where
the bottleneck lies. You might be able to
use multiple threads and/or processes to query the
DB if that’s the case and you can find some kind of partition
key.

You also might (and it depends on the Solr version) be able,
to wrap a jdbc stream in an update decorator.

https://lucene.apache.org/solr/guide/8_0/stream-source-reference.html

https://lucene.apache.org/solr/guide/8_0/stream-decorator-reference.html
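
Untested sketch, with placeholder connection/SQL details (the sort parameter has
to match the ORDER BY, and the JDBC driver jar has to be on Solr’s classpath):

commit(techproducts,
  update(techproducts, batchSize=500,
    jdbc(connection="jdbc:mysql://localhost/db?user=u&password=p",
         sql="SELECT id, name_s, price_f FROM products ORDER BY id",
         sort="id asc",
         driver="com.mysql.jdbc.Driver")))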

Best,
Erick

> On Nov 29, 2020, at 3:04 AM, Walter Underwood  wrote:
> 
> I recommend building an outboard loader, like I did a dozen years ago for
> Solr 1.3 (before DIH) and did again recently. I’m glad to send you my Python
> program, though it reads from a JSONL file, not a database.
> 
> Run a loop fetching records from a database. Put each record into a 
> synchronized
> (thread-safe) queue. Run multiple worker threads, each pulling records from 
> the
> queue, batching them up, and sending them to Solr. For maximum indexing speed
> (at the expense of query performance), count the number of CPUs per shard 
> leader
> and run two worker threads per CPU.
> 
> Adjust the batch size to be maybe 10k to 50k bytes. That might be 20 to 1000 
> documents, depending on the content.
> 
> With this setup, your database will probably be your bottleneck. I’ve had this
> index a million (small) documents per minute to a multi-shard cluster, from a 
> JSONL
> file on local disk.
> 
> Also, don’t worry about finding the leaders and sending the right document to
> the right shard. I just throw the batches at the load balancer and let Solr 
> figure
> it out. That is super simple and amazingly fast.
> 
> If you are doing big batches, building a dumb ETL system with JSONL files in 
> Amazon S3 has some real advantages. It allows loading prod data into a test
> cluster for load benchmarks, for example. Also good for disaster recovery, 
> just
> load the recent batches from S3. Want to know exactly which documents were
> in the index in October? Look at the batches in S3.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Nov 28, 2020, at 6:23 PM, matthew sporleder  wrote:
>> 
>> I went through the same stages of grief that you are about to start
>> but (luckily?) my core dataset grew some weird cousins and we ended up
>> writing our own indexer to join them all together/do partial
>> updates/other stuff beyond DIH.  It's not difficult to upload docs but
>> is definitely slower so far.  I think there is a bit of a 'clean core'
>> focus going on in solr-land right now and DIH is easy(!) but it's also
>> easy to hit its limits (atomic/partial updates?  wtf is an "entity?"
>> etc) so anyway try to be happy that you are aware of it now.
>> 
>> On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk  
>> wrote:
>>> 
>>> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>>> 
 ...  The bottom of
 that github page isn't hopeful however :)
>>> 
>>> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
>>> JAR" :)
>>> 
>>> It's a more general queston though, what is the path forward for users
>>> who with data in two places? Hope that a community-maintained plugin
>>> will still be there tomorrow? Dump our tables to CSV (and POST them) and
>>> roll our own delta-updates logic? Or are we to choose one datastore and
>>> drop the other?
>>> 
>>> Dima
> 



Re: Query generation is different for search terms with and without "-"

2020-11-25 Thread Erick Erickson
Parameters, no. You could use a PatternReplaceCharFilterFactory. NOTE:

*FilterFactory are _not_ what you want in this case, they are applied to 
individual tokens after parsing

*CharFilterFactory are invoked on the entire input to the field, although I 
can’t say for certain that even that’s early enough.
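
For illustration, the charfilter version would look something like this in the
query analyzer (untested, and as I said it may still be too late to change how
the parser splits the input into clauses):

<analyzer type="query">
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=" "/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>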

There are two other options to consider:
StatelessScriptUpdateProcessor
FieldMutatingUpdateProcessor

Stateless... is probably easiest…

Best,
Erick

> On Nov 24, 2020, at 1:44 PM, Samuel Gutierrez 
>  wrote:
> 
> Are there any good workarounds/parameters we can use to fix this so it
> doesn't have to be solved client side?
> 
> On Tue, Nov 24, 2020 at 7:50 AM matthew sporleder 
> wrote:
> 
>> Is the normal/standard solution here to regex remove the '-'s and
>> combine them into a single token?
>> 
>> On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson 
>> wrote:
>>> 
>>> This is a common point of confusion. There are two phases for creating a
>> query,
>>> query _parsing_ first, then the analysis chain for the parsed result.
>>> 
>>> So what e-dismax sees in the two cases is:
>>> 
>>> Name_enUS:“high tech” -> two tokens, since there are two of them pf2
>> comes into play.
>>> 
>>> Name_enUS:“high-tech” -> there’s only one token so pf2 doesn’t apply,
>> splitting it on the hyphen comes later.
>>> 
>>> It’s especially confusing since the field analysis then breaks up
>> “high-tech” into two tokens that
>>> look the same as “high tech” in the debug response, just without the
>> phrase query.
>>> 
>>> Name_enUS:high
>>> Name_enUS:tech
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez <
>> samuel.gutier...@iherb.com.INVALID> wrote:
>>>> 
>>>> I am troubleshooting an issue with ranking for search terms that
>> contain a
>>>> "-" vs the same query that does not contain the dash e.g. "high-tech"
>> vs
>>>> "high tech". The field that I am querying is using the standard
>> tokenizer,
>>>> so I would expect that the underlying lucene query should be the same
>> for
>>>> both versions of the query, however when printing the debug, it appears
>>>> they are generated differently. I know "-" must be escaped as it has
>>>> special meaning in lucene, however escaping does not fix the problem.
>> It
>>>> appears that with the "-" present, the pf2 edismax parameter is not
>>>> respected and omitted from the final query. We use sow=false as we have
>>>> multiterm synonyms and need to ensure they are included in the final
>> lucene
>>>> query. My expectation is that the final underlying lucene query should
>> be
>>>> based on the output  of the field analyzer, however after briefly
>> looking
>>>> at the code for ExtendedDismaxQParser, it appears that there is some
>> string
>>>> processing happening outside of the analysis step which causes the
>>>> unexpected lucene query.
>>>> 
>>>> 
>>>> Solr Debug for "high tech":
>>>> 
>>>> parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
>>>> DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
>>>> parsedquery_toString: "+(((Name_enUS:high)~0.4
>>>> (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
>>>> (Name_enUS:"high tech"~4)~0.4",
>>>> 
>>>> 
>>>> Solr Debug for "high-tech"
>>>> 
>>>> parsedquery: "+DisjunctionMaxQueryName_enUS:high
>>>> Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high
>>>> tech"~5)~0.4)",
>>>> parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
>>>> (Name_enUS:"high tech"~5)~0.4"
>>>> 
>>>> SolrConfig:
>>>> 
>>>> 
>>>>   
>>>> true
>>>> true
>>>> json
>>>> 3<75%
>>>> Name_enUS
>>>> Name_enUS
>>>> 5
>>>> Name_enUS
>>>> 4   
>>>> 3
>>>> 0.4
>>>> explicit
>>>> 100
>>>> false
>>>>   
>>>>   
>>>> edismax
>>>>   
>>>> 
>>>> 
>>>> Schema:
>>>> 
>>>> > positionIncrementGap="100">
>>>> 
>>>>   
>>>>   
>>>>   
>>>>   
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Using Solr 8.6.3
>>>> 
>> 
> 



Re: Query generation is different for search terms with and without "-"

2020-11-24 Thread Erick Erickson
This is a common point of confusion. There are two phases for creating a query,
query _parsing_ first, then the analysis chain for the parsed result.

So what e-dismax sees in the two cases is:

Name_enUS:“high tech” -> two tokens, since there are two of them pf2 comes into 
play.

Name_enUS:“high-tech” -> there’s only one token so pf2 doesn’t apply, splitting 
it on the hyphen comes later.

It’s especially confusing since the field analysis then breaks up “high-tech” 
into two tokens that
look the same as “high tech” in the debug response, just without the phrase 
query.

Name_enUS:high
Name_enUS:tech

Best,
Erick

> On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez 
>  wrote:
> 
> I am troubleshooting an issue with ranking for search terms that contain a
> "-" vs the same query that does not contain the dash e.g. "high-tech" vs
> "high tech". The field that I am querying is using the standard tokenizer,
> so I would expect that the underlying lucene query should be the same for
> both versions of the query, however when printing the debug, it appears
> they are generated differently. I know "-" must be escaped as it has
> special meaning in lucene, however escaping does not fix the problem. It
> appears that with the "-" present, the pf2 edismax parameter is not
> respected and omitted from the final query. We use sow=false as we have
> multiterm synonyms and need to ensure they are included in the final lucene
> query. My expectation is that the final underlying lucene query should be
> based on the output  of the field analyzer, however after briefly looking
> at the code for ExtendedDismaxQParser, it appears that there is some string
> processing happening outside of the analysis step which causes the
> unexpected lucene query.
> 
> 
> Solr Debug for "high tech":
> 
> parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
> DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
> DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
> DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
> parsedquery_toString: "+(((Name_enUS:high)~0.4
> (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
> (Name_enUS:"high tech"~4)~0.4",
> 
> 
> Solr Debug for "high-tech"
> 
> parsedquery: "+DisjunctionMaxQueryName_enUS:high
> Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high
> tech"~5)~0.4)",
> parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
> (Name_enUS:"high tech"~5)~0.4"
> 
> SolrConfig:
> 
>  
>
>  true
>  true
>  json
>  3<75%
>  Name_enUS
>  Name_enUS
>  5
>  Name_enUS
>  4   
>  3
>  0.4
>  explicit
>  100
>  false
>
>
>  edismax
>
>  
> 
> Schema:
> 
>  
>  
>
>
>
>
>  
>  
> 
> 
> Using Solr 8.6.3
> 



Re: Atomic update wrongly deletes child documents

2020-11-24 Thread Erick Erickson
Sure, raise a JIRA. Thanks for the update...

> On Nov 24, 2020, at 4:12 AM, Andreas Hubold  
> wrote:
> 
> Hi,
> 
> I was able to work around the issue. I'm now using a custom
> UpdateRequestProcessor that removes undefined fields, so that I was able to
> remove the catch-all dynamic field "ignored" from my schema.. Of course, one
> has to be careful to not remove fields that are used for nested documents in
> the URP.
> 
> I think it would still make sense to fix the original issue, or at least
> document it as caveat. I'm going to create a JIRA ticket for this soon, if
> that's okay.
> 
> Regards,
> Andreas
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Unloading and loading a Collection in SolrCloud with external Zookeeper ensemble

2020-11-15 Thread Erick Erickson
I don’t really have any good alternatives. There’s an open JIRA for
this, see: SOLR-6399

This would be a pretty big chunk of work, which is one of the reasons
this JIRA has languished…

Sorry I can’t be more helpful,
Erick

> On Nov 15, 2020, at 11:00 AM, Gajanan  wrote:
> 
> Hi Erick, thanks for the reply.
> I am working on an application where a Solr collection is created per
> usage of the application, accumulating a lot of them over a period of time. In order
> to keep memory requirements under control, I am unloading collections not in
> current usage and loading them whenever required. 
> This was working in non-cloud mode with the coreAdmin APIs. Now, because of
> scaling requirements, we want to shift to SolrCloud mode. We want to continue
> with the same application design. Can you suggest how to implement a similar
> solution in the SolrCloud context?
> 
> -Gajanan
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Why am I able to sort on a multiValued field?

2020-11-14 Thread Erick Erickson
From the “Common Query Paramters” (sort) section of the ref guide:

"In the case of primitive fields, or SortableTextFields, that are 
multiValued="true" the representative value used for each doc when sorting 
depends on the sort direction: The minimum value in each document is used for 
ascending (asc) sorting, while the maximal value in each document is used for 
descending (desc) sorting.”

Best,
Erick

> On Nov 13, 2020, at 4:36 PM, Andy C  wrote:
> 
> I am adding a new float field to my index that I want to perform range
> searches and sorting on. It will only contain a single value.
> 
> I have an existing dynamic field definition in my schema.xml that I wanted
> to use to avoid having to updating the schema:
> 
>
> stored="true" multiValued="true"/>
> 
> I went ahead and implemented this in a test system (recently updated to
> Solr 8.7), but then it occurred to me that I am not going to be able to
> sort on the field because it is defined as multiValued.
> 
> But to my surprise sorting worked, and gave the expected results. Why? Can
> this behavior be relied on in future releases?
> 
> Appreciate any insights.
> 
> Thanks
> - AndyC -



Re: Unloading and loading a Collection in SolrCloud with external Zookeeper ensemble

2020-11-12 Thread Erick Erickson
As stated in the docs, using the core admin API when using SolrCloud is not recommended,
for just such reasons as this. While SolrCloud _does_ use the Core Admin API, its usage
has to be very precise.

You apparently didn’t heed this warning in the UNLOAD command for the 
collections API:

"Unloading all cores in a SolrCloud collection causes the removal of that 
collection’s metadata from ZooKeeper.”

This latter is what the “non legacy mode…” message is about. In earlier 
versions of Solr,
the ZK information was recreated when Solr found a core.properties file, but 
that had
its own problems so was removed.

Your best bet now is to wipe your directories, create a new collection and 
re-index.

If you absolutely can’t reindex:
0> save away one index directory from every shard, it doesn’t matter which.
1> create the collection, with the exact same number of shards and a 
replicationFactor of 1
2> shut down all the Solr instances
3> copy the index directory from <0> to ’the right place”. For instance, if you
have a collection blah, you’ll have some directory like 
blah_shard1_replica_n1/data/index.
It’s critical that you replace the contents of data/index with the contents 
of the
directory saved in <0> from the _same_ shard, shard1 in this example.
4> start your Solr instances back up
5> use ADDREPLICA to build out the collection to have as many replicas as you 
need.
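
For step 5, that’s one ADDREPLICA call per shard, something like (names are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=blah&shard=shard1&node=host2:8983_solr"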

Good luck!
Erick


> On Nov 12, 2020, at 6:32 AM, Gajanan  wrote:
> 
> I have unloaded all cores of a collection in SolrCloud (8.x.x) using
> coreAdmin APIs, as UNLOAD of a collection is not available in the collections API. Now
> I want to reload the unloaded collection using APIs only. 
> When trying with coreAdmin APIs I am getting "Non legacy mode CoreNodeName
> not found." 
> When trying with collections APIs it is reloaded but shows no cores
> available.
> 
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Using Multiple collections with streaming expressions

2020-11-10 Thread Erick Erickson
You need to open multiple streams, one to each collection then combine them. 
For instance,
open a significantTerms stream to collection1, another to collection2 and wrap 
both
in a merge stream.
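
Untested sketch; I’m assuming significantTerms emits a "term" field you can sort
and merge on, so adjust to the field names you actually get back (and note merge
won’t combine a term that shows up in both collections):

merge(
  sort(significantTerms(collection1, q="body:Solr", field="author", limit="50"), by="term asc"),
  sort(significantTerms(collection2, q="body:Solr", field="author", limit="50"), by="term asc"),
  on="term asc")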

Best,
Erick

> On Nov 9, 2020, at 1:58 PM, ufuk yılmaz  wrote:
> 
> For example the streaming expression significantTerms:
> 
> https://lucene.apache.org/solr/guide/8_4/stream-source-reference.html#significantterms
> 
> 
> significantTerms(collection1,
> q="body:Solr",
> field="author",
> limit="50",
> minDocFreq="10",
> maxDocFreq=".20",
> minTermLength="5")
> 
> Solr supports querying multiple collections at once, but I can’t figure  out 
> how I can do that with streaming expressions.
> When I try enclosing them in quotes like:
> 
> significantTerms(“collection1, collection2”,
> q="body:Solr",
> field="author",
> limit="50",
> minDocFreq="10",
> maxDocFreq=".20",
> minTermLength="5")
> 
> It gives the error: "EXCEPTION":"java.io.IOException: Slices not found for \" 
> collection1, collection2\""
> I think Solr thinks quotes as part of the collection names, hence it can’t 
> find slices for it.
> 
> When I just use it without quotes:
> significantTerms(collection1, collection2,…
> It gives the error: "EXCEPTION":"invalid expression 
> significantTerms(collection1, collection2, …
> 
> I tried single quotes, escaping the quotation mark but nothing Works…
> 
> Any ideas?
> 
> Best, ufuk
> 
> Windows 10 için Posta ile gönderildi
> 



Re: SolrCloud shows cluster still healthy even the node data directory is deleted

2020-11-09 Thread Erick Erickson
Depends. *nix systems have delete-on-close semantics, that is as
long as there’s a single file handle open, the file will still be
available to the process using it. Only when the last file handle is
closed will the file actually be deleted.

Solr (Lucene actually) has  file handle open to every file in the index
all the time.

These files aren’t visible when you do a directory listing. So if you
stop Solr, are the files gone? NOTE: When you start Solr again, if
there are existing replicas that are healthy then the entire index
should be copied from another replica….
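
On Linux you can see the handles Solr is still holding with something like
(the pid is a placeholder):

lsof -p <solr-pid> | grep '(deleted)'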

Best,
Erick

> On Nov 9, 2020, at 3:30 AM, Amy Bai  wrote:
> 
> Hi community,
> 
> I found that SolrCloud won't check the IO status if the SolrCloud process is 
> alive.
> E.g. if I delete the SolrCloud data directory, there are no errors reported, 
> and I can still log in to the SolrCloud   Admin UI to create/query 
> collections.
> Is this reasonable?
> Can someone explain why SOLR handles it like this?
> Thanks so much.
> 
> 
> Regards,
> Amy



Re: count mismatch with and without sort param

2020-11-05 Thread Erick Erickson
You need to provide examples in order for anyone to try to help you. Include
1> example docs (just the relevant field(s) will do)
2> your actual output
3> your expected output
4> the query you use.

Best,
Erick

> On Nov 5, 2020, at 10:56 AM, Raveendra Yerraguntla 
>  wrote:
> 
> Hello
> my date query returns mismatch with and without sort parameter. The query 
> parameter is on date field, on which records are inserted few hours ago.
> The sort parameter is different from the query parameter, which is the expected 
> count.
> Any clues as to how it could be different?
> I am planning to do a commit (one more time) but after looking for clues. Any 
> pointers are appreciated,
> Thanks Ravi   



Re: Solr 8.1.1 installation in Azure App service

2020-11-05 Thread Erick Erickson
I _strongly_ recommend you use the collections API CREATE command
rather than try what you’re describing.
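
I.e. something along these lines, with placeholder names:

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycore&numShards=2&replicationFactor=1&collection.configName=myconf"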

You’re trying to mix manually creating core.properties 
files, which was the process for stand-alone Solr, with SolrCloud
and hoping that it somehow gets propagated to Zookeeper. This has
sometimes kind of worked in the past, but by luck rather than design.

This has never been a supported/recommended way of creating
collections. Even the lucky part has been removed in Solr 9.

The fact that you copy the config files to your folder is another red flag.
Configuration files live in Zookeeper as configsets and should never be
copied locally.

If this process is recommended by sitecore, then I suggest you contact
their support as I am certain this won’t be supported by Solr going
forward.

Best,
Erick

> On Nov 5, 2020, at 2:28 AM, Shawn Heisey  wrote:
> 
> On 11/3/2020 11:49 PM, Narayanan, Bhagyasree wrote:
>> Steps we followed for creating Solr App service:
>> 1. Created a blank sitecore 9.3 solution from Azure market place and
>>created a Web app for Solr.
>> 2. Unzipped the Solr 8.1.1 package and copied all the contents to
>>wwwroot folder of the Web app created for Solr using WinSCP/FTP.
>> 3. Created a new Solr core by creating a new folder {index folder} and
>>copied 'conf' from the "/site/wwwroot/server/solr/configsets/_default".
>> 4. Created a core.properties file with numShards=2 & name={index folder}
> 
> Can you give us the precise locations of all core.properties files that you 
> have and ALL of the contents of those files?  There should not be any 
> sensitive information in them -- no passwords or anything like that.
> 
> It would also be helpful to see the entire solr.log file, taken shortly after 
> Solr starts.  The error will have much more detail than you shared in your 
> previous message.
> 
> This mailing list eats attachments.  So for the logfile, you'll need to post 
> the file to a filesharing service and give us a URL.  Dropbox is an example 
> of this.  For the core.properties files, which are not very long, it will 
> probably be best if you paste the entire contents into your email reply.  If 
> you attach files to your email, we won't be able to see them.
> 
> Thanks,
> Shawn



Re: Solr migration related issues.

2020-11-05 Thread Erick Erickson
Oh dear.

You made me look at the reference guide. Ouch. 

We have this nice page “Defining core.properties” that talks about defining 
core.properties. Unfortunately it has _no_ warning about the fact that trying 
to use this in SolrCloud is a really bad idea. As in “don’t do it”. To make 
matters worse, as you have found out painfully, it kinda worked in cloud mode 
in times past.

Then the collections API doc says you can add property.name=value, with no 
mention that the name here should NOT be any of the properties necessary for 
SolrCloud to operate.

The problem here is that adding property.name=value would set that property to 
the _same_ value in all cores. Naming all the replicas of a collection the 
same thing is not supported officially; if it works in older Solrs, that was by 
chance, not design. And there’s no special provision for just using that 
property as a prefix; that’s really designed for custom properties. And, by the 
way, “name” is really kind of a no-op, the thing displayed in the drop-down is 
taken from Zookeeper’s node_name. Please don’t try to name that.

I very strongly recommend that you stop trying to do this. Whatever you are 
doing that requires a specific name, I’d change _that_ process to use the names 
assigned by Solr. If it’s just for aesthetics, there’s really no good way to 
change what’s in the drop-down.

Best,
Erick

> On Nov 5, 2020, at 5:25 AM, Modassar Ather  wrote:
> 
> Hi Shawn,
> 
> I understand that we do not need to modify the core.properties and use the
> APIs to create core and collection and that is what I am doing now.
> This question of naming the core as per the choice comes from our older
> setup where we have 12 shards, a collection and core both named the same
> and the core were discovered by core.properties with entries as mentioned
> in my previous mail.
> 
> Thanks for the responses. I will continue with the new collection and core
> created by the APIs and test our indexing and queries.
> 
> Best,
> Modassar
> 
> 
> On Thu, Nov 5, 2020 at 12:58 PM Shawn Heisey  wrote:
> 
>> On 11/4/2020 9:32 PM, Modassar Ather wrote:
>>> Another thing: how can I control the core naming? I want the core name to
>>> be *mycore* instead of *mycore**_shard1_replica_n1*/*mycore*
>>> *_shard2_replica_n2*.
>>> I tried setting it using property.name=*mycore* but it did not work.
>>> What can I do to achieve this? I am not able to find any config option.
>> 
>> Why would you need to this or even want to?  It sounds to me like an XY
>> problem.
>> 
>> http://xyproblem.info/
>> 
>>> I understand the core.properties file is required for core discovery but
>>> when this file is present under a subdirectory of SOLR_HOME I see it not
>>> getting loaded and not available in Solr dashboard.
>> 
>> You should not be trying to manipulate core.properties files yourself.
>> This is especially discouraged when Solr is running in cloud mode.
>> 
>> When you're in cloud mode, the collection information in zookeeper will
>> always be consulted during core discovery.  If the found core is NOT
>> described in zookeeper, it will not be loaded.  And in any recent Solr
>> version when running in cloud mode, a core that is not referenced in ZK
>> will be entirely deleted.
>> 
>> Thanks,
>> Shawn
>> 



Re: Solr migration related issues.

2020-11-04 Thread Erick Erickson
inline

> On Nov 4, 2020, at 2:17 AM, Modassar Ather  wrote:
> 
> Thanks Erick and Ilan.
> 
> I am using APIs to create core and collection and have removed all the
> entries from core.properties. Currently I am facing init failure and
> debugging it.
> Will write back if I am facing any issues.
> 

If that means you still _have_ a core.properties file and it’s empty, that won’t
work.

When Solr starts, it goes through “core discovery”. Starting at SOLR_HOME it
recursively descends the directories and whenever it finds a “core.properties”
file says “aha! There’s a replica here. I'll go tell Zookeeper who I am and 
that 
I'm open for business”. It uses the values in core.properties to know what 
collection and shard it belongs to and which replica of that shard it is.

Incidentally, core discovery stops descending and moves to the next sibling
directory when it hits the first core.properties file so you can’t have a 
replica
underneath another replica in your directory tree.

You’ll save yourself a lot of grief if you start with an empty SOLR_HOME (except
for solr.xml if you haven’t put it in Zookeeper. BTW, I’d recommend you do put
solr.xml in Zookeeper!).

Best,
Erick


> Best,
> Modassar
> 
> On Wed, Nov 4, 2020 at 3:20 AM Erick Erickson 
> wrote:
> 
>> Do note, though, that the default value for legacyCloud changed from
>> true to false so even though you can get it to work by setting
>> this cluster prop I wouldn’t…
>> 
>> The change in the default value is why it’s failing for you.
>> 
>> 
>>> On Nov 3, 2020, at 11:20 AM, Ilan Ginzburg  wrote:
>>> 
>>> I second Erick's recommendation, but just for the record legacyCloud was
>>> removed in (upcoming) Solr 9 and is still available in Solr 8.x. Most
>>> likely this explains Modassar why you found it in the documentation.
>>> 
>>> Ilan
>>> 
>>> 
>>> On Tue, Nov 3, 2020 at 5:11 PM Erick Erickson 
>>> wrote:
>>> 
>>>> You absolutely need core.properties files. It’s just that they
>>>> should be considered an “implementation detail” that you
>>>> should rarely, if ever need to be aware of.
>>>> 
>>>> Scripting manual creation of core.properties files in order
>>>> to define your collections has never been officially supported, it
>>>> just happened to work.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Nov 3, 2020, at 11:06 AM, Modassar Ather 
>>>> wrote:
>>>>> 
>>>>> Thanks Erick for your response.
>>>>> 
>>>>> I will certainly use the APIs and not rely on the core.properties. I
>> was
>>>>> going through the documentation on core.properties and found it to be
>>>> still
>>>>> there.
>>>>> I have all the solr install scripts based on older Solr versions and
>>>> wanted
>>>>> to re-use the same as the core.properties way is still available.
>>>>> 
>>>>> So does this mean that we do not need core.properties anymore?
>>>>> How can we ensure that the core name is configurable and not
>> dynamically
>>>>> set?
>>>>> 
>>>>> I will try to use the APIs to create the collection as well as the
>> cores.
>>>>> 
>>>>> Best,
>>>>> Modassar
>>>>> 
>>>>> On Tue, Nov 3, 2020 at 5:55 PM Erick Erickson >> 
>>>>> wrote:
>>>>> 
>>>>>> You’re relying on legacyMode, which is no longer supported. In
>>>>>> older versions of Solr, if a core.properties file was found on disk
>> Solr
>>>>>> attempted to create the replica (and collection) on the fly. This is
>> no
>>>>>> longer true.
>>>>>> 
>>>>>> 
>>>>>> Why are you doing it this manually instead of using the collections
>> API?
>>>>>> You can precisely place each replica with that API in a way that’ll
>>>>>> be continued to be supported going forward.
>>>>>> 
>>>>>> This really sounds like an XY problem, what is the use-case you’re
>>>>>> trying to solve?
>>>>>> 
>>>>>> Best,
>>>>>> Erick
>>>>>> 
>>>>>>> On Nov 3, 2020, at 6:39 AM, Modassar Ather 
>>>>>> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>&

Re: Commits (with openSearcher = true) are too slow in solr 8

2020-11-04 Thread Erick Erickson
I completely agree with Shawn. I’d emphasize that your heap is that large
probably to accommodate badly mis-configured caches.

Why it’s different in 5.4 I don’t quite know, but 10-12
minutes is unacceptable anyway.

My guess is that you made your heaps that large as a consequence of
having low hit rates. If you were using bare NOW in fq clauses,
perhaps you were getting very low hit rates as a result and expanded
the cache size, see:

https://dzone.com/articles/solr-date-math-now-and-filter

At any rate, I _strongly_ recommend that you drop your filterCache
to the default size of 512, and drop your autowarmCount to something
very small, say 16. Ditto for queryResultCache. The documentCache
to maybe 10,000 (autowarm is a no-op for documentCache). Then
drop your heap to something closer to 16G. Then test, tune, test. Do
NOT assume bigger caches are the answer until you have evidence.
Keep reducing your heap size until you start to see GC problems (on 
a test system obviously) to get your lower limit. Then add some
back for your production to give you some breathing room.
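
As a starting point, something like this in solrconfig.xml (the sizes are just
the starting values I’d test with, and I’m assuming the CaffeineCache class that
8.x ships with; use whatever cache class you already have configured):

<filterCache class="solr.CaffeineCache" size="512" initialSize="512" autowarmCount="16"/>
<queryResultCache class="solr.CaffeineCache" size="512" initialSize="512" autowarmCount="16"/>
<documentCache class="solr.CaffeineCache" size="10000" initialSize="10000" autowarmCount="0"/>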

Finally, see Uwe’s blog:

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

to get a sense of why the size on disk is not necessarily a good
indicator of the heap requirements.

Best,
Erick

> On Nov 4, 2020, at 2:40 AM, Shawn Heisey  wrote:
> 
> On 11/3/2020 11:46 PM, raj.yadav wrote:
>> We have two parallel system one is  solr 8.5.2 and other one is solr 5.4
>> In solr_5.4 commit time with opensearcher true is 10 to 12 minutes while in
>> solr_8 it's around 25 minutes.
> 
> Commits on a properly configured and sized system should take a few seconds, 
> not minutes.  10 to 12 minutes for a commit is an enormous red flag.
> 
>> This is our current caching policy of solr_8
>> >  size="32768"
>>  initialSize="6000"
>>  autowarmCount="6000"/>
> 
> This is probably the culprit.  Do you know how many entries the filterCache 
> actually ends up with?  What you've said with this config is "every time I 
> open a new searcher, I'm going to execute up to 6000 queries against the new 
> index."  If each query takes one second, running 6000 of them is going to 
> take 100 minutes.  I have seen these queries take a lot longer than one 
> second.
> 
> Also, each entry in the filterCache can be enormous, depending on the number 
> of docs in the index.  Let's say that you have five million documents in your 
> core.  With five million documents, each entry in the filterCache is going to 
> be 625000 bytes.  That means you need 20GB of heap memory for a full 
> filterCache of 32768 entries -- 20GB of memory above and beyond everything 
> else that Solr requires.  Your message doesn't say how many documents you 
> have, it only says the index is 11GB.  From that, it is not possible for me 
> to figure out how many documents you have.
> 
>> While debugging this we came across this page.
>> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-Slowcommits
> 
> I wrote that wiki page.
> 
>> Here one of the reasons for slow commit is mentioned as:
>> */`Heap size issues. Problems from the heap being too big will tend to be
>> infrequent, while problems from the heap being too small will tend to happen
>> consistently.`/*
>> Can anyone please help me understand the above point?
> 
> If your heap is a lot bigger than it needs to be, then what you'll see is 
> slow garbage collections, but it won't happen very often.  If the heap is too 
> small, then there will be garbage collections that happen REALLY often, 
> leaving few system resources for actually running the program.  This applies 
> to ANY Java program, not just Solr.
> 
>> System config:
>> disk size: 250 GB
>> cpu: (8 vcpus, 64 GiB memory)
>> Index size: 11 GB
>> JVM heap size: 30 GB
> 
> That heap seems to be a lot larger than it needs to be.  I have run systems 
> with over 100GB of index, with tens of millions of documents, on an 8GB heap. 
>  My filterCache on each core had a max size of 64, with an autowarmCount of 
> four ... and commits STILL would take 10 to 15 seconds, which I consider to 
> be very slow.  Most of that time was spent executing those four queries in 
> order to autowarm the filterCache.
> 
> What I would recommend you start with is reducing the size of the 
> filterCache.  Try a size of 128 and an autowarmCount of 8, see what you get 
> for a hit rate on the cache.  Adjust from there as necessary.  And I would 
> reduce the heap size for Solr as well -- your heap requirements should drop 
> dramatically with a reduced filterCache.
> 
> Thanks,
> Shawn



Re: docValues usage

2020-11-04 Thread Erick Erickson
You don’t need to index the field for function queries, see: 
https://lucene.apache.org/solr/guide/8_6/docvalues.html.

Function queries, as opposed to sorting, faceting and grouping, are evaluated at 
search time, where the search process is already parked on the document anyway, 
so it just has to answer the question “for doc X, what is the value of field Y” 
to compute the score. DocValues are still more efficient I think, although I 
haven’t measured explicitly...
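
E.g. for a boost like the one below, "popularity" (a made-up field name) only
needs docValues=true, it doesn’t have to be indexed:

q=foo&defType=edismax&qf=title&boost=log(sum(popularity,1))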

For sorting, faceting and grouping, it’s a much different story. Take sorting. 
You have to ask
“for field Y, what’s the value in docX and docZ?”. Say you’re parked on docX. 
Doc Z is long gone 
and getting the value for field Y is much more expensive.

Also, docValues will not increase memory requirements _unless used_. Otherwise 
they’ll
just sit there on disk. They will certainly increase disk space whether used or 
not.

And _not_ using docValues when you facet, group or sort will also _certainly_ 
increase
your heap requirements since the docValues structure must be built on the heap 
rather
than be in MMapDirectory space.

Best,
Erick


> On Nov 4, 2020, at 5:32 AM, uyilmaz  wrote:
> 
> Hi,
> 
> I'm by no means expert on this so if anyone sees a mistake please correct me.
> 
> I think you need to index this field, since boost functions are added to the 
> query as optional clauses 
> (https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebf_BoostFunctions_Parameter).
>  It's like boosting a regular field by putting ^2 next to it in a query. 
> Storing or enabling docValues will unnecesarily consume space/memory.
> 
> On Tue, 3 Nov 2020 16:10:50 -0800
> Wei  wrote:
> 
>> Hi,
>> 
>> I have a couple of primitive single value numeric type fields,  their
>> values are used in boosting functions, but not used in sort/facet. or in
>> returned response.   Should I use docValues for them in the schema?  I can
>> think of the following options:
>> 
>> 1)   indexed=true,  stored=true, docValues=false
>> 2)   indexed=true, stored=false, docValues=true
>> 3)   indexed=false,  stored=false,  docValues=true
>> 
>> What would be the performance implications for these options?
>> 
>> Best,
>> Wei
> 
> 
> -- 
> uyilmaz 



Re: when to use stored over docValues and useDocValuesAsStored

2020-11-04 Thread Erick Erickson


> On Nov 4, 2020, at 6:43 AM, uyilmaz  wrote:
> 
> Hi,
> 
> I heavily use streaming expressions and facets, or export large amounts of 
> data from Solr to Spark to make analyses.
> 
> Please correct me if I know wrong:
> 
> + requesting a non-docValues field in a response causes whole document to be 
> decompressed and read from disk

non-docValues fields don’t work at all for many stream sources, IIRC only the 
Topic Stream will work with stored values. The read/decompress/extract cycle 
would be unacceptable performance-wise for large data sets otherwise.

> + streaming expressions and export handler requires every field read to have 
> docValues

Pretty much.

> 
> - docValues increases index size, therefore memory requirement, stored only 
> uses disk space

Yes. 

> - stored preserves order of multivalued fields

Yes.

> 
> It seems stored is only useful when I have a multivalued field that I care 
> about the index-time order of things, and since I will be using the export 
> handler, it will use docValues anyways and lose the order.

Yes.

> 
> So is there any case that I need stored=true?

Not for export outside of the Topic Stream as above. stored=true is there for 
things like showing the user the original input and highlighting.

> 
> Best,
> ufuk
> 
> -- 
> uyilmaz 



Re: Solr migration related issues.

2020-11-03 Thread Erick Erickson
Do note, though, that the default value for legacyCloud changed from
true to false so even though you can get it to work by setting
this cluster prop I wouldn’t…

The change in the default value is why it’s failing for you.


> On Nov 3, 2020, at 11:20 AM, Ilan Ginzburg  wrote:
> 
> I second Erick's recommendation, but just for the record legacyCloud was
> removed in (upcoming) Solr 9 and is still available in Solr 8.x. Most
> likely this explains Modassar why you found it in the documentation.
> 
> Ilan
> 
> 
> On Tue, Nov 3, 2020 at 5:11 PM Erick Erickson 
> wrote:
> 
>> You absolutely need core.properties files. It’s just that they
>> should be considered an “implementation detail” that you
>> should rarely, if ever need to be aware of.
>> 
>> Scripting manual creation of core.properties files in order
>> to define your collections has never been officially supported, it
>> just happened to work.
>> 
>> Best,
>> Erick
>> 
>>> On Nov 3, 2020, at 11:06 AM, Modassar Ather 
>> wrote:
>>> 
>>> Thanks Erick for your response.
>>> 
>>> I will certainly use the APIs and not rely on the core.properties. I was
>>> going through the documentation on core.properties and found it to be
>> still
>>> there.
>>> I have all the solr install scripts based on older Solr versions and
>> wanted
>>> to re-use the same as the core.properties way is still available.
>>> 
>>> So does this mean that we do not need core.properties anymore?
>>> How can we ensure that the core name is configurable and not dynamically
>>> set?
>>> 
>>> I will try to use the APIs to create the collection as well as the cores.
>>> 
>>> Best,
>>> Modassar
>>> 
>>> On Tue, Nov 3, 2020 at 5:55 PM Erick Erickson 
>>> wrote:
>>> 
>>>> You’re relying on legacyMode, which is no longer supported. In
>>>> older versions of Solr, if a core.properties file was found on disk Solr
>>>> attempted to create the replica (and collection) on the fly. This is no
>>>> longer true.
>>>> 
>>>> 
>>>> Why are you doing it this manually instead of using the collections API?
>>>> You can precisely place each replica with that API in a way that’ll
>>>> be continued to be supported going forward.
>>>> 
>>>> This really sounds like an XY problem, what is the use-case you’re
>>>> trying to solve?
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Nov 3, 2020, at 6:39 AM, Modassar Ather 
>>>> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am migrating from Solr 6.5.1 to Solr 8.6.3. As a part of the entire
>>>>> upgrade I have the first task to install and configure the solr with
>> the
>>>>> core and collection. The solr is installed in SolrCloud mode.
>>>>> 
>>>>> In Solr 6.5.1 I was using the following key values in core.properties
>>>> file.
>>>>> The configuration files were uploaded to zookeeper using the upconfig
>>>>> command.
>>>>> The core and collection was automatically created with the setting in
>>>>> core.properties files and the configSet uploaded in zookeeper and it
>> used
>>>>> to display on the Solr 6.5.1 dashboard.
>>>>> 
>>>>> numShards=12
>>>>> 
>>>>> name=mycore
>>>>> 
>>>>> collection=mycore
>>>>> 
>>>>> configSet=mycore
>>>>> 
>>>>> 
>>>>> With the latest Solr 8.6.3 the same approach is not working. As per my
>>>>> understanding the core is identified using the location of
>>>> core.properties
>>>>> which is under */mycore/core.properties.*
>>>>> 
>>>>> Can you please help me with the following?
>>>>> 
>>>>> 
>>>>> - Is there any property I am missing to load the core and collection
>> as
>>>>> it used to be in Solr 6.5.1 with the help of core.properties and
>>>> config set
>>>>> on zookeeper?
>>>>> - The name of the core and collection should be configurable and not
>>>> the
>>>>> dynamically generated names. How can I control that in the latest
>> Solr?
>>>>> - Is the core and collection API the only way to create core and
>>>>> collection as I see that the core is also not getting listed even if
>>>> the
>>>>> core.properties file is present?
>>>>> 
>>>>> Please note that I will be doing a full indexing once the setup is
>> done.
>>>>> 
>>>>> Kindly help me with your suggestions.
>>>>> 
>>>>> Best,
>>>>> Modassar
>>>> 
>>>> 
>> 
>> 



Re: Solr migration related issues.

2020-11-03 Thread Erick Erickson
You absolutely need core.properties files. It’s just that they
should be considered an “implementation detail” that you
should rarely, if ever need to be aware of.

Scripting manual creation of core.properties files in order
to define your collections has never been officially supported, it
just happened to work.

Best,
Erick

> On Nov 3, 2020, at 11:06 AM, Modassar Ather  wrote:
> 
> Thanks Erick for your response.
> 
> I will certainly use the APIs and not rely on the core.properties. I was
> going through the documentation on core.properties and found it to be still
> there.
> I have all the solr install scripts based on older Solr versions and wanted
> to re-use the same as the core.properties way is still available.
> 
> So does this mean that we do not need core.properties anymore?
> How can we ensure that the core name is configurable and not dynamically
> set?
> 
> I will try to use the APIs to create the collection as well as the cores.
> 
> Best,
> Modassar
> 
> On Tue, Nov 3, 2020 at 5:55 PM Erick Erickson 
> wrote:
> 
>> You’re relying on legacyMode, which is no longer supported. In
>> older versions of Solr, if a core.properties file was found on disk Solr
>> attempted to create the replica (and collection) on the fly. This is no
>> longer true.
>> 
>> 
>> Why are you doing it this manually instead of using the collections API?
>> You can precisely place each replica with that API in a way that’ll
>> be continued to be supported going forward.
>> 
>> This really sounds like an XY problem, what is the use-case you’re
>> trying to solve?
>> 
>> Best,
>> Erick
>> 
>>> On Nov 3, 2020, at 6:39 AM, Modassar Ather 
>> wrote:
>>> 
>>> Hi,
>>> 
>>> I am migrating from Solr 6.5.1 to Solr 8.6.3. As a part of the entire
>>> upgrade I have the first task to install and configure the solr with the
>>> core and collection. The solr is installed in SolrCloud mode.
>>> 
>>> In Solr 6.5.1 I was using the following key values in core.properties
>> file.
>>> The configuration files were uploaded to zookeeper using the upconfig
>>> command.
>>> The core and collection was automatically created with the setting in
>>> core.properties files and the configSet uploaded in zookeeper and it used
>>> to display on the Solr 6.5.1 dashboard.
>>> 
>>> numShards=12
>>> 
>>> name=mycore
>>> 
>>> collection=mycore
>>> 
>>> configSet=mycore
>>> 
>>> 
>>> With the latest Solr 8.6.3 the same approach is not working. As per my
>>> understanding the core is identified using the location of
>> core.properties
>>> which is under */mycore/core.properties.*
>>> 
>>> Can you please help me with the following?
>>> 
>>> 
>>>  - Is there any property I am missing to load the core and collection as
>>>  it used to be in Solr 6.5.1 with the help of core.properties and
>> config set
>>>  on zookeeper?
>>>  - The name of the core and collection should be configurable and not
>> the
>>>  dynamically generated names. How can I control that in the latest Solr?
>>>  - Is the core and collection API the only way to create core and
>>>  collection as I see that the core is also not getting listed even if
>> the
>>>  core.properties file is present?
>>> 
>>> Please note that I will be doing a full indexing once the setup is done.
>>> 
>>> Kindly help me with your suggestions.
>>> 
>>> Best,
>>> Modassar
>> 
>> 



Re: how do you manage your config and schema

2020-11-03 Thread Erick Erickson
The caution I would add is that you should be careful 
that you don’t enable schemaless mode without understanding 
the consequences in detail.

There is, in fact, some discussion of removing schemaless entirely, 
see:
https://issues.apache.org/jira/browse/SOLR-14701
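
If you do stay on _default, you can at least turn the data-driven/schemaless
bits off, something like (the collection name is a placeholder):

curl http://localhost:8983/solr/mycollection/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'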

Otherwise, I usually recommend that you take the stock configs and
overlay whatever customizations you’ve added in terms of
field definitions and the like.

Do also be careful, some default field params have changed…

Best,
Erick

> On Nov 3, 2020, at 9:30 AM, matthew sporleder  wrote:
> 
> Yesterday I realized that we have been carrying forward our configs
> since, probably, 4.x days.
> 
> I ran a config set action=create (from _default) and saw files i
> didn't recognize, and a lot *fewer* things than I've been uploading
> for the last few years.
> 
> Anyway my new plan is to just use _default and keep params.json,
> solrconfig.xml, and schema.xml in git and just use the defaults for
> the rest.  (modulo synonyms/etc)
> 
> Did everyone move on to managed schema and use some kind of
> intermediate format to upload?
> 
> I'm just looking for updated best practices and a little survey of usage 
> trends.
> 
> Thanks,
> Matt



Re: Search issue in the SOLR for few words

2020-11-03 Thread Erick Erickson
There is not nearly enough information here to begin
to help you.

At minimum we need:
1> your field definition
2> the text you index
3> the query you send

You might want to review: 
https://wiki.apache.org/solr/UsingMailingLists

Best,
Erick

> On Nov 3, 2020, at 1:08 AM, Viresh Sasalawad 
>  wrote:
> 
> Hi Sir/Madam,
> 
> Am facing an issue with few keyword searches (like gazing, one) in solr.
> Can you please help why these words are not listed in solr results?
> 
> Indexing is done properly.
> 
> 
> -- 
> Thanks and Regards
> Veeresh Sasalawad



Re: Solr migration related issues.

2020-11-03 Thread Erick Erickson
You’re relying on legacyMode, which is no longer supported. In
older versions of Solr, if a core.properties file was found on disk Solr
attempted to create the replica (and collection) on the fly. This is no
longer true.


Why are you doing it this manually instead of using the collections API?
You can precisely place each replica with that API in a way that’ll
be continued to be supported going forward.

This really sounds like an XY problem, what is the use-case you’re
trying to solve?

Best,
Erick

> On Nov 3, 2020, at 6:39 AM, Modassar Ather  wrote:
> 
> Hi,
> 
> I am migrating from Solr 6.5.1 to Solr 8.6.3. As a part of the entire
> upgrade I have the first task to install and configure the solr with the
> core and collection. The solr is installed in SolrCloud mode.
> 
> In Solr 6.5.1 I was using the following key values in core.properties file.
> The configuration files were uploaded to zookeeper using the upconfig
> command.
> The core and collection was automatically created with the setting in
> core.properties files and the configSet uploaded in zookeeper and it used
> to display on the Solr 6.5.1 dashboard.
> 
> numShards=12
> 
> name=mycore
> 
> collection=mycore
> 
> configSet=mycore
> 
> 
> With the latest Solr 8.6.3 the same approach is not working. As per my
> understanding the core is identified using the location of core.properties
> which is under */mycore/core.properties.*
> 
> Can you please help me with the following?
> 
> 
>   - Is there any property I am missing to load the core and collection as
>   it used to be in Solr 6.5.1 with the help of core.properties and config set
>   on zookeeper?
>   - The name of the core and collection should be configurable and not the
>   dynamically generated names. How can I control that in the latest Solr?
>   - Is the core and collection API the only way to create core and
>   collection as I see that the core is also not getting listed even if the
>   core.properties file is present?
> 
> Please note that I will be doing a full indexing once the setup is done.
> 
> Kindly help me with your suggestions.
> 
> Best,
> Modassar



Re: SOLR uses too much CPU and GC is also weird on Windows server

2020-11-02 Thread Erick Erickson
What this sounds like is that somehow you were committing after every update in 
8x but not in your 6x code. How that would have been changed is anybody’s guess 
;).

It’s vaguely possible that your client is committing and you had 
IgnoreCommitOptimizeUpdateProcessorFactory defined in your update chain in 6x 
but not 8x.
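
For reference, such a chain looks roughly like this (adjust to whatever else is
in your update chain):

<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>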

The other thing would be if your commit interval was much shorter in 8x than 6x 
or if your autowarm parameters were significantly different.

That said, this is still a mystery, glad you found an answer.

Thanks for getting back to us on this, this is useful information to have.

Best,
Erick

> On Nov 2, 2020, at 7:50 AM, Jaan Arjasepp  wrote:
> 
> Thanks for all for helping to think about it, but eventually found out that 
> code was basically single record deleting/adding records. After it was 
> batched up, then everything is back to normal. Funny thing is that 6.0.0. 
> handled these requests somehow, but newer version did not.
> Anyway, we will observe this and try to improve our code as well.
> 
> Best regards,
> Jaan
> 
> -Original Message-
> From: Erick Erickson  
> Sent: 28 October 2020 17:18
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR uses too much CPU and GC is also weird on Windows server
> 
> DocValues=true are usually only used for “primitive” types, string, numerics, 
> booleans and the like, specifically _not_ text-based.
> 
> I say “usually” because there’s a special “SortableTextField” where it does 
> make some sense to have a text-based field have docValues, but that’s 
> intended for relatively short fields. For example you want to sort on a title 
> field. And probably not something you’re working with.
> 
> There’s not much we can say from this distance I’m afraid. I think I’d focus 
> on the memory requirements, maybe take a heap dump and see what’s using 
> memory.
> 
> Did you restart Solr _after_ turning off indexing? I ask because that would 
> help determine which side the problem is on, indexing or querying. It does 
> sound like querying though.
> 
> As for docValues in general, if you want to be really brave, you can set 
> uninvertible=false for all your fields where docValues=false. When you facet 
> on such a field, you won’t get anything back. If you sort on such a field, 
> you’ll get an error message back. That should test if somehow not having 
> docValues is the root of your problem. Do this on a test system of course ;) 
> I think this is a low-probability issue, but it’s a mystery anyway so...
> 
> Updating shouldn’t be that much of a problem either, and if you still see 
> high CPU with indexing turned off, that eliminates indexing as a candidate.
> 
> Is there any chance you changed your schema at all and didn’t delete your 
> entire index and add all your documents back? There are a lot of ways things 
> can go wrong if that’s the case. You had to reindex from scratch when you 
> went to 8x from 6x, I’m wondering if during that process the schema changed 
> without starting over. I’m grasping at straws here…
> 
> I’d also seriously consider going to 8.6.3. We only make point releases when 
> there’s something serious. Looking through lucene/CHANGES.txt, there is one 
> memory leak fix in 8.6.2. I’d expect a gradual buildup of heap if that were 
> what you’re seeing, but you never know.
> 
> As for having docValues=false, that would cut down on the size of the index 
> on disk and speed up indexing some, but in terms of memory usage or CPU usage 
> when querying, unless the docValues structures are _needed_, they’re never 
> read into OS RAM by MMapDirectory… The question really is whether you ever, 
> intentionally or not, do “something” that would be more efficient with 
> docValues. That’s where setting uninvertible=false whenever you set 
> docValues=false makes sense, things will show up if your assumption that you 
> don’t need docValues is false.
> 
> Best,
> Erick
> 
> 
>> On Oct 28, 2020, at 9:29 AM, Jaan Arjasepp  wrote:
>> 
>> Hi all,
>> 
>> Its me again. Anyway, I did a little research and we tried different things 
>> and well, some questions I want to ask and some things that I found.
>> 
>> Well after monitoring my system with VirtualVM, I found that GC jumping is 
>> from 0.5GB to 2.5GB and it has 4GB of memory for now, so it should not be an 
>> issue anymore or what? But will observe it a bit as it might rise I guess a 
>> bit.
>> 
>> Next thing we found or are thinking about is that writing on a disk might be 
>> an issue, we turned off the indexing and some other stuff, but I would say, 
>> it did not save much still.
>> I also did go through all the schema fields, not that much really. They are 
>> all 

Re: Simulate facet.exists for json query facets

2020-10-30 Thread Erick Erickson
I don’t think there’s anything that does what you’re asking OOB.

If all of those facet queries are _known_ to be a performance hit,
you might be able to do something custom. That would require
custom code, though, and I wouldn’t go there unless you can
demonstrate need.

If you issue debug=timing you’ll see the time each component
takes, and there’s a separate entry for faceting, so that’ll give you
a clue whether it’s worth the effort.
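
For example, something along these lines (reusing the collection name from
your earlier curl; the rest is illustrative):

  curl "http://localhost:8983/solr/portal/select?q=*:*&rows=0&debug=timing&json.facet={...}"

In the debug/timing section of the response the facet module should show up
as its own entry (facet_module in recent versions), so you can compare it
against the total QTime.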

Best,
Erick

> On Oct 30, 2020, at 8:10 AM, Michael Gibney  wrote:
> 
> Michael, sorry for the confusion; I was positing a *hypothetical*
> "exists()" function that doesn't currently exist, that *is* an
> aggregate function, and the *does* stop early. I didn't account for
> the fact that there's already an "exists()" function *query* that
> behaves very differently. So yes, definitely confusing :-). I guess
> choosing a different name for the proposed aggregate function would
> make sense. I was suggesting it mostly as an alternative to extending
> the syntax of JSON Facet "query" facet type, and to say that I think
> the implementation of such an aggregate function would be pretty
> straightforward.
> 
> On Fri, Oct 30, 2020 at 3:44 AM michael dürr  wrote:
>> 
>> @Erick
>> 
>> Sorry! I chose a simple example as I wanted to reduce complexity.
>> In detail:
>> * We have distinct contents like tours, offers, events, etc which
>> themselves may be categorized: A tour may be a hiking tour, a
>> mountaineering tour, ...
>> * We have hundreds of customers that want to facet their searches to that
>> content types but often with distinct combinations of categories, i.e.
>> customer A wants his facet "tours" to only count hiking tours, customer B
>> only mountaineering tours, customer C a combination of both, etc
>> * We use "query" facets as each facet request will be build dynamically (it
>> is not feasible to aggregate certain categories and add them as an
>> additional solr schema field as we have hundreds of different combinations).
>> * Anyways, our ui only requires adding a toggle to filter for (for example)
>> "tours" in case a facet result is present. We do not care about the number
>> of tours.
>> * As we have millions of contents and dozens of content types (and dozens
>> of categories per content type) such queries may take a very long time.
>> 
>> A complex example may look like this:
>> 
>> q=*:*&json.facet={
>>   tour:      { type: query, q: "+categoryId:(21450 21453)" },
>>   guide:     { type: query, q: "+categoryId:(21105 21401 21301 21302 21303 21304 21305 21403 21404)" },
>>   story:     { type: query, q: "+categoryId:21515" },
>>   condition: { type: query, q: "+categoryId:21514" },
>>   hut:       { type: query, q: "+categoryId:8510" },
>>   skiresort: { type: query, q: "+categoryId:21493" },
>>   offer:     { type: query, q: "+categoryId:21462" },
>>   lodging:   { type: query, q: "+categoryId:6061" },
>>   event:     { type: query, q: "+categoryId:21465" },
>>   poi:       { type: query, q: "+(+categoryId:6000 -categoryId:(6061 21493 8510))" },
>>   authors:   { type: query, q: "+categoryId:(21205 21206)" },
>>   partners:  { type: query, q: "+categoryId:21200" },
>>   list:      { type: query, q: "+categoryId:21481" }
>> }&rows=0
>> 
>> @Michael
>> 
>> Thanks for your suggestion but this does not work as
>> * the facet module expects an aggregate function (which i simply added by
>> embracing your call with sum(...))
>> * and (please correct me if I am wrong) the exists() function not stops on
>> the first match, but counts the number of results for which the query
>> matches a document.



Re: SOLR uses too much CPU and GC is also weird on Windows server

2020-10-28 Thread Erick Erickson
DocValues=true are usually only used for “primitive” types, string, numerics, 
booleans and the like, specifically _not_ text-based.

I say “usually” because there’s a special “SortableTextField” where it does 
make some sense to have a text-based field have docValues, but that’s intended 
for relatively short fields. For example you want to sort on a title field. And 
probably not something you’re working with.

There’s not much we can say from this distance I’m afraid. I think I’d focus on 
the memory requirements, maybe take a heap dump and see what’s using memory.

Did you restart Solr _after_ turning off indexing? I ask because that would 
help determine which side the problem is on, indexing or querying. It does 
sound like querying though.

As for docValues in general, if you want to be really brave, you can set 
uninvertible=false for all your fields where docValues=false. When you facet on 
such a field, you won’t get anything back. If you sort on such a field, you’ll 
get an error message back. That should test if somehow not having docValues is 
the root of your problem. Do this on a test system of course ;) I think this is 
a low-probability issue, but it’s a mystery anyway so...
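
A minimal sketch of what that looks like in the schema (the field name is
just an example):

  <field name="someStringField" type="string" indexed="true" stored="true"
         docValues="false" uninvertible="false"/>

You can also put uninvertible on the fieldType if you’d rather flip a whole
type at once.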

Updating shouldn’t be that much of a problem either, and if you still see high 
CPU with indexing turned off, that eliminates indexing as a candidate.

Is there any chance you changed your schema at all and didn’t delete your 
entire index and add all your documents back? There are a lot of ways things 
can go wrong if that’s the case. You had to reindex from scratch when you went 
to 8x from 6x, I’m wondering if during that process the schema changed without 
starting over. I’m grasping at straws here…

I’d also seriously consider going to 8.6.3. We only make point releases when 
there’s something serious. Looking through lucene/CHANGES.txt, there is one 
memory leak fix in 8.6.2. I’d expect a gradual buildup of heap if that were 
what you’re seeing, but you never know.

As for having docValues=false, that would cut down on the size of the index on 
disk and speed up indexing some, but in terms of memory usage or CPU usage when 
querying, unless the docValues structures are _needed_, they’re never read into 
OS RAM by MMapDirectory… The question really is whether you ever, intentionally 
or not, do “something” that would be more efficient with docValues. That’s 
where setting uninvertible=false whenever you set docValues=false makes sense, 
things will show up if your assumption that you don’t need docValues is false.

Best,
Erick


> On Oct 28, 2020, at 9:29 AM, Jaan Arjasepp  wrote:
> 
> Hi all,
> 
> Its me again. Anyway, I did a little research and we tried different things 
> and well, some questions I want to ask and some things that I found.
> 
> Well after monitoring my system with VirtualVM, I found that GC jumping is 
> from 0.5GB to 2.5GB and it has 4GB of memory for now, so it should not be an 
> issue anymore or what? But will observe it a bit as it might rise I guess a 
> bit.
> 
> Next thing we found or are thinking about is that writing on a disk might be 
> an issue, we turned off the indexing and some other stuff, but I would say, 
> it did not save much still.
> I also did go through all the schema fields, not that much really. They are 
> all docValues=true. Also I must say they are all automatically generated, so 
> no manual working there except one field, but this also has docValue=true. 
> Just curious, if the field is not a string/text, can it be docValue=false or 
> still better to have true? And as for uninversion, then we are not using much 
> facets nor other specific things in query, just simple queries. 
> 
> Though I must say we are updating documents quite a bunch, but usage of CPU 
> for being so high, not sure about that. Older version seemed not using CPU so 
> much...
> 
> I am a bit running out of ideas and hoping that this will continue to work, 
> but I dont like the CPU usage even over night, when nobody uses it. We will 
> try to figure out the issue here and I hope I can ask more questions when in 
> doubt or out of ideas. Also I must admit, solr is really new for me 
> personally.
> 
> Jaan
> 
> -Original Message-
> From: Walter Underwood  
> Sent: 27 October 2020 18:44
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR uses too much CPU and GC is also weird on Windows server
> 
> That first graph shows a JVM that does not have enough heap for the program 
> it is running. Look at the bottom of the dips. That is the amount of memory 
> still in use after a full GC.
> 
> You want those dips to drop to about half of the available heap, so I’d 
> immediately increase that heap to 4G. That might not be enough, so you’ll 
> need to to watch that graph after the increase.
> 
> I’ve been using 8G heaps with Solr since version 1.2. We run this config with 
> Java 8 on over 100 machines. We do not do any faceting, which can take more 
> memory.
> 
> SOLR_HEAP=8g
> # Use G1

Re: Simulate facet.exists for json query facets

2020-10-28 Thread Erick Erickson
This really sounds like an XY problem. The whole point of facets is
to count the number of documents that have a value in some
number of buckets. So trying to stop your facet query as soon
as it matches a hit for the first time seems like an odd thing to do.

So what’s the “X”? In other words, what is the problem you’re trying
to solve at a high level? Perhaps there’s a better way to figure this
out.

Best,
Erick

> On Oct 28, 2020, at 3:48 AM, michael dürr  wrote:
> 
> Hi,
> 
> I use json facets of type 'query'. As these queries are pretty slow and I'm
> only interested in whether there is a match or not, I'd like to restrict
> the query execution similar to the standard facetting (like with the
> facet.exists parameter). My simplified query looks something like this (in
> reality *:* may be replaced by a complex edismax query and multiple
> subfacets similar to "tour" occur):
> 
> curl http://localhost:8983/solr/portal/select -d \
> "q=*:*\
> &json.facet={
>  tour:{
>type : query,
> q: \"+(+categoryId:6000 -categoryId:(6061 21493 8510))\"
>  }
> }\
> &rows=0"
> 
> Is there any possibility to modify my request to ensure that the facet
> query stops as soon as it matches a hit for the first time?
> 
> Thanks!
> Michael



Re: Solrcloud create collection ignores createNodeSet parameter

2020-10-27 Thread Erick Erickson
You’re confusing replicas and shards a bit. Solr tries its best to put multiple 
replicas _of the same shard_ on different nodes. You have two shards though 
with _one_ replica. This is a bit of a nit, but important to keep in mind when 
your replicationFactor increases. So from an HA perspective, this isn’t 
catastrophic since both shards must be up to run.

That said, it does seem reasonable to use all the nodes in your case. If you 
omit the createNodeSet, what happens? I’m curious if that’s confusing things 
somehow. And can you totally guarantee that both nodes are accessible when the 
collection is created?

BTW, I’ve always disliked the parameter name “maxShardsPerNode”, shards isn’t 
what it’s actually about. But I suppose 
“maxReplicasOfAnyIndividualShardOnASingleNode” is a little verbose...

> On Oct 27, 2020, at 2:17 PM, Webster Homer  
> wrote:
> 
> We have a solrcloud set up with 2 nodes, 1 zookeeper and running Solr 7.7.2 
> This cloud is used for development purposes. Collections are sharded across 
> the 2 nodes.
> 
> Recently we noticed that one of the main collections we use had both replicas 
> running on the same node. Normally we don't see collections created where the 
> replicas run on the same node.
> 
> I tried to create a new version of the collection forcing it to use both 
> nodes. However, that doesn't work both replicas are created on the same node:
> /solr/admin/collections?action=CREATE&name=sial-catalog-product-20201027&collection.configName=sial-catalog-product-20200808&numShards=2&replicationFactor=1&maxShardsPerNode=1&createNodeSet=uc1a-ecomdev-msc02:8983_solr,uc1a-ecomdev-msc01:8983_solr
> The call returns this:
> {
>"responseHeader": {
>"status": 0,
>"QTime": 4659
>},
>"success": {
>"uc1a-ecomdev-msc01:8983_solr": {
>"responseHeader": {
>"status": 0,
>"QTime": 3900
>},
>"core": "sial-catalog-product-20201027_shard2_replica_n2"
>},
>"uc1a-ecomdev-msc01:8983_solr": {
>"responseHeader": {
>"status": 0,
>"QTime": 4012
>},
>"core": "sial-catalog-product-20201027_shard1_replica_n1"
>}
>}
> }
> 
> Both replicas are created on the same node. Why is this happening?
> 
> How do we force the replicas be placed on different nodes?
> 
> 
> 
> This message and any attachment are confidential and may be privileged or 
> otherwise protected from disclosure. If you are not the intended recipient, 
> you must not copy this message or attachment or disclose the contents to any 
> other person. If you have received this transmission in error, please notify 
> the sender immediately and delete the message and any attachment from your 
> system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
> accept liability for any omissions or errors in this message which may arise 
> as a result of E-Mail-transmission or for damages resulting from any 
> unauthorized changes of the content of this message and any attachment 
> thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not 
> guarantee that this message is free of viruses and does not accept liability 
> for any damages caused by any virus transmitted therewith.
> 
> 
> 
> Click http://www.merckgroup.com/disclaimer to access the German, French, 
> Spanish and Portuguese versions of this disclaimer.



Re: SOLR uses too much CPU and GC is also weird on Windows server

2020-10-27 Thread Erick Erickson
Jean:

The basic search uses an “inverted index”, which is basically a list of terms 
and the documents they appear in, e.g.
my - 1, 4, 9, 12
dog - 4, 8, 10

So the word “my” appears in docs 1, 4, 9 and 12, and “dog” appears in 4, 8, 10. 
Makes
it easy to search for 
my AND dog
for instance, obviously both appear in doc 4.

But that’s a lousy structure for faceting, where you have a list of documents 
and are trying to
find the terms it has to count them up. For that, you want to “uninvert” the 
above structure,
1 - my
4 - my dog
8 - dog
9 - my
10 - dog
12 - my

From there, it’s easy to say “count the distinct terms for docs 1 and 4 and put 
them in a bucket”,
giving facet counts like 

my (2)
dog (1)

If docValues=true, then the second structure is built at index time and 
occupies memory at
run time out in MMapDirectory space, i.e. _not_ on the heap. 

If docValues=false, the second structure is built _on_ the heap when it’s 
needed, adding to
GC, memory pressure, CPU utilization etc.

So one theory is that when you upgraded your system (and you did completely 
rebuild your
corpus, right?) you inadvertently changed the docValues property for one or 
more fields that you 
facet, group, sort, or use function queries on and Solr is doing all the extra 
work of
uninverting the field that it didn’t have to before.

To answer that, you need to go through your schema and ensure that 
docValues=true is
set for any field you facet, group, sort, or use function queries on. If you do 
change
this value, you need to blow away your index so there are no segments and index
all your documents again.
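
For instance, a field you facet on would look something like this (field
name is just an example):

  <field name="category" type="string" indexed="true" stored="true" docValues="true"/>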

But that theory has problems:
1> why should Solr run for a while and then go crazy? It’d have to be that the 
query that
triggers uninversion is uncommon.
2> docValues defaults to true for simple types in recent schemas. Perhaps you 
pulled
  over an old definition from your former schema?


One other thing: you mention a bit of custom code you needed to change. I 
always try to
investigate that first. Is it possible to
1> reproduce the problem on a non-prod system
2> see what happens if you take the custom code out?

Best,
Erick


> On Oct 27, 2020, at 4:42 AM, Emir Arnautović  
> wrote:
> 
> Hi Jaan,
> It can be several things:
> caches
> fieldCache/fieldValueCache - it can be that you you are missing doc values on 
> some fields that are used for faceting/sorting/functions and that uninverted 
> field structures are eating your memory. 
> filterCache - you’ve changed setting for filter caches and set it to some 
> large value
> heavy queries
> return a lot of documents
> facet on high cardinality fields
> deep pagination
> 
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 27 Oct 2020, at 08:48, Jaan Arjasepp  wrote:
>> 
>> Hello,
>> 
>> We have been using SOLR for quite some time. We used 6.0 and now we did a 
>> little upgrade to our system and servers and we started to use 8.6.1.
>> We use it on a Windows Server 2019.
>> Java version is 11
>> Basically using it in a default setting, except giving SOLR 2G of heap. It 
>> used 512, but it ran out of memory and stopped responding. Not sure if it 
>> was the issue. When older version, it managed fine with 512MB.
>> SOLR is not in a cloud mode, but in solo mode as we use it internally and it 
>> does not have too many request nor indexing actually.
>> Document sizes are not big, I guess. We only use one core.
>> Document stats are here:
>> Num Docs: 3627341
>> Max Doc: 4981019
>> Heap Memory Usage: 434400
>> Deleted Docs: 1353678
>> Version: 15999036
>> Segment Count: 30
>> 
>> The size of index is 2.66GB
>> 
>> While making upgrade we had to modify one field and a bit of code that uses 
>> it. Thats basically it. It works.
>> If needed more information about background of the system, I am happy to 
>> help.
>> 
>> 
>> But now to the issue I am having.
>> If SOLR is started, at first 40-60 minutes it works just fine. CPU is not 
>> high, heap usage seem normal. All is good, but then suddenly, the heap usage 
>> goes crazy, going up and down, up and down and CPU rises to 50-60% of the 
>> usage. Also I noticed over the weekend, when there are no writing usage, the 
>> CPU remains low and decent. I can try it this weekend again to see if and 
>> how this works out.
>> Also it seems to me, that after 4-5 days of working like this, it stops 
>> responding, but needs to be confirmed with more heap also.
>> 
>> Heap memory usage via JMX and jconsole -> 
>> https://drive.google.com/file/d/1Zo3B_xFsrrt-WRaxW-0A0QMXDNscXYih/view?usp=sharing
>> As you can see, it starts of normal, but then goes crazy and it has been 
>> like this over night.
>> 
>> This is overall monitoring graphs, as you can see CPU is working hard or 
>> hardly working. -> 
>> https://drive.google.com/file/d/1_Gtz-Bi7LUrj8UZvKfmNMr-8gF_lM2Ra/view?usp=sharing
>> VM summary can be found here -

Re: Performance issues with CursorMark

2020-10-26 Thread Erick Erickson
8.6 still has uninvertible=true, so this should go ahead and create an on-heap 
docValues structure. That’s going to put 38M ints on the heap. Still, that 
shouldn’t require 500M additional space, and this would have been happening in 
your old system anyway so I’m at a loss to explain…

Unless you’re committing frequently or something like that, in which case I 
guess you could have multiple uninverted structures, but that’s a stretch. And 
you can’t get away from sorting on the ID, since it’s required for CursorMark…
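
For reference, a minimal cursor walk looks something like this (collection
name is made up):

  curl "http://localhost:8983/solr/yourCollection/select?q=*:*&rows=500&sort=id+asc&cursorMark=*"

and each following request passes the nextCursorMark returned by the previous
one. Since the sort must include the uniqueKey, every page touches the id
field, and without docValues that means an uninverted copy of it on the heap.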

I’m wondering whether it’s on the fetch or push phase. What happens if you 
disable firing the docs off to be indexed? It’s vaguely possible that 
CursorMark is a red herring and the slowdown is in the indexing side, at least 
it’s worth checking.

This is puzzling, IDK what would be causing it.

It would be good to get to the bottom of this, but I wanted to mention the 
Collections API REINDEXCOLLECTION command as something possibly worth exploring 
as an alternative. That said, understanding why things have changed is 
important… 

Best,
Erick

> On Oct 26, 2020, at 12:29 PM, Markus Jelsma  
> wrote:
> 
> Hello Anshum,
> 
> Good point! We sort on the collection's uniqueKey, our id field and this one 
> does not have docValues enabled for it. It could be a contender but is it the 
> problem? I cannot easily test it at this scale.
> 
> Thanks,
> Markus
> 
> -Original message-
>> From:Anshum Gupta 
>> Sent: Monday 26th October 2020 17:00
>> To: solr-user@lucene.apache.org
>> Subject: Re: Performance issues with CursorMark
>> 
>> Hey Markus,
>> 
>> What are you sorting on? Do you have docValues enabled on the sort field ?
>> 
>> On Mon, Oct 26, 2020 at 5:36 AM Markus Jelsma 
>> wrote:
>> 
>>> Hello,
>>> 
>>> We have been using a simple Python tool for a long time that eases
>>> movement of data between Solr collections, it uses CursorMark to fetch
>>> small or large pieces of data. Recently it stopped working when moving data
>>> from a production collection to my local machine for testing, the Solr
>>> nodes began to run OOM.
>>> 
>>> I added 500M to the 3G heap and now it works again, but slow (240docs/s)
>>> and costing 3G of the entire heap just to move 32k docs out of 76m total.
>>> 
>>> Solr 8.6.0 is running with two shards (1 leader+1 replica), each shard has
>>> 38m docs almost no deletions (0.4%) taking up ~10.6g disk space. The
>>> documents are very small, they are logs of various interactions of users
>>> with our main text search engine.
>>> 
>>> I monitored all four nodes with VisualVM during the transfer, all four
>>> went up to 3g heap consumption very quickly. After the transfer it took a
>>> while for two nodes to (forcefully) release the no longer for the transfer
>>> needed heap space. The two other nodes, now, 17 minutes later, still think
>>> they have to hang on to their heap consumption. When i start the same
>>> transfer again, the nodes that already have high memory consumption just
>>> seem to reuse that, not consuming additional heap. At least the second time
>>> it went 920docs/s. While we are used to transfer these tiny documents at
>>> light speed of multiple thousands per second.
>>> 
>>> What is going on? We do not need additional heap, Solr is clearly not
>>> asking for more and GC activity is minimal. Why did it become so slow?
>>> Regular queries on the collection are still going fast, but CursorMarking
>>> even through a tiny portion is taking time and memory.
>>> 
>>> Many thanks,
>>> Markus
>>> 
>> 
>> 
>> -- 
>> Anshum Gupta
>> 



Re: TieredMergePolicyFactory question

2020-10-26 Thread Erick Erickson
"Some large segments were merged into 12GB segments and
deleted documents were physically removed.”
and
“So with the current natural merge strategy, I need to update solrconfig.xml
and increase the maxMergedSegmentMB often"

I strongly recommend you do not continue down this path. You’re making a
mountain out of a mole-hill. You have offered no proof that removing the
deleted documents is noticeably improving performance. If you replace
docs randomly, deleted docs will be removed eventually with the default
merge policy without you doing _anything_ special at all.

The fact that you think you need to continuously bump up the size of
your segments indicates your understanding is incomplete. When
you start changing settings basically at random in order to “fix” a problem,
especially one that you haven’t demonstrated _is_ a problem, you 
invariably make the problem worse.

By making segments larger, you’ve increased the work Solr (well Lucene) has
to do in order to merge them since the merge process has to handle these
larger segments. That’ll take longer. There are a fixed number of threads
that do merging. If they’re all tied up, incoming updates will block until
a thread frees up. I predict that if you continue down this path, eventually
your updates will start to misbehave and you’ll spend a week trying to figure
out why.

If you insist on worrying about deleted documents, just expungeDeletes
occasionally. I’d also set the segments size back to the default 5G. I can’t
emphasize strongly enough that the way you’re approaching this will lead
to problems, not to mention maintenance that is harder than it needs to
be. If you do set the max segment size back to 5G, your 12G segments will
_not_ merge until they have lots of deletes, making your problem worse. 
Then you’ll spend time trying to figure out why.

Recovering from what you’ve done already has problems. Those large segments
_will_ get rewritten (we call it “singleton merge”) when they’ve accumulated a
lot of deletes, but meanwhile you’ll think that your problem is getting worse 
and worse.

When those large segments have more than 10% deleted documents, expungeDeletes
will singleton merge them and they’ll gradually shrink.

So my prescription is:

1> set the max segment size back to 5G

2> monitor your segments. When you see your large segments  > 5G have 
more than 10% deleted documents, issue an expungeDeletes command (not optimize).
This will recover your index from the changes you’ve already made.

3> eventually, all your segments will be under 5G. When that happens, stop
issuing expungeDeletes.

4> gather some performance statistics and prove one way or another that as 
deleted
docs accumulate over time, it impacts performance. NOTE: after your last
expungeDeletes, deleted docs will accumulate over time until they reach a 
plateau and
shouldn’t continue increasing after that. If you can _prove_ that accumulating 
deleted
documents affects performance, institute a regular expungeDeletes. Optimize, but
expungeDeletes is less expensive and on a changing index expungeDeletes is
sufficient. Optimize is only really useful for a static index, so I’d avoid it 
in your
situation.

Best,
Erick

> On Oct 26, 2020, at 1:22 AM, Moulay Hicham  wrote:
> 
> Some large segments were merged into 12GB segments and
> deleted documents were physically removed.



Re: TieredMergePolicyFactory question

2020-10-23 Thread Erick Erickson
Well, you mentioned that the segments you’re concerned were merged a year ago.
If segments aren’t being merged, they’re pretty static.

There’s no real harm in optimizing _occasionally_, even in an NRT index. If you 
have
segments that were merged that long ago, you may be indexing continually but it
sounds like it’s a situation where you update more recent docs rather than 
random
ones over the entire corpus.

That caution is more for indexes where you essentially replace docs in your
corpus randomly, and it’s really about wasting a lot of cycles rather than
bad stuff happening. When you randomly update documents (or delete them),
the extra work isn’t worth it.

Either operation will involve a lot of CPU cycles and can require that you have
at least as much free space on your disk as the indexes occupy, so do be aware
of that.

All that said, what evidence do you have that this is worth any effort at all?
Depending on the environment, you may not even be able to measure
performance changes so this all may be irrelevant anyway.

But to your question. Yes, you can cause regular merging to more aggressively 
merge segments with deleted docs by setting the
deletesPctAllowed
in solrconfig.xml. The default value is 33, and you can set it as low as 20 or 
as
high as 50. We put
a floor of 20% because the cost starts to rise quickly if it’s lower than that, 
and
expungeDeletes is a better alternative at that point.
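
As a sketch, the setting goes in the <indexConfig> section of solrconfig.xml,
something like this (double-check the exact element name against the ref guide
for your version):

  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <double name="deletesPctAllowed">25</double>
  </mergePolicyFactory>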

This is not a hard number, and in practice the percentage of your index that 
consists
of deleted documents tends to be lower than this number, depending of course
on your particular environment.

Best,
Erick

> On Oct 23, 2020, at 12:59 PM, Moulay Hicham  wrote:
> 
> Thanks Eric.
> 
> My index is near real time and frequently updated.
> I checked this page
> https://lucene.apache.org/solr/guide/8_1/uploading-data-with-index-handlers.html#xml-update-commands
> and using forceMerge/expungeDeletes are NOT recommended.
> 
> So I was hoping that the change in mergePolicyFactory will affect the
> segments with high percent of deletes as part of the REGULAR segment
> merging cycles. Is my understanding correct?
> 
> 
> 
> 
> On Fri, Oct 23, 2020 at 9:47 AM Erick Erickson 
> wrote:
> 
>> Just go ahead and optimize/forceMerge, but do _not_ optimize to one
>> segment. Or you can expungeDeletes, that will rewrite all segments with
>> more than 10% deleted docs. As of Solr 7.5, these operations respect the 5G
>> limit.
>> 
>> See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
>> 
>> Best
>> Erick
>> 
>> On Fri, Oct 23, 2020, 12:36 Moulay Hicham  wrote:
>> 
>>> Hi,
>>> 
>>> I am using solr 8.1 in production. We have about 30%-50% of deleted
>>> documents in some old segments that were merged a year ago.
>>> 
>>> These segments size is about 5GB.
>>> 
>>> I was wondering why these segments have a high % of deleted docs and
>> found
>>> out that they are NOT being candidates for merging because the
>>> default TieredMergePolicy maxMergedSegmentMB is 5G.
>>> 
>>> So I have modified the TieredMergePolicyFactory config as below to
>>> lower the delete docs %
>>> 
>>> > class="org.apache.solr.index.TieredMergePolicyFactory">
>>>  10
>>>  10
>>>  12000
>>>  20
>>> 
>>> 
>>> 
>>> Do you see any issues with increasing the max merged segment to 12GB and
>>> lowered the deletedPctAllowed to 20%?
>>> 
>>> Thanks,
>>> 
>>> Moulay
>>> 
>> 



Re: TieredMergePolicyFactory question

2020-10-23 Thread Erick Erickson
Just go ahead and optimize/forceMerge, but do _not_ optimize to one
segment. Or you can expungeDeletes, that will rewrite all segments with
more than 10% deleted docs. As of Solr 7.5, these operations respect the 5G
limit.

See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
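
A hedged example of issuing it (collection name is made up); either the
request-parameter form:

  curl "http://localhost:8983/solr/yourCollection/update?commit=true&expungeDeletes=true"

or the XML equivalent, <commit expungeDeletes="true"/>, posted to the update
handler.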

Best
Erick

On Fri, Oct 23, 2020, 12:36 Moulay Hicham  wrote:

> Hi,
>
> I am using solr 8.1 in production. We have about 30%-50% of deleted
> documents in some old segments that were merged a year ago.
>
> These segments size is about 5GB.
>
> I was wondering why these segments have a high % of deleted docs and found
> out that they are NOT being candidates for merging because the
> default TieredMergePolicy maxMergedSegmentMB is 5G.
>
> So I have modified the TieredMergePolicyFactory config as below to
> lower the delete docs %
>
> 
>   10
>   10
>   12000
>   20
> 
>
>
> Do you see any issues with increasing the max merged segment to 12GB and
> lowered the deletedPctAllowed to 20%?
>
> Thanks,
>
> Moulay
>


Re: When are the score values evaluated?

2020-10-22 Thread Erick Erickson
You’d get a much better idea of what goes on
if you added debugQuery=true and analyzed the
“explain” output. That’d show you exactly what is
calculated when.
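
For example, a request built from the parameters in your mail (collection
name is made up):

  .../solr/yourCollection/select?q=test&defType=edismax&bq=foo:(1.0)^1.0&bf=sum(200)&debugQuery=true

The “explain” section of the response breaks each document’s score into its
individual contributions, so you can see exactly where the small bq boost
disappears next to the bf value.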

Best,
Erick

> On Oct 22, 2020, at 4:05 AM, Taisuke Miyazaki  
> wrote:
> 
> Hi,
> 
> If you use a high value for the score, the values on the smaller scale are
> ignored.
> 
> Example :
> bq = foo:(1.0)^1.0
> bf = sum(200)
> 
> When I do this, the additional score for "foo" at 1.0 does not affect the
> sort order.
> 
> I'm assuming this is an issue with the precision of the score floating
> point, is that correct?
> 
> As a test, if we change the query as follows, the order will change as you
> would expect, reflecting the additional score of "foo" when it is 1.0
> bq = foo:(1.0)^10
> bf = sum(200)
> 
> How can I avoid this?
> The idea I'm thinking of at the moment is to divide the whole thing by an
> appropriate number, such as bf= div(sum(200),100).
> However, this may or may not work as expected depending on when the
> floating point operations are done and rounded off.
> 
> At what point are score's floats rounded?
> 
> 1. when sorting
> 2. when calculating the score
> 3. when evaluating each function for each bq and bf
> 
> Regards,
> Taisuke



Re: Add single or batch document in Solr 6.1.0

2020-10-20 Thread Erick Erickson
Batching is better, see: 
https://lucidworks.com/post/really-batch-updates-solr-2/
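
Roughly, instead of calling add() once per document, collect the documents
and send them in chunks. A sketch using your CloudSolrClient (batch size is
arbitrary; assumes the java.util and SolrJ imports you already have):

  List<SolrInputDocument> batch = new ArrayList<>();
  for (SolrInputDocument doc : docs) {
      batch.add(doc);
      if (batch.size() >= 100) {   // send a full chunk
          cloudServer.add(batch);
          batch.clear();
      }
  }
  if (!batch.isEmpty()) {
      cloudServer.add(batch);      // send the remainder
  }

Your autoCommit/autoSoftCommit settings handle the commits either way.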

> On Oct 20, 2020, at 9:03 AM, vishal patel  
> wrote:
> 
> I am using solr 6.1.0. We have 2 shards and each has one replica.
> 
> I want to insert 100 documents in one collection. I am using below code.
> 
> org.apache.solr.client.solrj.impl.CloudSolrClient cloudServer = new 
> org.apache.solr.client.solrj.impl.CloudSolrClient(zkHost);
> cloudServer.setParallelUpdates(true);
> cloudServer.setDefaultCollection(collection);
> 
> I have 2 ways to add the documents. single or batch
> 1) cloudServer.add(SolrInputDocument); //loop of 100 documents
> 2) cloudServer.add(List); // 100 documents
> 
> Note: we are not using cloudServer.commit from application. we used below 
> configuration from solrconfig.xml
> 
> 60
>   2
>   false
> 
> 
>   1000
> 
> 2
> 
> Which one is better for performance oriented single or batch? which one is 
> faster for commit process?
> 
> Regards,
> Vishal
> 



Re: [EXT: NEWSLETTER] SolrDocument difference between String and text_general

2020-10-20 Thread Erick Erickson
Owen:

Collection reload is necessary but not sufficient. You’ll still get wonky 
results even if you re-index everything unless you delete _all_ the documents 
first or start with a whole new collection. Each Lucene index is a “mini index” 
with its own picture of the structure of that index (i.e. the schema in force 
when it was created). If you have segments created with the old schema and 
other segments with the new schema, when they get merged the result is 
undefined. It may not blow up, but it also won't do what you want.

Take your change from text to string type and the title “my dog has fleas”. In 
the segment with the field defined as a Text type, you’ll be able to search for 
“dog” and get the doc. Similarly for Dog (assuming you have lowercasing in your 
analysis chain). “has fleas” would hit, as would “dog fleas”~2. 

For the segment defined with String, you will only get a hit if you search for 
“my dog has fleas”. You wouldn’t find the doc if you searched for any of the 
following:
- my AND dog AND has AND fleas
- “My dog has fleas”
- fleas
- “dog has fleas my"

When those segments are merged, Lucene doesn’t have the information to “do the 
right thing”, and even if it did the cost would be prohibitive because it’d be 
like re-indexing all the docs in one segment or the other.

You cannot spoof this by simply reindexing the corpus over top of an existing 
index since that’ll involve a bunch of segment merges.

You’re seeing consistent results here because you started with a _new_ 
collection that had no old segments lying around.

Best,
Erick

> On Oct 20, 2020, at 4:37 AM, Cox, Owen  wrote:
> 
> Hi Konstantinos, I think you're onto something there.  I don't think the 
> collection was reloaded, I've just tried the same code against a different 
> collection that uses the same configset; only difference being this 
> collection was created after the schema changes.  That works, so it must've 
> been the reload that was missing.
> 
> Thanks!
> 
> Owen Cox
> Senior Consultant | Deloitte MCS Limited
> D: +44 20 7007 1657
> o...@deloitte.co.uk | www.deloitte.co.uk
> 
> 
> -Original Message-
> From: Konstantinos Koukouvis 
> Sent: 20 October 2020 09:04
> To: solr-user@lucene.apache.org
> Subject: [EXT: NEWSLETTER] Re: SolrDocument difference between String and 
> text_general
> 
> Hi Owen,
> 
> If I understand correctly you have changed the schema, then reloaded the core 
> and reindexed all data right? Cause whenever I got this error I’ve usually 
> forgotten to do one of those two things…
> 
> Regards,
> Konstantinos
> 
>> On 20 Oct 2020, at 09:53, Cox, Owen  wrote:
>> 
>> Hi folks,
>> 
>> I'm using Solr 8.5.2 and populating documents which include a string field 
>> called "title".  This field used to be text_general, but the data was 
>> reindexed and we've been inserting data happily with REST calls and it's 
>> been behaving as desired.
>> 
>> I've now written a Java Spring-Boot program to populate documents (snippet 
>> below) using SolrCrudRepository.  This works when I don't index the "title" 
>> field, but when I try include title I get the following error "cannot change 
>> field "title" from index options=DOCS_AND_FREQS_AND_POSITIONS to 
>> inconsistent index options=DOCS"
>> 
>> To me that looks like it's trying to index the title as text_general and 
>> store it in a string field.  But the Solr schema states that field is 
>> string, all of the data in it is string, and any other string field in the 
>> document which is string is indexed correctly.
>> 
>> Could there be any hanging reference to the field's type anywhere?  Or some 
>> requirement that a field named "title" is always text_general or something 
>> odd like that?
>> 
>> Any help appreciated, thanks
>> Owen
>> 
>> 
>> 
>> @Data
>> @SolrDocument(collection="mycollection")
>> public class Node {
>> 
>>   @Id
>>   @Field
>>   private String id;
>> 
>> 
>>   @Field
>>   private String title;
>> 
>> 
>> 
>> 

Re: Faceting on indexed=false stored=false docValues=true fields

2020-10-19 Thread Erick Erickson
uyilmaz:

Hmm, that _is_ confusing. And inaccurate.

In this context, it should read something like

The Text field should have indexed="true" docValues=“false" if used for 
searching 
but not faceting and the String field should have indexed="false" 
docValues=“true"
if used for faceting but not searching.

I’ll fix this, thanks for pointing this out.

Erick

> On Oct 19, 2020, at 1:42 PM, uyilmaz  wrote:
> 
> Thanks! This also contributed to my confusion:
> 
> https://lucene.apache.org/solr/guide/8_4/faceting.html#field-value-faceting-parameters
> 
> "If you want Solr to perform both analysis (for searching) and faceting on 
> the full literal strings, use the copyField directive in your Schema to 
> create two versions of the field: one Text and one String. Make sure both are 
> indexed="true"."
> 
> On Mon, 19 Oct 2020 13:08:00 -0400
> Alexandre Rafalovitch  wrote:
> 
>> I think this is all explained quite well in the Ref Guide:
>> https://lucene.apache.org/solr/guide/8_6/docvalues.html
>> 
>> DocValues is a different way to index/store values. Faceting is a
>> primary use case where docValues are better than what 'indexed=true'
>> gives you.
>> 
>> Regards,
>>   Alex.
>> 
>> On Mon, 19 Oct 2020 at 12:51, uyilmaz  wrote:
>>> 
>>> 
>>> Hey all,
>>> 
>>> From my little experiments, I see that (if I didn't make a stupid mistake) 
>>> we can facet on fields marked as both indexed and stored being false:
>>> 
>>> >> stored="false" docValues="true"/>
>>> 
>>> I'm suprised by this, I thought I would need to index it. Can you confirm 
>>> this?
>>> 
>>> Regards
>>> 
>>> --
>>> uyilmaz 
> 
> 
> -- 
> uyilmaz 



Re: converting string to solr.TextField

2020-10-17 Thread Erick Erickson
Did you read the long explanation in this thread already about
segment merging? If so, can you ask specific questions about
the information in those?

Best,
Erick

> On Oct 17, 2020, at 8:23 AM, Vinay Rajput  wrote:
> 
> Sorry to jump into this discussion. I also get confused whenever I see this
> strange Solr/Lucene behaviour. Probably, As @Erick said in his last year
> talk, this is how it has been designed to avoid many problems that are
> hard/impossible to solve.
> 
> That said, one more time I want to come back to the same question: why
> solr/lucene can not handle this when we are updating all the documents?
> Let's take a couple of examples :-
> 
> *Ex 1:*
> Let's say I have only 10 documents in my index and all of them are in a
> single segment (Segment 1). Now, I change the schema (update field type in
> this case) and reindex all of them.
> This is what (according to me) should happen internally :-
> 
> 1st update req : Solr will mark 1st doc as deleted and index it again
> (might run the analyser chain based on config)
> 2nd update req : Solr will mark 2st doc as deleted and index it again
> (might run the analyser chain based on config)
> And so on..
> based on autoSoftCommit/autoCommit configuration, all new documents will be
> indexed and probably flushed to disk as part of new segment (Segment 2)
> 
> 
> Now, whenever segment merging happens (during commit or later in time),
> lucene will create a new segment (Segment 3) can discard all the docs
> present in segment 1 as there are no live docs in it. And there would *NOT*
> be any situation to decide whether to choose the old config or new config
> as there is not even a single live document with the old config. Isn't it?
> 
> *Ex 2:*
> I see that it can be an issue if we think about reindexing millions of
> docs. Because in that case, merging can be triggered when indexing is half
> way through, and since there are some live docs in the old segment (with
> old cofig), things will blow up. Please correct me if I am wrong.
> 
> I am *NOT* a Solr/Lucene expert and just started learning the ways things
> are working internally. In the above example, I can be wrong at many
> places. Can someone confirm if scenarios like Ex-2 are the reasons behind
> the fact that even re-indexing all documents doesn't help if some
> incompatible schema changes are done?  Any other insight would also be
> helpful.
> 
> Thanks,
> Vinay
> 
> On Sat, Oct 17, 2020 at 5:48 AM Shawn Heisey  wrote:
> 
>> On 10/16/2020 2:36 PM, David Hastings wrote:
>>> sorry, i was thinking just using the
>>> *:*
>>> method for clearing the index would leave them still
>> 
>> In theory, if you delete all documents at the Solr level, Lucene will
>> delete all the segment files on the next commit, because they are empty.
>>  I have not confirmed with testing whether this actually happens.
>> 
>> It is far safer to use a new index as Erick has said, or to delete the
>> index directories completely and restart Solr ... so you KNOW the index
>> has nothing in it.
>> 
>> Thanks,
>> Shawn
>> 



Re: Index Replication Failure

2020-10-17 Thread Erick Erickson
None of your images made it through the mail server. You’ll
have to put them somewhere and provide a link.

> On Oct 17, 2020, at 5:17 AM, Parshant Kumar 
>  wrote:
> 
> Architecture image: If not visible in previous mail
> 
> 
> 
> 
> On Sat, Oct 17, 2020 at 2:38 PM Parshant Kumar  
> wrote:
> Hi all,
> 
> We are having solr architecture as below.
> 
> 
> 
> We are facing the frequent replication failure between master to repeater 
> server  as well as between repeater  to slave servers.
> On checking logs found every time one of the below  exceptions occurred 
> whenever the replication have failed. 
> 
> 1)
> 
> 2)
> 
> 
> 3)
> 
> 
> The replication configuration of master,repeater,slave's is given below:
> 
> 
> 
> Commit Configuration master,repeater,slave's is given below :
> 
> 
> 
> Replication between master and repeater occurs every 10 mins.
> Replication between repeater and slave servers occurs every 15 mins between 
> 4-7 am and after that in every 3 hours.
> 
> Thanks,
> Parshant Kumar
> 
> 
> 
> 
> 
> 
> 



Re: converting string to solr.TextField

2020-10-16 Thread Erick Erickson
Not sure what you’re asking here. re-indexing, as I was
using the term, means completely removing the index and
starting over. Or indexing to a new collection. At any
rate, starting from a state where there are _no_ segments.

I’m guessing you’re still thinking that re-indexing without
doing the above will work; it won’t. The way merging works,
it chooses segments based on a number of things, including
the percentage deleted documents. But there are still _other_
live docs in the segment.

Segment S1 has docs 1, 2, 3, 4 (old definition)
Segment S2 has docs 5, 6, 7, 8 (new definition)

Doc 2 is deleted, and S1 and S2 are merged into S3. The whole
discussion about not being able to do the right thing kicks in.
Should S3 use the new or old definition? Whichever one
it uses is wrong for the other segment. And remember,
Lucene simply _cannot_ “do the right thing” if the data
isn’t there.

What you may be missing is that a segment is a “mini-index”.
The underlying assumption is that all documents in that
segment are produced with the same schema and can be
accessed the same way. My comments about merging
“doing the right thing” is really about transforming docs
so all the docs can be treated the same. Which they can’t
if they were produced with different schemas.

Robert Muir’s statement is interesting here, built
on Mike McCandless’ comment:

"I think the key issue here is Lucene is an index not a database.
Because it is a lossy index and does not retain all of the user’s
data, its not possible to safely migrate some things automagically.
…. The function is y = f(x) and if x is not available its not 
possible, so lucene can't do it."

Don’t try to get around this. Prepare to
re-index the entire corpus into a new collection whenever
you change the schema and then maybe use an alias to
seamlessly convert from the user’s perspective. If you
simply cannot re-index from the system-of-record, you have
two choices:

1> use new collections whenever you need to change the
 schema and “somehow” have the app do different things
with the new and old collections

2> set stored=true for all your source fields (i.e. not
   copyField destination). You can either roll your own
   program that pulls data from the old and sends
   it to the new or use the Collections API REINDEXCOLLECTION
   API call. But note that it’s specifically called out
   in the docs that all fields must be stored to use the
API, what happens under the covers is that the 
 stored fields are read and sent to the target
   collection.

In both these cases, Robert’s comment doesn’t apply. Well,
it does apply but “if x is not available” is not the case,
the original _is_ available; it’s the stored data...
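
For the second option, the call is roughly (collection names are made up):

  .../solr/admin/collections?action=REINDEXCOLLECTION&name=oldCollection&target=newCollection

It reads the stored fields out of the source and indexes them into the
target, which is exactly why everything has to be stored.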

I’m over-stating the case somewhat, there are a few changes
that you can get away with re-indexing all the docs into an
existing index, things like changing from stored=true to 
stored=false, adding new fields, deleting fields (although the
meta-data for the field is still kept around) etc.

> On Oct 16, 2020, at 3:57 PM, David Hastings  
> wrote:
> 
> Gotcha, thanks for the explanation.  another small question if you
> dont mind, when deleting docs they arent actually removed, just tagged as
> deleted, and the old field/field type is still in the index until
> merged/optimized as well, wouldnt that cause almost the same conflicts
> until then?
> 
> On Fri, Oct 16, 2020 at 3:51 PM Erick Erickson 
> wrote:
> 
>> Doesn’t re-indexing a document just delete/replace….
>> 
>> It’s complicated. For the individual document, yes. The problem
>> comes because the field is inconsistent _between_ documents, and
>> segment merging blows things up.
>> 
>> Consider. I have segment1 with documents indexed with the old
>> schema (String in this case). I  change my schema and index the same
>> field as a text type.
>> 
>> Eventually, a segment merge happens and these two segments get merged
>> into a single new segment. How should the field be handled? Should it
>> be defined as String or Text in the new segment? If you convert the docs
>> with a Text definition for the field to String,
>> you’d lose the ability to search for individual tokens. If you convert the
>> String to Text, you don’t have any guarantee that the information is even
>> available.
>> 
>> This is just the tip of the iceberg in terms of trying to change the
>> definition of a field. Take the case of changing the analysis chain,
>> say you use a phonetic filter on a field then decide to remove it and
>> do not store the original. Erick might be encoded as “ENXY” so the
>> original data is simply not there to convert. Ditto removing a
>> stemmer, lowercasing, applying a regex, …...
>> 
>> 
>> From Mike McCandless:
>> 
>> "This really is the difference betwee

Re: Info about legacyMode cluster property

2020-10-16 Thread Erick Erickson
You should not be using the core api to do anything with cores in SolrCloud.

True, under the covers the collections API uses the core API to do its tricks,
but you have to use it in a very precise manner.
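
For example, to add another replica of a shard to a specific node, something
like (collection/shard/node names are made up):

  .../solr/admin/collections?action=ADDREPLICA&collection=myCollection&shard=shard1&node=server2:8983_solr

and let Solr take care of coreNodeName and the rest of the bookkeeping.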

As for legacyMode, don’t use it, please. it’s not supported any more, has
been completely removed in 9x.

Best,
Erick

> On Oct 16, 2020, at 2:38 PM, yaswanth kumar  wrote:
> 
> Can someone help on the above question?
> 
> On Thu, Oct 15, 2020 at 1:09 PM yaswanth kumar 
> wrote:
> 
>> Can someone explain what are the implications when we change
>> legacyMode=true on solr 8.2
>> 
>> We have migrated from solr 5.5 to solr 8.2 everything worked great but
> when we are trying to add a core to an existing collection with the core API
> create command, it asks us to pass the coreNodeName or to switch legacyMode to true.
> When we switched it worked fine. But we need to understand the
> cons, because it seems like this is false by default since Solr 7.
>> 
>> Sent from my iPhone
> 
> 
> 
> -- 
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com



Re: converting string to solr.TextField

2020-10-16 Thread Erick Erickson
Doesn’t re-indexing a document just delete/replace….

It’s complicated. For the individual document, yes. The problem
comes because the field is inconsistent _between_ documents, and
segment merging blows things up.

Consider. I have segment1 with documents indexed with the old
schema (String in this case). I  change my schema and index the same
field as a text type.

Eventually, a segment merge happens and these two segments get merged
into a single new segment. How should the field be handled? Should it
be defined as String or Text in the new segment? If you convert the docs
with a Text definition for the field to String,
you’d lose the ability to search for individual tokens. If you convert the
String to Text, you don’t have any guarantee that the information is even
available.

This is just the tip of the iceberg in terms of trying to change the 
definition of a field. Take the case of changing the analysis chain,
say you use a phonetic filter on a field then decide to remove it and
do not store the original. Erick might be encoded as “ENXY” so the 
original data is simply not there to convert. Ditto removing a 
stemmer, lowercasing, applying a regex, …...


From Mike McCandless:

"This really is the difference between an index and a database:
 we do not store, precisely, the original documents.  We store 
an efficient derived/computed index from them.  Yes, Solr/ES 
can add database-like behavior where they hold the true original 
source of the document and use that to rebuild Lucene indices 
over time.  But Lucene really is just a "search index" and we 
need to be free to make important improvements with time."

And all that aside, you have to re-index all the docs anyway or
your search results will be inconsistent. So leaving aside the 
impossible task of covering all the possibilities on the fly, it’s
better to plan on re-indexing….

Best,
Erick


> On Oct 16, 2020, at 3:16 PM, David Hastings  
> wrote:
> 
> "If you want to
> keep the same field name, you need to delete all of the
> documents in the index, change the schema, and reindex."
> 
> actually doesnt re-indexing a document just delete/replace anyways assuming
> the same id?
> 
> On Fri, Oct 16, 2020 at 3:07 PM Alexandre Rafalovitch 
> wrote:
> 
>> Just as a side note,
>> 
>>> indexed="true"
>> If you are storing 32K message, you probably are not searching it as a
>> whole string. So, don't index it. You may also want to mark the field
>> as 'large' (and lazy):
>> 
>> https://lucene.apache.org/solr/guide/8_2/field-type-definitions-and-properties.html#field-default-properties
>> 
>> When you are going to make it a text field, you will probably be
>> having the same issues as well.
>> 
>> And honestly, if you are not storing those fields to search, maybe you
>> need to consider the architecture. Maybe those fields do not need to
>> be in Solr at all, but in external systems. Solr (or any search
>> system) should not be your system of records since - as the other
>> reply showed - some of the answers are "reindex everything".
>> 
>> Regards,
>>   Alex.
>> 
>> On Fri, 16 Oct 2020 at 14:02, yaswanth kumar 
>> wrote:
>>> 
>>> I am using solr 8.2
>>> 
>>> Can I change the schema fieldtype from string to solr.TextField
>>> without reindexing?
>>> 
>>>> stored="true"/>
>>> 
>>> The reason is that string has only 32K char limit where as I am looking
>> to
>>> store more than 32K now.
>>> 
>>> The contents on this field doesn't require any analysis or tokenized but
>> I
>>> need this field in the queries and as well as output fields.
>>> 
>>> --
>>> Thanks & Regards,
>>> Yaswanth Kumar Konathala.
>>> yaswanth...@gmail.com
>> 



Re: Rotate Solr Logfiles

2020-10-15 Thread Erick Erickson
Possibly browser caches? Try using a private window or purging your browser 
caches.

Shot in the dark…

Best,
Erick

> On Oct 15, 2020, at 5:41 AM, DINSD | SPAutores 
>  wrote:
> 
> Hi,
> 
> I'm currently using Solr-8.5.2 with a default log4j2.xml and trying to do the 
> following :
> 
> Each time an event appears in the log file, I check, rectify if necessary, and 
> clean (Rotate) the logfiles.
> 
> This causes the solr.log file to run out of data, which is what we want.
> 
> However if I use the web interface, "localhost: port", and view the "Logging" 
> option,
> messages remain visible.
> 
> Any idea why this is happening, or other way to "force rotate"?
> 
> Best regards
> Rui
> 
> 
> Rui Pimentel
> 
> 
> 
> DINSD - Departamento de Informática / SPA Digital
> Av. Duque de Loulé, 31 - 1069-153 Lisboa  PORTUGAL
> T (+ 351) 21 359 44 36 / (+ 351) 21 359 44 00  F (+ 351) 21 353 02 57
>  informat...@spautores.pt
>  www.SPAutores.pt
> 



Re: Migrating data from Solr 1.3 to Solr 8.4

2020-10-14 Thread Erick Erickson
Kaya is certainly correct. I’d add that Solr has changed so much
in the last 12 years that you should treat it as a new Solr installation.
Do not, for instance, just use the configs from Solr 1.3, start
with the ones from the version of Solr you install.

Best,
Erick

> On Oct 14, 2020, at 3:41 AM, Kayak28  wrote:
> 
> Hi,
> 
> Replacing files under the data directory won't work as you expect.
> Solr has changed its index format,
> so whenever you consider upgrading, you are strongly recommended to
> re-index all of your documents for your safety.
> 
> FYI: not only index format, but also other things have been changed from
> Solr1.3 to Solr 8.4.
> You have to pay attention to managed-schema, which is the replacement of
> the schema.xml.
> Additionally, scoring and ranking will never be the same between Solr
> 1.3 and 8.4
> 
> I strongly recommend taking a look at changes.txt.
> 
> Good Luck with your upgrading.
> 
> 
> Sincerely,
> Kaya Ota
> 
> 
> 
> 
> 
> 
> On Mon, Oct 12, 2020 at 19:57, 阿部真也 wrote:
> 
>> I need to rebuild an old system that uses Solr into a new system and
>> migrate its data.
>> 
>> The current system runs Solr 1.3, and in the newly built system I am
>> planning to upgrade Solr to version 8.4.
>> I would like to know whether moving the data under /var/solr/{system_name}/data
>> from the old system to the new system will work.
>> 
>> I already know that most of the settings in solrconfig.xml cannot be
>> migrated, but since the error log suggests alternatives for the setting
>> names in use, I think that part may be manageable.
>> 
>> Thank you in advance.
>> 
> 
> 
> -- 
> 
> Sincerely,
> Kaya
> github: https://github.com/28kayak



Re: Need urgent help -- High cpu on solr

2020-10-14 Thread Erick Erickson
Zisis makes good points. One other thing is I’d look to 
see if the CPU spikes coincide with commits. But GC
is where I’d look first.

Continuing on with the theme of caches, yours are far too large
at first glance. The default is, indeed, size=512. Every time
you open a new searcher, you’ll be executing 128 queries
for autowarming the filterCache and another 128 for the queryResultCache.
autowarming alone might be accounting for it. I’d reduce
the size back to 512 and an autowarm count nearer 16
and monitor the cache hit ratio. There’s little or no benefit
in squeezing the last few percent from the hit ratio. If your
hit ratio is small even with the settings you have, then your caches
don’t do you much good anyway so I’d make them much smaller.
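
For reference, something much closer to the defaults in solrconfig.xml would
look like this (the cache class depends on your Solr version, e.g.
FastLRUCache/LRUCache on older releases, so treat this as a sketch):

   <filterCache class="solr.CaffeineCache" size="512" initialSize="512" autowarmCount="16"/>
   <queryResultCache class="solr.CaffeineCache" size="512" initialSize="512" autowarmCount="16"/>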

You haven’t told us how often your indexes are
updated, which will be a significant CPU hit due to
your autowarming.

Once you’re done with that, I’d then try reducing the heap. Most
of the actual searching is done in Lucene via MMapDirectory,
which resides in the OS memory space. See:

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Finally, if it is GC, consider G1GC if you’re not using that
already.

Best,
Erick


> On Oct 14, 2020, at 7:37 AM, Zisis T.  wrote:
> 
> The values you have for the caches and the maxwarmingsearchers do not look
> like the default. Cache sizes are 512 for the most part and
> maxwarmingsearchers are 2 (if not limit them to 2)
> 
> Sudden CPU spikes probably indicate GC issues. The #  of documents you have
> is small, are they huge documents? The # of collections is OK in general but
> since they are crammed in 5 Solr nodes the memory requirements might be
> bigger. Especially if filter and the other caches get populated with 50K
> entries. 
> 
> I'd first go through the GC activity to make sure that this is not causing
> the issue. The fact that you lose some Solr servers is also an indicator of
> large GC pauses that might create a problem when Solr communicates with
> Zookeeper. 
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Memory line in status output

2020-10-12 Thread Erick Erickson
Solr doesn’t manage this at all, it’s the JVM’s garbage collection
that occasionally kicks in. In general, memory creeps up until
a GC threshold is reached (and there are about a zillion
parameters you can set) and then GC kicks in.

Generally, the recommendation is to use the G1GC collector
and just leave the default settings as they are.

It’s usually a mistake, BTW, to over-allocate memory. You should shrink the
heap as far as you can and still maintain a reasonable safety margin. See:

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

What’s a “reasonable safety margin”? Unfortunately you have to experiment.

Best,
Erick

> On Oct 12, 2020, at 10:33 AM, Ryan W  wrote:
> 
> Hi all,
> 
> What is the meaning of the "memory" line in the output when I run the solr
> status command?  What controls whether that memory gets exhausted?  At
> times if I run "solr status" over and over, that memory number creeps up
> and up and up.  Presumably it is not a good thing if it moves all the way
> up to my 31GB capacity.  What controls whether that happens?  How do I
> prevent that?  Or does Solr manage this automatically?
> 
> 
> $ /opt/solr/bin/solr status
> 
> Found 1 Solr nodes:
> 
> Solr process 101530 running on port 8983
> {
>  "solr_home":"/opt/solr/server/solr",
>  "version":"7.7.2 d4c30fc2856154f2c1fefc589eb7cd070a415b94 - janhoy -
> 2019-05-28 23:37:48",
>  "startTime":"2020-10-12T12:04:57.379Z",
>  "uptime":"0 days, 1 hours, 46 minutes, 41 seconds",
>  "memory":"3.3 GB (%10.7) of 31 GB"}



Re: Any solr api to force leader on a specified node

2020-10-12 Thread Erick Erickson
First, I totally agree with Walter. See: 
https://lucidworks.com/post/indexing-with-solrj/

Second, DIH is being deprecated. It is being moved to
a package that will be supported if, and only if, there is
enough community support for it. “Community support” 
means people who use it need to step up and maintain
it.

Third, there’s nothing in Solr that requires DIH to
be run on a leader so your premise is wrong. You need
to look at your logs to see what’s happening there. It
should be perfectly fine to run it on a replica.

Best,
Erick

> On Oct 11, 2020, at 11:53 PM, Walter Underwood  wrote:
> 
> Don’t use DIH. DIH has a lot of limitations and problems, as you are 
> discovering.
> 
> Write a simple program that fetches from the database and sends documents 
> in batches to Solr. I did this before DIH was invented (Solr 1.3) and I’m 
> doing it
> now.
> 
> You can send the updates to the load balancer for the Solr Cloud cluster. The
> updates will be automatically routed to the right leader. It is very fast.
> 
> My loader is written in Python and I don’t even bother with a special Solr 
> library.
> It just sends JSON to the update handler with the right options.
> 
> We do this for all of our clusters. Our biggest one is 48 hosts with 55 
> million
> documents.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Oct 11, 2020, at 8:40 PM, yaswanth kumar  wrote:
>> 
>> Hi wunder 
>> 
>> Thanks for replying on this..
>> 
>> I did setup solr cloud with 4 nodes being one node having DIH configured 
>> that pulls data from ms sql every minute.. if I install DIH on rest of the 
>> nodes it’s causing connection issues on the source dB which I don’t want and 
>> manage with only one sever polling dB while rest are used as replicas for 
>> search.
>> 
>> So now everything works fine but when the severs are rebooted for 
>> maintenance and once they come back and if the leader is not the one that 
>> doesn’t have DIH it stops pulling data from sql , so that’s the reason why I 
>> want always to force a node as leader
>> 
>> Sent from my iPhone
>> 
>>> On Oct 11, 2020, at 11:05 PM, Walter Underwood  
>>> wrote:
>>> 
>>> That requirement is not necessary. Let Solr choose a leader.
>>> 
>>> Why is someone making this bad requirement?
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
 On Oct 11, 2020, at 8:01 PM, yaswanth kumar  wrote:
 
 Can someone pls help me to know if there is any solr api /config where we 
 can make sure to always opt leader on a particular solr node in solr 
 cloud??
 
 Using solr 8.2 and zoo 3.4
 
 I have four nodes and my requirement is to always make a particular node 
 as leader
 
 Sent from my iPhone
>>> 
> 



Re: Solr Memory

2020-10-10 Thread Erick Erickson
_Have_ they crashed due to OOMs? It’s quite normal for Java to create
a sawtooth pattern of memory consumption. If you attach, say, jconsole
to the running Solr and hit the GC button, does the memory drop back?

To answer your question, though, no there’s no reason memory should creep.
That said, the scenario you describe is not a “normal” scenario,  in general
collection creation/deletion is a fairly rare operation, so any memory leaks
in that code wouldn’t have jumped out at everyone the same way, say,
a memory leak when searching would.

A heap dump would be useful, but do use something to force a global GC
first.

Best,
Erick

> On Oct 9, 2020, at 5:45 PM, Kevin Van Lieshout  
> wrote:
> 
> Hi,
> 
> I use solr for distributed indexing in cloud mode. I run solr in kubernetes
> on a 72 core, 256 GB sever. In the work im doing, i benchmark index times
> so we are constantly indexing, and then deleting the collection, etc for
> accurate benchmarking on certain size of GB. In theory, this should not
> cause a memory build up but it does. As we index more and more (create
> collections) and then delete the collections, we are still seeing a build
> up in percentages from kubernetes metric tracking of our server. We are
> running Solr 7.6 and ZK 3.5.5. Is there any reason why collections are
> being "deleted" but data stays persistent on the shards that do not release
> memory, therefore causing a build up and then solr shards will crash for
> OOM reasons even if they have no collections or "data" on them after we
> delete each time.
> 
> Let me know if anyone has seen this. Thanks
> 
> Kevin



Re: PositionGap

2020-10-09 Thread Erick Erickson
No. It’s just an integer multiplication. X * 5 is no different than X*1...

> On Oct 9, 2020, at 2:48 PM, Jae Joo  wrote:
> 
> Does increasing of Position Gap make Search Slow?
> 
> Jae



Re: Folding Repeated Letters

2020-10-09 Thread Erick Erickson
Anything you do will be wrong ;).

I suppose you could kick out words that weren’t in some dictionary and 
accumulate a list of words not in the dictionary and just deal with them 
“somehow", but that’s labor-intensive since you then have to deal with proper 
names and the like. Sometimes you can get by with ignoring words with _only_ 
the first letter capitalized, which is also not perfect but might get you 
closer. You mentioned phonetic filters, but frankly I have no idea whether YES 
and YY would reduce to the same code, I rather doubt it.

In general, you _can’t_ solve this problem perfectly without inspecting each 
input, you can only get an approximation. And at some point it’s worth asking 
“is it worth it?”. I suppose you could try the regex Andy suggested in a 
copyField destination and use that as well as the primary field in queries, 
that might help at least find things like this.

If we were just able to require humans to use proper spelling, this would be a 
lot easier….

Wish there were a solution

Best,
Erick

> On Oct 8, 2020, at 10:59 PM, Mike Drob  wrote:
> 
> I was thinking about that, but there are words that are legitimately
> different with repeated consonants. My primary school teacher lost hair
> over getting us to learn the difference between desert and dessert.
> 
> Maybe we need something that can borrow the boosting behaviour of fuzzy
> query - match the exact term, but also the neighbors with a slight deboost,
> so that if the main term exists those others won't show up.
> 
> On Thu, Oct 8, 2020 at 5:46 PM Andy Webb  wrote:
> 
>> How about something like this?
>> 
>> {
>>"add-field-type": [
>>{
>>"name": "norepeat",
>>"class": "solr.TextField",
>>"analyzer": {
>>"tokenizer": {
>>"class": "solr.StandardTokenizerFactory"
>>},
>>"filters": [
>>{
>>"class": "solr.LowerCaseFilterFactory"
>>},
>>{
>>"class": "solr.PatternReplaceFilterFactory",
>>"pattern": "(.)\\1+",
>>"replacement": "$1"
>>}
>>]
>>}
>>}
>>]
>> }
>> 
>> This finds a match...
>> 
>> http://localhost:8983/solr/#/norepeat/analysis?analysis.fieldvalue=Yes&analysis.query=YyyeeEssSs&analysis.fieldtype=norepeat
>> 
>> Andy
>> 
>> 
>> 
>> On Thu, 8 Oct 2020 at 23:02, Mike Drob  wrote:
>> 
>>> I'm looking for a way to transform words with repeated letters into the
>>> same token - does something like this exist out of the box? Do our
>> stemmers
>>> support it?
>>> 
>>> For example, say I would want all of these terms to return the same
>> search
>>> results:
>>> 
>>> YES
>>> YESSS
>>> YYYEEESSS
>>> YYEE[...]S
>>> 
>>> I don't know how long a user would hold down the S key at the end to
>>> capture their level of excitement, and I don't want to manually define
>>> synonyms for every length.
>>> 
>>> I'm pretty sure that I don't want PhoneticFilter here, maybe
>>> PatternReplace? Not a huge fan of how that one is configured, and I think
>>> I'd have to set up a bunch of patterns inline for it?
>>> 
>>> Mike
>>> 
>> 



Re: Question about solr commits

2020-10-08 Thread Erick Erickson
This is a bit confused. There will be only one timer that starts at time T when
the first doc comes in. At T+ 15 seconds, all docs that have been received since
time T will be committed. The first doc to hit Solr _after_ T+15 seconds starts
a single new timer and the process repeats.

Best,
Erick

> On Oct 8, 2020, at 2:26 PM, Rahul Goswami  wrote:
> 
> Shawn,
> So if the autoCommit interval is 15 seconds, and one update request arrives
> at t=0 and another at t=10 seconds, then will there be two timers one
> expiring at t=15 and another at t=25 seconds, but this would amount to ONLY
> ONE commit at t=15 since that one would include changes from both updates.
> Is this understanding correct ?
> 
> Thanks,
> Rahul
> 
> On Wed, Oct 7, 2020 at 11:39 PM yaswanth kumar 
> wrote:
> 
>> Thank you very much both Eric and Shawn
>> 
>> Sent from my iPhone
>> 
>>> On Oct 7, 2020, at 10:41 PM, Shawn Heisey  wrote:
>>> 
>>> On 10/7/2020 4:40 PM, yaswanth kumar wrote:
 I have the below in my solrconfig.xml
 

 <updateHandler class="solr.DirectUpdateHandler2">
   <updateLog>
     <str name="dir">${solr.Data.dir:}</str>
   </updateLog>
   <autoCommit>
     <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
     <openSearcher>false</openSearcher>
   </autoCommit>
   <autoSoftCommit>
     <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
   </autoSoftCommit>
 </updateHandler>
 Does this mean even though we are always sending data with commit=false
>> on
 update solr api, the above should do the commit every minute (60000 ms)
 right?
>>> 
>>> Assuming that you have not defined the "solr.autoCommit.maxTime" and/or
>> "solr.autoSoftCommit.maxTime" properties, this config has autoCommit set to
>> 60 seconds without opening a searcher, and autoSoftCommit set to 5 seconds.
>>> 
>>> So five seconds after any indexing begins, Solr will do a soft commit.
>> When that commit finishes, changes to the index will be visible to
>> queries.  One minute after any indexing begins, Solr will do a hard commit,
>> which guarantees that data is written to disk, but it will NOT open a new
>> searcher, which means that when the hard commit happens, any pending
>> changes to the index will not be visible.
>>> 
>>> It's not "every five seconds" or "every 60 seconds" ... When any changes
>> are made, Solr starts a timer.  When the timer expires, the commit is
>> fired.  If no changes are made, no commits happen, because the timer isn't
>> started.
>>> 
>>> Thanks,
>>> Shawn
>> 



Re: Master/Slave

2020-10-08 Thread Erick Erickson
What Jan said. I wanted to add that the replication API also makes use of it. A 
little-known fact: you can use the replication API in SolrCloud _without_ 
configuring anything in solrconfig.xml. You can specify the URL to pull from on 
the fly in the command….
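
For example, something like this (host and core names are hypothetical) pulls
an index into a core on demand:

   http://localhost:8983/solr/target_core/replication?command=fetchindex&masterUrl=http://other_host:8983/solr/source_core

masterUrl is the classic parameter name; check your version’s ref guide since
the parameter was renamed along with the leader/follower terminology change.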

Best,
Erick

> On Oct 8, 2020, at 2:54 AM, Jan Høydahl  wrote:
> 
> The API that enables master/slave is the ReplicationHandler, where the 
> follower (slave) pulls index files from leader (master).
> This same API is used in SolrCloud for the PULL replica type, and also as a 
> fallback for full recovery if transaction log is not enough. 
> So I don’t see it going away anytime soon, even if the non-cloud deployment 
> style is less promoted in the documentation.
> 
> Jan
> 
>> 6. okt. 2020 kl. 16:25 skrev Oakley, Craig (NIH/NLM/NCBI) [C] 
>> :
>> 
>>> it better not ever be depreciated.  it has been the most reliable mechanism 
>>> for its purpose
>> 
>> I would like to know whether that is the consensus of Solr developers.
>> 
>> We had been scrambling to move from Master/Slave to CDCR based on the 
>> assertion that CDCR support would last far longer than Master/Slave support.
>> 
>> Can we now assume safely that this assertion is now completely moot? Can we 
>> now assume safely that Master/Slave is likely to be supported for the 
>> foreseeable future? Or are we forced to assume that Master/Slave support 
>> will evaporate shortly after the now-evaporated CDCR support?
>> 
>> -Original Message-
>> From: David Hastings  
>> Sent: Wednesday, September 30, 2020 3:10 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Master/Slave
>> 
>>> whether we should expect Master/Slave replication also to be deprecated
>> 
>> it better not ever be depreciated.  it has been the most reliable mechanism
>> for its purpose, solr cloud isnt going to replace standalone, if it does,
>> thats when I guess I stop upgrading or move to elastic
>> 
>> On Wed, Sep 30, 2020 at 2:58 PM Oakley, Craig (NIH/NLM/NCBI) [C]
>>  wrote:
>> 
>>> Based on the thread below (reading "legacy" as meaning "likely to be
>>> deprecated in later versions"), we have been working to extract ourselves
>>> from Master/Slave replication
>>> 
>>> Most of our collections need to be in two data centers (a read/write copy
>>> in one local data center: the disaster-recovery-site SolrCloud could be
>>> read-only). We also need redundancy within each data center for when one
>>> host or another is unavailable. We implemented this by having different
>>> SolrClouds in the different data centers; with Master/Slave replication
>>> pulling data from one of the read/write replicas to each of the Slave
>>> replicas in the disaster-recovery-site read-only SolrCloud. Additionally,
>>> for some collections, there is a desire to have local read-only replicas
>>> remain unchanged for querying during the loading process: for these
>>> collections, there is a local read/write loading SolrCloud, a local
>>> read-only querying SolrCloud (normally configured for Master/Slave
>>> replication from one of the replicas of the loader SolrCloud to both
>>> replicas of the query SolrCloud, but with Master/Slave disabled when the
>>> load was in progress on the loader SolrCloud, and with Master/Slave resumed
>>> after the loaded data passes QA checks).
>>> 
>>> Based on the thread below, we made an attempt to switch to CDCR. The main
>>> reason for wanting to change was that CDCR was said to be the supported
>>> mechanism, and the replacement for Master/Slave replication.
>>> 
>>> After multiple unsuccessful attempts to get CDCR to work, we ended up with
>>> reproducible cases of CDCR loosing data in transit. In June, I initiated a
>>> thread in this group asking for clarification of how/whether CDCR could be
>>> made reliable. This seemed to me to be met with deafening silence until the
>>> announcement in July of the release of Solr8.6 and the deprecation of CDCR.
>>> 
>>> So we are left with the question whether we should expect Master/Slave
>>> replication also to be deprecated; and if so, with what is it expected to
>>> be replaced (since not with CDCR)? Or is it now sufficiently safe to assume
>>> that Master/Slave replication will continue to be supported after all
>>> (since the assertion that it would be replaced by CDCR has been
>>> discredited)? In either case, are there other suggested implementations of
>>> having a read-only SolrCloud receive data from a read/write SolrCloud?
>>> 
>>> 
>>> Thanks
>>> 
>>> -Original Message-
>>> From: Shawn Heisey 
>>> Sent: Tuesday, May 21, 2019 11:15 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: SolrCloud (7.3) and Legacy replication slaves
>>> 
>>> On 5/21/2019 8:48 AM, Michael Tracey wrote:
 Is it possible set up an existing SolrCloud cluster as the master for
 legacy replication to a slave server or two?   It looks like another
>>> option
 is to use Uni-direction CDCR, but not sure what is the best option in
>>> this
 case.
>>> 
>>

Re: Question about solr commits

2020-10-07 Thread Erick Erickson
Yes.

> On Oct 7, 2020, at 6:40 PM, yaswanth kumar  wrote:
> 
> I have the below in my solrconfig.xml
> 
> <updateHandler class="solr.DirectUpdateHandler2">
>   <updateLog>
>     <str name="dir">${solr.Data.dir:}</str>
>   </updateLog>
>   <autoCommit>
>     <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
>     <openSearcher>false</openSearcher>
>   </autoCommit>
>   <autoSoftCommit>
>     <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
>   </autoSoftCommit>
> </updateHandler>
> 
> Does this mean even though we are always sending data with commit=false on
> update solr api, the above should do the commit every minute (60000 ms)
> right?
> 
> -- 
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com



Re: Java GC issue investigation

2020-10-06 Thread Erick Erickson
12G is not that huge, it’s surprising that you’re seeing this problem.

However, there are a couple of things to look at:

1> If you’re saying that you have 16G total physical memory and are allocating 
12G to Solr, that’s an anti-pattern. See: 
https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
If at all possible, you should allocate between 25% and 50% of your physical 
memory to Solr...

2> what garbage collector are you using? G1GC might be a better choice.
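
If you want to experiment, the GC_TUNE variable in solr.in.sh (solr.in.cmd on
Windows) is the usual knob, e.g. (a bare-bones sketch, not a tuned
recommendation):

   GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"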

> On Oct 6, 2020, at 10:44 AM, matthew sporleder  wrote:
> 
> Your index is so small that it should easily get cached into OS memory
> as it is accessed.  Having a too-big heap is a known problem
> situation.
> 
> https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-HowmuchheapspacedoIneed?
> 
> On Tue, Oct 6, 2020 at 9:44 AM Karol Grzyb  wrote:
>> 
>> Hi Matthew,
>> 
>> Thank you for the answer, I cannot reproduce the setup locally I'll
>> try to convince them to reduce Xmx, I guess they will rather not agree
>> to 1GB but something less than 12G for sure.
>> And have some proper dev setup because for now we could only test prod
>> or stage which are difficult to adjust.
>> 
>> Is being stuck in GC common behaviour when the index is small compared
>> to available heap during bigger load? I was more worried about the
>> ratio of heap to total host memory.
>> 
>> Regards,
>> Karol
>> 
>> 
>> wt., 6 paź 2020 o 14:39 matthew sporleder  napisał(a):
>>> 
>>> You have a 12G heap for a 200MB index?  Can you just try changing Xmx
>>> to, like, 1g ?
>>> 
>>> On Tue, Oct 6, 2020 at 7:43 AM Karol Grzyb  wrote:
 
 Hi,
 
 I'm involved in investigation of issue that involves huge GC overhead
 that happens during performance tests on Solr Nodes. Solr version is
 6.1. Last test were done on staging env, and we run into problems for
 <100 requests/second.
 
 The size of the index itself is ~200MB ~ 50K docs
 Index has small updates every 15min.
 
 
 
 Queries involve sorting and faceting.
 
 I've gathered some heap dumps, I can see from them that most of heap
 memory is retained because of object of following classes:
 
 -org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector
 (>4G, 91% of heap)
 -org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector$SearchGroupDocs
 -org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue
 -org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector
 (>3.7G 76% of heap)
 
 
 
 Based on information above is there anything generic that can been
 looked at as source of potential improvement without diving deeply
 into schema and queries (which may be very difficlut to change at this
 moment)? I don't see docvalues being enabled - could this help, as if
 I get the docs correctly, it's specifically helpful when there are
 many sorts/grouping/facets? Or I
 
 Additionaly I see, that many threads are blocked on LRUCache.get,
 should I recomend switching to FastLRUCache?
 
 Also, I wonder if -Xmx12288m for java heap is not too much for 16G
 memory? I see some (~5/s) page faults in Dynatrace during the biggest
 traffic.
 
 Thank you very much for any help,
 Kind regards,
 Karol



Re: Solr Issue - Logger : UpdateLog Error Message : java.io.EOFException

2020-10-03 Thread Erick Erickson
Very strange things start to happen when GC becomes unstable. The first and 
simplest thing to do would be to bump up your heap, say to 20g (note, try to 
stay under 32G or be prepared to jump significantly higher. At 32G long 
pointers have to be used and you actually have less memory available than you 
think).

The first three warnings indicate that you have both managed-schema and 
schema.xml in your configset _and_ are using the managed schema (enabled in 
solrconfig.xml). Which also suggests you’re upgrading from a previous version. 
This is printed out in as a courtesy notification that schema.xml is no longer 
being used so you should delete it to avoid confusion. NOTE: if you want to use 
schema.xml like you have before, see the reference guide.

The fourth warning suggests that you have killed Solr without committing and 
it’s replaying the transaction log. For instance, “kill -9” or other will do 
it. If you do something like that before a commit completes, updates are 
replayed from the tlog in order to preserve data.

Which leads to your second issue. I’d guess either you’re not committing after 
your updates (and, BTW, please just let your autocommit settings handle that), 
and forcefully killing Solr (e.g. kill -9). That can happen even if you use the 
“bin/solr stop” command if it takes too long (3 minutes by default last I 
knew). A “normal” shutdown that succeeds (i.e. bin/solr stop that doesn’t print 
a message about forcefully killing Solr” will commit on shutdown BTW. Taking 
over 3 minutes may be a symptom of GC going crazy.

You should to try to figure out why you have this kind of memory spike, 
returning a zillion documents is one possible cause (i.e. rows=100 or 
something). All the docs have to be assembled in memory, so if you need to 
return lots of rows, use streaming or cursorMark.

So what I’d do:
1> bump up your heap
2> insure that you shut Solr down gracefully
3> see if any particular query triggers this memory spike and if you’re using 
an anti-pattern.

Best,
Erick

> On Oct 2, 2020, at 7:10 PM, Training By Coding  
> wrote:
> 
> Events:
>   • GC logs showing continuous Full GC events. Log report attached.
>   • Core filling failed , showing less data( Num Docs)  than expected.
>   • following warnings showing on dashboard before error.
> 
> Level Logger  Message
> WARN falseManagedIndexSchemaFactory   The schema has been upgraded to 
> managed, but the non-managed schema schema.xml is still loadable. PLEASE 
> REMOVE THIS FILE.
> WARN falseManagedIndexSchemaFactory   The schema has been upgraded to 
> managed, but the non-managed schema schema.xml is still loadable. PLEASE 
> REMOVE THIS FILE.
> WARN falseSolrResourceLoader  Solr loaded a deprecated 
> plugin/analysis class [solr.TrieDateField]. Please consult documentation how 
> to replace it accordingly.
> WARN falseManagedIndexSchemaFactory   The schema has been upgraded to 
> managed, but the non-managed schema schema.xml is still loadable. PLEASE 
> REMOVE THIS FILE.
> WARN falseUpdateLog   Starting log replay 
> tlog{file=\data\tlog\tlog.0445482 refcount=2} 
> active=false starting pos=0 inSortedOrder=false
>   • Total data in all cores around 8 GB
>   • Other Configurations:
>   • -XX:+UseG1GC
>   • -XX:+UseStringDeduplication
>   • -XX:MaxGCPauseMillis=500
>   • -Xms15g
>   • -Xmx15g
>   • -Xss256k
>   • OS Environment :
>   • Windows 10,
>   • Filling cores by calling SQL query using jtds-1.3.1 library.
>   • Solr Version 7.5
>   • Runtime: Oracle Corporation OpenJDK 64-Bit Server VM 11.0.2 
> 11.0.2+9
>   • Processors : 48
>   • System Physical Memory : 128 GB
>   • Swap Space : 256GB
>   • solr-spec7.5.0
>   • solr-impl7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55
>   • lucene-spec7.5.0
>   • lucene-impl7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:01:1
> Error Message : 
> java.io.EOFException
> at 
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:168)
> at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:863)
> at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:857)
> at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:266)
> at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
> at 
> org.apache.solr.common.util.JavaBinCodec.readSolrInputDocument(JavaBinCodec.java:603)
> at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:315)
> at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)
> at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:747)
> at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCo

Re: Transaction not closed on ms sql

2020-10-01 Thread Erick Erickson
First of all, I’d just use a stand-alone program to do your 
processing for a number of reasons, see:

https://lucidworks.com/post/indexing-with-solrj/

1- I suspect your connection will be closed eventually. Since it’s expensive to
open one of these, the driver may keep it open for a while.

2 - This is one of the reasons I’d go to something outside Solr. The
 link above gives you a skeletal program that’ll show you how. It
 has the usual problem of demo code, it needs more error checking
and the like.

3 - see TolerantUpdateProcessor(Factory).
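
A minimal chain definition in solrconfig.xml would look something like this
(the chain name and maxErrors value are placeholders):

   <updateRequestProcessorChain name="tolerant-chain">
     <processor class="solr.TolerantUpdateProcessorFactory">
       <int name="maxErrors">10</int>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory"/>
   </updateRequestProcessorChain>

Send updates with update.chain=tolerant-chain and the response will list the
documents that were rejected instead of failing the whole batch.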

Best,
Erick

> On Sep 30, 2020, at 10:43 PM, yaswanth kumar  wrote:
> 
> Can some one help in troubleshooting some issues that happening from DIH??
> 
> Solr version: 8.2; zookeeper 3.4
> Solr cloud with 4 nodes and 3 zookeepers
> 
> 1. Configured DIH for ms sql with mssql jdbc driver, and when trying to pull 
> the data from mssql it’s connecting and fetching records but we do see the 
> connection that was opened on the other end mssql was not closed even though 
> the full import was completed .. need some help in troubleshooting why it’s 
> leaving connections open
> 
> 2. The way I have scheduled this import api call is like a util that will be 
> hitting DIH api every min with a solr pool url and with this it looks like 
> multiple calls are going from different solr nodes which I don’t want .. I 
> always need the call to be taken by only one node.. can we control this with 
> any config?? Or is this happening because I have three zoo’s?? Please suggest 
> the best approach 
> 
> 3. I do see some records are shown as failed while doing import, is there a 
> way to track these failures?? Like why a minimal no of records are failing??
> 
> 
> 
> Sent from my iPhone



Re: Slow Solr 8 response for long query

2020-09-30 Thread Erick Erickson
Increasing the number of rows should not have this kind of impact in either 
version of Solr, so I think there’s something fundamentally strange in your 
setup.

Whether returning 10 or 300 documents, every document has to be scored. There 
are two differences between 10 and 300 rows:

1> when returning 10 rows, Solr keeps a sorted list of 10 docs, just IDs and 
score (assuming you’re sorting by relevance), when returning 300 the list is 
300 long. I find it hard to believe that keeping a list 300 items long is 
making that much of a difference.

2> Solr needs to fetch/decompress/assemble 300 documents vs. 10 documents for 
the response. Regardless of the fields returned, the entire document will be 
decompressed if you return any fields that are not docValues=true. So it’s 
possible that what you’re seeing is related.

Try adding, as Alexandre suggests, &debug to the query. Pay particular 
attention to the “timings” section too, that’ll show you the time each 
component took _exclusive_ of step <2> above and should give a clue.
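
For example, something like this (URL and query are just illustrative) adds a
debug section with per-component prepare/process timings to the response:

   http://localhost:8983/solr/your_collection/select?q=your+query&rows=10&debug=timing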


All that said, fq clauses don’t score, so scoring is certainly involved in why 
the query takes so long to return even 10 rows but gets faster when you move 
the clause to a filter query, but my intuition is that there’s something else 
going on as well to account for the difference when you return 300 rows.

Best,
Erick

> On Sep 29, 2020, at 8:52 PM, Alexandre Rafalovitch  wrote:
> 
> What do the debug versions of the query show between two versions?
> 
> One thing that changed is sow (split on whitespace) parameter among
> many. It is unlikely to be the cause, but I am mentioning just in
> case.
> https://lucene.apache.org/solr/guide/8_6/the-standard-query-parser.html#standard-query-parser-parameters
> 
> Regards,
>   Alex
> 
> On Tue, 29 Sep 2020 at 20:47, Permakoff, Vadim
>  wrote:
>> 
>> Hi Solr Experts!
>> We are moving from Solr 6.5.1 to Solr 8.5.0 and having a problem with long 
>> query, which has a search text plus many OR and AND conditions (all in one 
>> place, the query is about 20KB long).
>> For the same set of data (about 500K docs) and the same schema the query in 
>> Solr 6 return results in less than 2 sec, Solr 8 takes more than 10 sec to 
>> get 10 results. If I increase the number of rows to 300, in Solr 6 it takes 
>> about 10 sec, in Solr 8 it takes more than 1 min. The results are small, 
>> just IDs. It looks like the relevancy scoring plays role, because if I move 
>> this query to filter query - both Solr versions work pretty fast.
>> The right way should be to change the query, but unfortunately it is 
>> difficult to modify the application which creates these queries, so I want 
>> to find some temporary workaround.
>> 
>> What was changed from Solr 6 to Solr 8 in terms of scoring with many 
>> conditions, which affects the search speed negatively?
>> Is there anything to configure in Solr 8 to get the same performance for 
>> such query like it was in Solr 6?
>> 
>> Thank you,
>> Vadim
>> 
>> 
>> 
>> This email is intended solely for the recipient. It may contain privileged, 
>> proprietary or confidential information or material. If you are not the 
>> intended recipient, please delete this email and any attachments and notify 
>> the sender of the error.



Re: How to Resolve : "The request took too long to iterate over doc values"?

2020-09-29 Thread Erick Erickson
Let’s see the query. My bet is that you are _searching_ against the field and 
have indexed=false.

Searching against a docValues=true indexed=false field results in the
equivalent of a “table scan” in the RDBMS world. You may use
the docValues efficiently for _function queries_ to mimic some
search behavior.
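
If you do need to query the field directly, it has to be indexed as well,
something like this (the field name and type here are placeholders, keep your
own):

   <field name="my_external_score" type="pfloat" indexed="true" stored="false"
          docValues="true" useDocValuesAsStored="false"/>

With docValues still enabled you keep cheap sorting/faceting/function-query use.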

Best,
Erick

> On Sep 29, 2020, at 6:59 AM, raj.yadav  wrote:
> 
> In our index, we have few fields defined as `ExternalFileField` field type.
> We decided to use docValues for such fields. Here is the field type
> definition
> 
> OLD => (ExternalFileField)
>  defVal="0.0" class="solr.ExternalFileField"/>
> 
> NEW => (docValues)
>  indexed="false" stored="false"  docValues="true"
> useDocValuesAsStored="false"/>
> 
> After this modification we started getting the following `timeout warning`
> messages:
> 
> ```The request took too long to iterate over doc values. Timeout: timeoutAt:
> 1626463774823735 (System.nanoTime(): 1626463774836490),​
> DocValues=org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$8@4efddff
> ```
> 
> Our system configuration:
> Each Solr Instance: 8 vcpus, 64 GiB memory
> JAVA Memory: 30GB
> Collection: 4 shards (each shard has approximately 12 million docs and index
> size of 12 GB) and each Solr instance has one replica of the shard. 
> 
> GC_TUNE="-XX:NewRatio=3 \
> -XX:SurvivorRatio=4 \
> -XX:PermSize=64m \
> -XX:MaxPermSize=64m \
> -XX:TargetSurvivorRatio=80 \
> -XX:MaxTenuringThreshold=9 \
> -XX:+UseConcMarkSweepGC \
> -XX:+UseParNewGC \
> -XX:+CMSClassUnloadingEnabled \
> -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
> -XX:+CMSScavengeBeforeRemark \
> -XX:PretenureSizeThreshold=64m \
> -XX:+UseCMSInitiatingOccupancyOnly \
> -XX:CMSInitiatingOccupancyFraction=50 \
> -XX:CMSMaxAbortablePrecleanTime=6000 \
> -XX:+CMSParallelRemarkEnabled \
> -XX:+ParallelRefProcEnabled"
> 
> 1. What this warning message means?
> 2. How to resolve it?
> 
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Solr storage of fields <-> indexed data

2020-09-28 Thread Erick Erickson
Fields are placed in the index totally separately from each
other, so it’s no wonder that removing
the copyField results in this kind of savings.

And they have to be separate. Consider what comes out of the end of the
analysis chain. The same input could produce totally different output. 
As a trivial example, imagine two fields:

whitespacetokenizer
lowercasefilter

whitespacetokenizer
lowercasefilter
edgengramfilterfactory

and identical input "fleas”. The output of the first would be “fleas”, and the
output of the second would be something like “f”, “fl”, “fle”, “flea”, “fleas”.
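
Spelled out as schema fieldTypes, those two chains would look something like
this (type names and gram sizes are made up for illustration):

   <fieldType name="text_plain" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

   <fieldType name="text_edge" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10"/>
     </analyzer>
   </fieldType>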

Trying to share the tokens between fields would be a nightmare.

And that’s only one of many ways the output of two different analysis
chains could be different…

Best,
Erick



> On Sep 28, 2020, at 10:56 AM, Edward Turner  wrote:
> 
> Hi all,
> 
> We have recently switched to using edismax + qf fields, and no longer use
> copyfields to allow us to easily search over values in multiple fields (by
> copying multiple fields' values to the copyfield destinations, and then
> performing queries over the destination field).
> 
> By removing the copyfields, we've found that our index sizes have reduced
> by ~40% in some cases, which is great! We're just curious now as to exactly
> how this can be ...
> 
> My question is, given the following two schemas, if we index some data to
> the "description" field, will the index for schema1 be twice as large as
> the index of schema2? (I guess this relates to how, internally, Solr stores
> field + index data)
> 
> Old way -- schema1:
> ===
>  multiValued="false"/>
>  multiValued="false" />
>  multiValued="false"/>
> 
> Many thanks and kind regards,
> 
> Edd



Re: SOLR Cursor Pagination Issue

2020-09-28 Thread Erick Erickson
I said nothing about docId changing. _any_ sort criteria changing is an issue. 
You’re sorting by score. Well, as you index documents, the new docs change the 
values used to calculate scores for _all_ documents, thus changing 
the sort order and potentially causing unexpected results when using 
cursorMark. That said, I don’t think you’re getting any different scores at all 
if you’re really searching for “(* AND *)”; try returning score in the fl list, 
are they different?

You still haven’t given an example of the results you’re seeing that are 
unexpected. And my assumption is that you are seeing odd results when you call 
this query again with a cursorMark returned by a previous call. Or are you 
saying that you don’t think facet.query is returning the correct count? Be 
aware that Solr doesn’t support true Boolean logic, see: 
https://lucidworks.com/post/why-not-and-or-and-not/

There’s special handling for the form "fq=NOT something” to change it to 
"fq=*:* NOT something” that’s not present in something like "q=NOT something”. 
How that plays in facet.query I’m not sure, but try “facet.query=*:* NOT 
something” if the facet count is what the problem is.

I have no idea what you’re trying to accomplish with (* AND *) unless those are 
just placeholders and you put real text in them. That’s rather odd. *:* is 
“select everything”...

BTW, returning 10,000 docs is somewhat of an anti-pattern, if you really 
require that many documents consider streaming.
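
A sketch of what that looks like with a streaming expression against the
/export handler (collection name is a placeholder; the returned and sorted
fields must have docValues=true):

   curl --data-urlencode 'expr=search(my_collection,
                                      q="*:*",
                                      fl="SERIES_ID",
                                      sort="SERIES_ID asc",
                                      qt="/export")' \
        http://localhost:8983/solr/my_collection/stream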

> On Sep 28, 2020, at 10:21 AM, vmakov...@xbsoftware.by wrote:
> 
> Hi, Erick
> 
> I have a python script that sends requests with CursorMark. This script 
> checks data against the following Expected series criteria:
> Collected series:
> Number of requests:
> Collected unique series:
> The request looks like this: 
> select?indent=off&defType=edismax&wt=json&facet.query={!key=NUM_DOCS}NOT 
> SERIES_ID:0&fq=NOT 
> SERIES_ID:0&spellcheck=true&spellcheck.collate=true&spellcheck.extendedResults=true&facet.limit=-1&q=(*
>  AND *)&qf=all_text_stemming all_text&fq=facet_db_code:( "CN" 
> )&fq=-SERIES_CODE:( "TEST" )&fl=SERIES_ID&sort=score desc,docId 
> asc&bq=SERIES_STATUS:T^5&bq=KEY_SERIES_FLAG:1^5&bq=accuracy_name:0&bq=SERIES_STATUS:C^-30&rows=1&cursorMark=*
> 
> DocId does not change during data update.During data updating process in 
> solrCloud skript returnd incorect Number of requests and Collected series.
> 
> Best,
> Vlad
> 
> 
> 
> Mon, 28 Sep 2020 08:54:57 -0400, Erick Erickson  
> писал(а):
> 
>> Define “incorrect” please. Also, showing the exact query you use would be 
>> helpful.
>> That said, indexing data at the same time you are using CursorMark is not 
>> guaranteed do find all documents. Consider a sort with date asc, id asc. 
>> doc53 has a date of 2001 and you’re already returned the doc.
>> Next, you update doc53 to 2020. It now appears sometime later in the results 
>> due to the changed data. Or the other way, doc53 starts with 2020, and while 
>> your cursormark label is in 2010, you change doc53 to have a date of 2001. 
>> It will never be returned.
>> Similarly for anything else you change that’s relevant to the sort criteria 
>> you’re using.
>> CursorMark doesn’t remember _documents_, just, well, call it the fingerprint 
>> (i.e. sort criteria values) of the last document returned so far.
>> Best,
>> Erick
>>> On Sep 28, 2020, at 3:32 AM, vmakov...@xbsoftware.by wrote:
>>> Good afternoon,
>>> Could you please suggest us a solution: during data updating process in 
>>> solrCloud, requests with cursor mark return incorrect data. I suppose that 
>>> the results do not follow each other during the indexation process, because 
>>> the data doesn't have enough time to be replicated between the nodes.
>>> Kind regards,
>>> Vladislav Makovski
> Vladislav Makovski
> Developer
> XB Software Ltd. | Minsk, Belarus
> Site: https://xbsoftware.com
> Skype: vlad__makovski
> Cell:  +37529 6484100



Re: Unable to upload updated solr config set

2020-09-28 Thread Erick Erickson
Until then, you can use

bin/solr zk upconfig….
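
For example (ZK host, config name and path are placeholders):

   bin/solr zk upconfig -z localhost:2181 -n sampleConfigSet -d /path/to/conf

That overwrites the configSet of the same name in ZooKeeper; RELOAD the
collections that use it afterwards so they pick up the changes.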

Best,
Erick

> On Sep 28, 2020, at 10:06 AM, Houston Putman  wrote:
> 
> Until the next Solr minor version is released you will not be able to
> overwrite an existing configSet with a new configSet of the same name.
> 
> The ticket for this feature is SOLR-10391
> , and it will be included
> in the 8.7.0 release.
> 
> Until then you will have to create a configSet with a new name, and then
> update your collections to point to that new configSet.
> 
> - Houston
> 
> On Sun, Sep 27, 2020 at 6:56 PM Deepu  wrote:
> 
>> Hi,
>> 
>> I was able to upload solr  configs using solr/admin/configs?action=UPLOAD
>> api but getting below error when reupload same config set with same.
>> 
>> {
>> 
>>  "responseHeader":{
>> 
>>"status":400,
>> 
>>"QTime":51},
>> 
>>  "error":{
>> 
>>"metadata":[
>> 
>>  "error-class","org.apache.solr.common.SolrException",
>> 
>>  "root-error-class","org.apache.solr.common.SolrException"],
>> 
>>"msg":"The configuration sampleConfigSet already exists in zookeeper",
>> 
>>"code":400}}
>> 
>> 
>> how we re upload same config with few schema & solr config changes ?
>> 
>> 
>> 
>> Thanks,
>> 
>> Deepu
>> 



Re: Corrupted records after successful commit

2020-09-28 Thread Erick Erickson
Is your “id” field your uniqueKey, and is it tokenized? It shouldn’t be; 
use something like “string” or keywordTokenizer. Definitely do NOT use, say, 
text_general. 
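
For reference, a typical setup looks like this (assuming your uniqueKey really
is “id”):

   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
   <uniqueKey>id</uniqueKey>

where “string” maps to solr.StrField, which is never tokenized.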

It’s very unlikely that records are not being flushed on commit, I’m 99.99% 
certain that’s a red herring and that this is a problem in your environment.

Or that some process you don’t know about is sending documents that don’t have 
the information you expect. The fact that you say you’ve disabled your update 
scripts but see this second record being indexed 3 minutes later is strong 
evidence that _someone_ is updating records, is there a cron job somewhere 
that’s sending docs? Other?? 

I bet that if you redefined your updateHandler to give it some name other than 
“/update” in solrconfig.xml two things would happen: 
1> this problem will go away
2> you’ll get some error report from somewhere telling you that Solr is broken 
because it isn’t accepting documents for update ;)

> On Sep 28, 2020, at 9:01 AM, Mr Havercamp  wrote:
> 
> Thanks Eric. My knowledge is fairly limited but 1) sounds feasible. Some
> logs:
> 
> I write a bunch of recods to Solr:
> 
> 2020-09-28 11:01:01.255 INFO  (qtp918312414-21) [   x:vnc]
> o.a.s.u.p.LogUpdateProcessorFactory [vnc]  webapp=/solr path=/update
> params={json.nl=flat&omitHeader=false&wt=json}{add=[
> talk.tq0rkem4pc.jaydeep.pan...@dev.vnc.de (1679075166122934272),
> talk.tq0rkem4pc.dmitry.zolotni...@dev.vnc.de (1679075166123982848),
> talk.tq0rkem4pc.hayden.yo...@dev.vnc.de (1679075166125031424),
> talk.tq0rkem4pc.nishant.j...@dev.vnc.de (1679075166125031425),
> talk.tq0rkem4pc.macanh@dev.vnc.de (167907516612608),
> talk.tq0rkem4pc.kapil.nadiyap...@dev.vnc.de (1679075166126080001),
> talk.tq0rkem4pc.sanjay.domad...@dev.vnc.de (1679075166126080002),
> talk.tq0rkem4pc.umesh.sarva...@dev.vnc.de (1679075166127128576)],commit=} 0
> 8
> 
> Selecting records looks good:
> 
>  {
>"talk_id_s":"tq0rkem4pc",
>"talk_internal_id_s":"29896",
>"from_s":"from address",
>"content_txt":["test_116"],
>"raw_txt":["http://www.w3.org/1999/xhtml\
> ">test_116"],
>"created_dt":"2020-09-28T11:00:02Z",
>"updated_dt":"2020-09-28T11:00:02Z",
>"type_s":"talk",
>"talk_type_s":"groupchat",
>"title_s":"role__change__1_talk@conference",
>"to_ss":["bunch", "of", "names"],
>"owner_s":"owner address",
>"id":"talk.tq0rkem4pc.email@address",
>"_version_":1679075166127128576}
> 
> Then, a few minutes later:
> 
> 2020-09-28 11:04:33.070 INFO  (qtp918312414-21) [   x:vnc]
> o.a.s.u.p.LogUpdateProcessorFactory [vnc]  webapp=/solr path=/update
> params={wt=json}{add=[talk.tq0rkem4pc.hayden.yo...@dev.vnc.de
> (1679075388234399744)]} 0 1
> 2020-09-28 11:04:33.150 INFO  (qtp918312414-21) [   x:vnc]
> o.a.s.u.p.LogUpdateProcessorFactory [vnc]  webapp=/solr path=/update
> params={wt=json}{add=[talk.tq0rkem4pc (1679075388318285824)]} 0 0
> 
> Checking the record again:
> 
> {
>"id":"talk.tq0rkem4pc.email@address",
>"_version_":1679075388234399744},
>  {
>"id":"talk.tq0rkem4pc",
>"_version_":1679075388318285824}
> 
> A couple of strange things here:
> 
> 1. my talk.tq0rkem4pc.email@address record no longer has any data in it.
> Just id and version.
> 
> 2. The second entry is really strange; this isn't a valid record at all and
> I don't have any record of creating it.
> 
> I've ruled out reindexing items both from my indexing script (I just don't
> run it) and an external code snippet updating the record at a later time.
> 
> Not sure if I've got the terminology right but would I be correct in
> assuming that it is possible records are not being flushed from the buffer
> when added? I'm assuming there is some kind of buffering or caching going
> on before records are commttted? Is it possible they are getting corrupted
> under higher than usual load?
> 
> 
> On Mon, 28 Sep 2020 at 20:41, Erick Erickson 
> wrote:
> 
>> There are several possibilities:
>> 
>> 1> you simply have some process incorrectly updating documents.
>> 
>> 2> you’ve changed your schema sometime without completely deleting your
>> old index and re-indexing all documents from scratch. I recommend in fact
>> indexing into a new collection an

Re: SOLR Cursor Pagination Issue

2020-09-28 Thread Erick Erickson
Define “incorrect” please. Also, showing the exact query you use would be 
helpful.

That said, indexing data at the same time you are using CursorMark is not 
guaranteed to find all documents. Consider a sort with date asc, id asc. doc53 
has a date of 2001 and you’ve already returned the doc.

Next, you update doc53 to 2020. It now appears sometime later in the results 
due to the changed data. 

Or the other way, doc53 starts with 2020, and while your cursormark label is in 
2010, you change doc53 to have a date of 2001. It will never be returned.

Similarly for anything else you change that’s relevant to the sort criteria 
you’re using.

CursorMark doesn’t remember _documents_, just, well, call it the fingerprint 
(i.e. sort criteria values) of the last document returned so far.
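
For what it’s worth, the usual SolrJ loop looks roughly like this (collection
name, query and page size are placeholders; “client” is whatever SolrClient you
already have):

   import org.apache.solr.client.solrj.SolrQuery;
   import org.apache.solr.client.solrj.response.QueryResponse;
   import org.apache.solr.common.params.CursorMarkParams;

   SolrQuery q = new SolrQuery("*:*");
   q.setRows(500);
   q.setSort(SolrQuery.SortClause.asc("id"));  // sort must include the uniqueKey as a tie-breaker
   String cursorMark = CursorMarkParams.CURSOR_MARK_START;
   boolean done = false;
   while (!done) {
     q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
     QueryResponse rsp = client.query("my_collection", q);
     // process rsp.getResults() here
     String next = rsp.getNextCursorMark();
     done = cursorMark.equals(next);  // unchanged mark means no more documents
     cursorMark = next;
   }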

Best,
Erick

> On Sep 28, 2020, at 3:32 AM, vmakov...@xbsoftware.by wrote:
> 
> Good afternoon,
> Could you please suggest us a solution: during data updating process in 
> solrCloud, requests with cursor mark return incorrect data. I suppose that 
> the results do not follow each other during the indexation process, because 
> the data doesn't have enough time to be replicated between the nodes.
> Kind regards,
> Vladislav Makovski
> 



Re: Corrupted records after successful commit

2020-09-28 Thread Erick Erickson
There are several possibilities:

1> you simply have some process incorrectly updating documents.

2> you’ve changed your schema sometime without completely deleting your old 
index and re-indexing all documents from scratch. I recommend in fact indexing 
into a new collection and using collection aliasing if you can’t delete and 
recreate the collection before re-indexing. There’s some support for this idea 
because you say that the doc in question not only changes one way, but then 
changes back mysteriously. So seg1 (old def) merges with seg2 (new def) into 
seg10 using the old def because merging saw seg1 first. Then sometime later 
seg3 (new def) merges with seg10 and the data mysteriously comes back because 
that merge uses seg3 (new def) as a template for how the index “should” look.

But I’ve never heard of Solr (well, Lucene actually) doing this by itself, and 
I have heard of the merging process doing “interesting” stuff with segments 
created with changed schema definitions.

Best,
Erick

> On Sep 28, 2020, at 8:26 AM, Mr Havercamp  wrote:
> 
> Hi,
> 
> We're seeing strange behaviour when records have been committed. It doesn't
> happen all the time but enough that the index is very inconsistent.
> 
> What happens:
> 
> 1. We commit a doc to Solr,
> 2. The doc shows in the search results,
> 3. Later (may be immediate, may take minutes, may take hours), the same
> document is emptied of all data except version and id.
> 
> We have custom scripts which add to the index but even without them being
> executed we see records being updated in this way.
> 
> For example committing:
> 
> { id: talk.1234, from: "me", to: "you", "content": "some content", title:
> "some title"}
> 
> will suddenly end up as after an initial successful search:
> 
> { id: talk.1234, version: 1234}
> 
> Not sure how to proceed on debugging this issue. It seems to settle in
> after Solr has been running for a while but can just as quickly rectify
> itself.
> 
> At a loss how to debug and proceed.
> 
> Any help much appreciated.



Re: Solr 8.6.2 text_general

2020-09-25 Thread Erick Erickson
Uhhh, this is really dangerous. If you’ve indexed documents 
since upgrading, some were indexed with multiValued=false. Now
you’ve changed the definition at a fundamental Lucene level and
Things Can Go Wrong. 

You’re OK if (and only if) you have indexed _no_ documents since
you upgraded.

But even in that case, there may be other fields with different
multiValued values. And if you sort, group, or facet on them
when some have been indexed one way and some others, you’ll
get errors.

I strongly urge you to re-index all your data into a new collection
and, perhaps, use collection aliasing to seamlessly switch.
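
The switch itself is then a single Collections API call once the new collection
is built and verified (names below are placeholders):

   http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=mydata&collections=mydata_v2

and clients keep querying “mydata” without knowing the underlying collection
changed.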

Best,
Erick

> On Sep 25, 2020, at 8:50 AM, Anuj Bhargava  wrote:
> 
> It worked. I just added multiValued="true".
> 
>multiValued="true"/>
>   
>multiValued="true"/>
>   
> 
> Thanks for all your help.
> 
> Regards,
> 
> Anuj
> 
> On Fri, 25 Sep 2020 at 18:08, Alexandre Rafalovitch 
> wrote:
> 
>> Ok, something is definitely not right. In those cases, I suggest
>> checking backwards from hard reality. Just in case the file you are
>> looking at is NOT the one that is actually used when collection is
>> actually setup. Happened to me more times than I can count.
>> 
>> Point your Admin UI to the collection you are having issues and check
>> the schema definitions there (either in Files or even in Schema
>> screen). I still think your multiValued definition changed somewhere.
>> 
>> Regards,
>>  Alex.
>> 
>> On Fri, 25 Sep 2020 at 03:57, Anuj Bhargava  wrote:
>>> 
>>> Schema on both are the same
>>> 
>>>   > stored="true"/>
>>>   
>>>   > stored="true"/>
>>>   
>>> 
>>> Regards,
>>> 
>>> Anuj
>>> 
>>> On Thu, 24 Sep 2020 at 18:58, Alexandre Rafalovitch 
>>> wrote:
>>> 
 These are field definitions for _text_ and text, your original
 question was about the fields named "country"/"currency" and whatever
 type they mapped to.
 
 Your text/_text_ field is not actually returned to the browser,
 because it is "stored=false", so it is most likely a catch-all
 copyField destination. You may be searching against it, but you are
 returning other (original) fields.
 
 Regards,
   Alex.
 
 On Thu, 24 Sep 2020 at 09:23, Anuj Bhargava 
>> wrote:
> 
> In both it is the same
> 
> In Solr 8.0.0
> >>> indexed="true"
> stored="false"/>
> 
> In Solr 8.6.2
>  multiValued="true"/>
> 
> On Thu, 24 Sep 2020 at 18:33, Alexandre Rafalovitch <
>> arafa...@gmail.com>
> wrote:
> 
>> I think that means your field went from multiValued to
>> singleValued.
>> Double check your schema. Remember that multiValued flag can be set
>> both on the field itself and on its fieldType.
>> 
>> Regards,
>>   Alex
>> P.s. However if your field is supposed to be single-valued, maybe
>> you
>> should treat it as a feature not a bug. Multivalued fields have
>> some
>> restrictions that single-valued fields do not have (around sorting
>> for
>> example).
>> 
>> On Thu, 24 Sep 2020 at 03:09, Anuj Bhargava 
 wrote:
>>> 
>>> In solr 8.0.0 when running the query the data
>> (type="text_general")
 was
>>> shown in brackets *[ ]*
>>> "country":*[*"IN"*]*,
>>> "currency":*[*"INR"*]*,
>>> "date_c":"2020-08-23T18:30:00Z",
>>> "est_cost":0,
>>> 
>>> However, in solr 8.6.2 the query the data (type="text_general")
>> is
 not
>>> showing in brackets [ ]
>>> "country":"IN",
>>> "currency":"INR",
>>> "date_c":"2020-08-23T18:30:00Z",
>>> "est_cost":0,
>>> 
>>> 
>>> How to get the query results to show brackets in Solr 8.6.2
>> 
 
>> 



Re: TimeAllowed and Partial Results

2020-09-22 Thread Erick Erickson
TimeAllowed stops looking when the timer expires. If it hasn’t found any docs 
with a
non-zero score by then, you’ll get zero hits.

It has to be this way, because Solr doesn’t know whether a doc is a hit until 
Solr scores it.

So this is normal behavior, assuming that some part of the processing takes 
more than 2
seconds before it finds the first non-zero scoring document.

So what I’d recommend is that you build up your query gradually and find out 
what’s
taking the time. Something like
q=clause1
q=clause1 clause2

etc, all with timeAllowed at 2 seconds. Eventually you’ll find a clause that 
exceeds your timeout, then
you can address the root cause.

Best,
Erick

> On Sep 22, 2020, at 10:29 AM, Jae Joo  wrote:
> 
> I have timeAllowed=2000 (2sec) and having mostly 0 hits coming out. Should
> I have more than 0 results?
> 
> Jae



Re: Many small instances, or few large instances?

2020-09-21 Thread Erick Erickson
In a word, yes. G1GC still has spikes, and the larger the heap the more likely 
you’ll be to encounter them. So having multiple JVMS rather than one large JVM 
with a ginormous heap is still recommended.

I’ve seen some cases that used the Zing zero-pause product with very large 
heaps, but they were forced into that by the project requirements.

That said, when Java has a ZCG option, I think we’re in uncharted territory. I 
frankly don’t know what using very large heaps without having to worry about GC 
pauses will mean for Solr. I suspect we’ll have to do something to take 
advantage of that. For instance, could we support a topology where all shards 
had at least one replica in the same JVM that didn’t make any HTTP requests? 
Would that topology be common enough to support? Maybe extend “rack aware” to 
be “JVM aware”? Etc.

One thing that does worry me is that it’ll be easier and easier to “just throw 
more memory at it” rather than examine whether you’re choosing options that 
minimize heap requirements. And Lucene has done a lot to move memory to the OS 
rather than heap (e.g. docValues, MMapDirectory etc.).

Anyway, carry on as before for the nonce.

Best,
Erick

> On Sep 21, 2020, at 6:06 AM, Bram Van Dam  wrote:
> 
> Hey folks,
> 
> I've always heard that it's preferred to have a SolrCloud setup with
> many smaller instances under the CompressedOops limit in terms of
> memory, instead of having larger instances with, say, 256GB worth of
> heap space.
> 
> Does this recommendation still hold true with newer garbage collectors?
> G1 is pretty fast on large heaps. ZGC and Shenandoah promise even more
> improvements.
> 
> Thx,
> 
> - Bram



Re: Pining Solr

2020-09-18 Thread Erick Erickson
Well, this doesn’t look right at all:

/solr/cpsearch/select_cpsearch/select

It should just be:
/solr/cpsearch/select_cpsearch

Best,
Erick

> On Sep 18, 2020, at 3:18 PM, Steven White  wrote:
> 
> /solr/cpsearch/select_cpsearch/select



Re: Pining Solr

2020-09-18 Thread Erick Erickson
This looks kind of confused. I’m assuming what you’re after is a way to get
to your select_cpsearch request handler to test whether Solr is alive, and
you’re calling that a “ping”.

The ping request handler is just that, a separate request handler that you hit 
by going to 
http://server:port/solr/admin/ping. 

It has nothing to do at all with your custom search handler and in recent 
versions of
Solr is implicitly defined so it should just be there.

Your custom handler is defined as 




Re: Handling failure when adding docs to Solr using SolrJ

2020-09-17 Thread Erick Erickson
I recommend _against_ issuing explicit commits from the client, let
your solrconfig.xml autocommit settings take care of it. Make sure
either your soft or hard commits open a new searcher for the docs
to be searchable.

I’ll bend a little bit if you can _guarantee_ that you only ever have one
indexing client running and basically only ever issue the commit at the
end.

There’s another strategy, do the solrClient.add() command with the
commitWithin parameter.

As far as failures, look at 
https://lucene.apache.org/solr/7_3_0/solr-core/org/apache/solr/update/processor/TolerantUpdateProcessor.html
that’ll give you a better clue about _which_ docs failed. From there, though,
it’s a bit of debugging to figure out why that particular doc failed, usually 
people
record the docs that failed for later analysis. and/or look at the Solr logs 
which
usually give a more detailed reason of _why_ a document failed...

Best,
Erick

> On Sep 17, 2020, at 1:09 PM, Steven White  wrote:
> 
> Hi everyone,
> 
> I'm trying to figure out when and how I should handle failures that may
> occur during indexing.  In the sample code below, look at my comment and
> let me know what state my index is in when things fail:
> 
>   SolrClient solrClient = new HttpSolrClient.Builder(url).build();
> 
>   solrClient.add(solrDocs);
> 
>   // #1: What to do if add() fails?  And how do I know if all or some of
> my docs in 'solrDocs' made it to the index or not ('solrDocs' is a list of
> 1 or more doc), should I retry add() again?  Retry with a smaller chunk?
> Etc.
> 
>   if (doCommit == true)
>   {
>  solrClient.commit();
> 
>   // #2: What to do if commit() fails?  Re-issue commit() again?
>   }
> 
> Thanks
> 
> Steven



Re: Doing what <copyField> does using SolrJ API

2020-09-17 Thread Erick Erickson
The script can actually be written in any number of scripting languages, 
python, groovy,
javascript etc. but Alexandre’s comments about javascript are well taken.

It all depends here on whether you ever want to search the fields 
individually. If you do,
you need to have them in your index as well as the copyField.
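
If you do end up doing the copy client-side instead, a minimal sketch (the
catch-all field name is made up and assumed to be multiValued) is simply to
append each source value to that field while building the document:

import java.util.Map;
import org.apache.solr.common.SolrInputDocument;

public class ClientSideCopyField {
  // Mirrors every source value into a catch-all field, which is roughly
  // what <copyField> would do on the server side.
  static SolrInputDocument build(Map<String, Object> dbRow) {
    SolrInputDocument doc = new SolrInputDocument();
    for (Map.Entry<String, Object> e : dbRow.entrySet()) {
      doc.addField(e.getKey(), e.getValue());
      doc.addField("catch_all", e.getValue()); // assumed multiValued text field
    }
    return doc;
  }
}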

> On Sep 17, 2020, at 1:37 PM, Walter Underwood  wrote:
> 
> If you want to ignore a field being sent to Solr, you can set indexed=false 
> and 
> stored=false for that field in schema.xml. It will take up room in schema.xml 
> but
> zero room on disk.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Sep 17, 2020, at 10:23 AM, Alexandre Rafalovitch  
>> wrote:
>> 
>> Solr has a whole pipeline that you can run during document ingesting before
>> the actual indexing happens. It is called Update Request Processor (URP)
>> and is defined in solrconfig.xml or in an override file. Obviously, since
>> you are indexing from SolrJ client, you have even more flexibility, but it
>> is good to know about anyway.
>> 
>> You can read all about it at:
>> https://lucene.apache.org/solr/guide/8_6/update-request-processors.html and
>> see the extensive list of processors you can leverage. The specific
>> mentioned one is this one:
>> https://lucene.apache.org/solr/8_6_0//solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
>> 
>> Just a word of warning that Stateless URP is using Javascript, which is
>> getting a bit of a complicated story as underlying JVM is upgraded (Oracle
>> dropped their javascript engine in JDK 14). So if one of the simpler URPs
>> will do the job or a chain of them, that may be a better path to take.
>> 
>> Regards,
>>  Alex.
>> 
>> 
>> On Thu, 17 Sep 2020 at 13:13, Steven White  wrote:
>> 
>>> Thanks Erick.  Where can I learn more about "stateless script update
>>> processor factory".  I don't know what you mean by this.
>>> 
>>> Steven
>>> 
>>> On Thu, Sep 17, 2020 at 1:08 PM Erick Erickson 
>>> wrote:
>>> 
>>>> 1000 fields is fine, you'll waste some cycles on bookkeeping, but I
>>> really
>>>> doubt you'll notice. That said, are these fields used for searching?
>>>> Because you do have control over what goes into the index if you can put
>>> a
>>>> "stateless script update processor factory" in your update chain. There
>>> you
>>>> can do whatever you want, including combine all the fields into one and
>>>> delete the original fields. There's no point in having your index
>>> cluttered
>>>> with unused fields, OTOH, it may not be worth the effort just to satisfy
>>> my
>>>> sense of aesthetics 😉
>>>> 
>>>> On Thu, Sep 17, 2020, 12:59 Steven White  wrote:
>>>> 
>>>>> Hi Eric,
>>>>> 
>>>>> Yes, this is coming from a DB.  Unfortunately I have no control over
>>> the
>>>>> list of fields.  Out of the 1000 fields that there maybe, no document,
>>>> that
>>>>> gets indexed into Solr will use more then about 50 and since i'm
>>> copying
>>>>> the values of those fields to the catch-all field and the catch-all
>>> field
>>>>> is my default search field, I don't expect any problem for having 1000
>>>>> fields in Solr's schema, or should I?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Steven
>>>>> 
>>>>> 
>>>>> On Thu, Sep 17, 2020 at 8:23 AM Erick Erickson <
>>> erickerick...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> “there over 1000 of them[fields]”
>>>>>> 
>>>>>> This is often a red flag in my experience. Solr will handle that many
>>>>>> fields, I’ve seen many more. But this is often a result of
>>>>>> “database thinking”, i.e. your mental model of how all this data
>>>>>> is from a DB perspective rather than a search perspective.
>>>>>> 
>>>>>> It’s unwieldy to have that many fields. Obviously I don’t know the
>>>>>> particulars of
>>>>>> your app, and maybe that’s the best design. Particularly if many of
>>> the
>>>>>> fields
>>>>>> are sparsely populated, i.e. only a small percentage of the do

Re: Doing what <copyField> does using SolrJ API

2020-09-17 Thread Erick Erickson
1000 fields is fine, you'll waste some cycles on bookkeeping, but I really
doubt you'll notice. That said, are these fields used for searching?
Because you do have control over what goes into the index if you can put a
"stateless script update processor factory" in your update chain. There you
can do whatever you want, including combine all the fields into one and
delete the original fields. There's no point in having your index cluttered
with unused fields, OTOH, it may not be worth the effort just to satisfy my
sense of aesthetics 😉

On Thu, Sep 17, 2020, 12:59 Steven White  wrote:

> Hi Eric,
>
> Yes, this is coming from a DB.  Unfortunately I have no control over the
> list of fields.  Out of the 1000 fields that there maybe, no document, that
> gets indexed into Solr will use more then about 50 and since i'm copying
> the values of those fields to the catch-all field and the catch-all field
> is my default search field, I don't expect any problem for having 1000
> fields in Solr's schema, or should I?
>
> Thanks
>
> Steven
>
>
> On Thu, Sep 17, 2020 at 8:23 AM Erick Erickson 
> wrote:
>
> > “there over 1000 of them[fields]”
> >
> > This is often a red flag in my experience. Solr will handle that many
> > fields, I’ve seen many more. But this is often a result of
> > “database thinking”, i.e. your mental model of how all this data
> > is from a DB perspective rather than a search perspective.
> >
> > It’s unwieldy to have that many fields. Obviously I don’t know the
> > particulars of
> > your app, and maybe that’s the best design. Particularly if many of the
> > fields
> > are sparsely populated, i.e. only a small percentage of the documents in
> > your
> > corpus have any value for that field then taking a step back and looking
> > at the design might save you some grief down the line.
> >
> > For instance, I’ve seen designs where instead of
> > field1:some_value
> > field2:other_value….
> >
> > you use a single field with _tokens_ like:
> > field:field1_some_value
> > field:field2_other_value
> >
> > that drops the complexity and increases performance.
> >
> > Anyway, just a thought you might want to consider.
> >
> > Best,
> > Erick
> >
> > > On Sep 16, 2020, at 9:31 PM, Steven White 
> wrote:
> > >
> > > Hi everyone,
> > >
> > > I figured it out.  It is as simple as creating a List and using
> > > that as the value part for SolrInputDocument.addField() API.
> > >
> > > Thanks,
> > >
> > > Steven
> > >
> > >
> > > On Wed, Sep 16, 2020 at 9:13 PM Steven White 
> > wrote:
> > >
> > >> Hi everyone,
> > >>
> > >> I want to avoid creating a <copyField dest="…" source="OneFieldOfMany"/> in my schema (there will be over 1000 of
> them
> > and
> > >> maybe more so managing it will be a pain).  Instead, I want to use
> SolrJ
> > >> API to do what <copyField> does.  Any example of how I can do this?
> If
> > >> there is an example online, that would be great.
> > >>
> > >> Thanks in advance.
> > >>
> > >> Steven
> > >>
> >
> >
>


Re: Unable to create core Solr 8.6.2

2020-09-17 Thread Erick Erickson
Look in your solr log, there’s usually a more detailed message

> On Sep 17, 2020, at 9:35 AM, Anuj Bhargava  wrote:
> 
> Getting the following error message while trying to create core
> 
> # sudo su - solr -c "/opt/solr/bin/solr create_core -c 9lives"
> WARNING: Using _default configset with data driven schema functionality.
> NOT RECOMMENDED for production use.
> To turn off: bin/solr config -c 9lives -p 8984 -action
> set-user-property -property update.autoCreateFields -value false
> 
> ERROR: Parse error : 
> 
> 
> Error 401 Unauthorized
> 
> HTTP ERROR 401
> Problem accessing /solr/admin/info/system. Reason:
> Unauthorized
> 
> 



Re: Doing what <copyField> does using SolrJ API

2020-09-17 Thread Erick Erickson
“there over 1000 of them[fields]”

This is often a red flag in my experience. Solr will handle that many 
fields, I’ve seen many more. But this is often a result of 
“database thinking”, i.e. your mental model of how all this data
is from a DB perspective rather than a search perspective.

It’s unwieldy to have that many fields. Obviously I don’t know the particulars 
of
your app, and maybe that’s the best design. Particularly if many of the fields
are sparsely populated, i.e. only a small percentage of the documents in your
corpus have any value for that field then taking a step back and looking
at the design might save you some grief down the line.

For instance, I’ve seen designs where instead of
field1:some_value
field2:other_value….

you use a single field with _tokens_ like:
field:field1_some_value
field:field2_other_value

that drops the complexity and increases performance.
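
As a rough sketch of that token style (field and token names are invented),
the indexing and filtering side would look like:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrInputDocument;

public class TokenFieldSketch {
  public static void main(String[] args) {
    // One multiValued "props" field instead of many sparse fields.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("props", "field1_some_value");
    doc.addField("props", "field2_other_value");

    // Filter on the combined tokens at query time.
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("props:field1_some_value");
  }
}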

Anyway, just a thought you might want to consider.

Best,
Erick

> On Sep 16, 2020, at 9:31 PM, Steven White  wrote:
> 
> Hi everyone,
> 
> I figured it out.  It is as simple as creating a List and using
> that as the value part for SolrInputDocument.addField() API.
> 
> Thanks,
> 
> Steven
> 
> 
> On Wed, Sep 16, 2020 at 9:13 PM Steven White  wrote:
> 
>> Hi everyone,
>> 
>> I want to avoid creating a <copyField dest="…" source="OneFieldOfMany"/> in my schema (there will be over 1000 of them and
>> maybe more so managing it will be a pain).  Instead, I want to use SolrJ
>> API to do what <copyField> does.  Any example of how I can do this?  If
>> there is an example online, that would be great.
>> 
>> Thanks in advance.
>> 
>> Steven
>> 



Re: join query limitations

2020-09-14 Thread Erick Erickson
What version of Solr are you using? ‘cause 8x has this definition for _version_


 

and I find no text like you’re seeing in any schema file in 8x….

So with a prior version, “try it and see”? See: 
https://issues.apache.org/jira/browse/SOLR-9449 and linked JIRAs,
the _version_ can be indexed=“false” since 6.3 at least if it’s 
docValues=“true". It’s not clear to me that it needed
to be indexed=“true” even before that, but no guarantees.

updateLog will be defined in solrconfig.xml, but unless you’re on a very old 
version of Solr it doesn’t matter 
‘cause you don’t need to have indexed=“true”. Updatelog is not necessary if 
you’re not running SolrCloud...

I strongly urge you to completely remove all your indexes (perhaps create a new 
collection) and re-index
from scratch if you change the definition. You might be able to get away with 
deleting all the docs then
re-indexing, but just re-indexing all the docs without starting fresh can have 
“interesting” results.
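
If you’re on SolrCloud and go the new-collection route, that side of it can be
scripted with the collection admin API (the ZooKeeper address, collection name,
configset and sizing below are all placeholders):

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class RecreateCollection {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
      // Create the fresh collection, then re-run the full indexing job against
      // it and switch the application (or an alias) over when it's done.
      CollectionAdminRequest.createCollection("mycollection_v2", "myconfig", 2, 1)
          .process(client);
    }
  }
}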

Best,
Erick

> On Sep 14, 2020, at 5:16 PM, matthew sporleder  wrote:
> 
> Yes but "the _version_ field is also a non-indexed, non-stored single
> valued docValues field;"  <- is that a problem?
> 
> My schema has this:
>  
>  
> 
> I don't know if I use the updateLog or not.  How can I find out?
> 
> I think that would work for me as I could just make a dynamic fild like:
> <dynamicField name="…" type="…" stored="false" multiValued="false"
> required="false" docValues="true" />
> 
> ---
> Yes it is just for functions, sorting, and boosting
> 
> On Mon, Sep 14, 2020 at 4:51 PM Erick Erickson  
> wrote:
>> 
>> Have you seen “In-place updates”?
>> 
>> See:
>> https://lucene.apache.org/solr/guide/8_1/updating-parts-of-documents.html
>> 
>> Then use the field as part of a function query. Since it’s non-indexed, you
>> won’t be searching on it. That said, you can do a lot with function queries
>> to satisfy use-cases.
>> 
>> Best.
>> Erick
>> 
>>> On Sep 14, 2020, at 3:12 PM, matthew sporleder  wrote:
>>> 
>>> I have hit a bit of a cross-road with our usage of solr where I want
>>> to include some slightly dynamic data.
>>> 
>>> I want to ask solr to find things like "text query" but only if they
>>> meet some specific criteria.  When I have all of those criteria
>>> indexed, everything works great.  (text contains "apples", in_season=1
>>> ,sort by latest)
>>> 
>>> Now I would like to add a criteria which changes every day -
>>> popularity of a document, specifically.  This appeared to be *the*
>>> canonical use case for external field files but I have 50M documents
>>> (and growing) so a *text* file doesn't fit the bill.
>>> 
>>> I also looked at using a !join but the limitations of !join, as I
>>> understand them, appear to mean I can't use it for my use case? aka I
>>> can't actually use the data from my traffic-stats core to sort/filter
>>> "text contains" "apples", in_season=1, sort by most traffic, sort by
>>> latest
>>> 
>>> The last option appears to be updating all of my documents every
>>> single day, possibly using atomic/partial updates, but even those have
>>> a growing list of gotchas: losing stored=false documents is a big one,
>>> caveats I don't quite understand related to copyFields, changes to the
>>> _version_ field (the _version_ field is also a non-indexed, non-stored
>>> single valued docValues field;), etc
>>> 
>>> Where else can I look?  The last time we attempted something like this
>>> we ended up rebuilding the index from scratch each day and shuffling
>>> it out, which was really pretty nasty.
>>> 
>>> Thanks,
>>> Matt
>> 



Re: join query limitations

2020-09-14 Thread Erick Erickson
Have you seen “In-place updates”?

See: 
https://lucene.apache.org/solr/guide/8_1/updating-parts-of-documents.html

Then use the field as part of a function query. Since it’s non-indexed, you
won’t be searching on it. That said, you can do a lot with function queries
to satisfy use-cases.
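
A sketch of what that looks like from SolrJ (the id, field names and values are
invented; for the in-place path the field has to be single-valued, non-indexed,
non-stored, docValues="true"):

import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class InPlacePopularityUpdate {
  public static void main(String[] args) throws Exception {
    try (SolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      // Daily in-place update of the docValues-only popularity field.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      doc.addField("popularity_d", Collections.singletonMap("set", 42.0));
      client.add(doc, 10_000); // commitWithin 10s

      // Later: use the field for sorting/boosting without it being indexed.
      SolrQuery q = new SolrQuery("text:apples");
      q.addFilterQuery("in_season:1");
      q.addSort("popularity_d", SolrQuery.ORDER.desc);
      client.query(q);
    }
  }
}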

Best.
Erick

> On Sep 14, 2020, at 3:12 PM, matthew sporleder  wrote:
> 
> I have hit a bit of a cross-road with our usage of solr where I want
> to include some slightly dynamic data.
> 
> I want to ask solr to find things like "text query" but only if they
> meet some specific criteria.  When I have all of those criteria
> indexed, everything works great.  (text contains "apples", in_season=1
> ,sort by latest)
> 
> Now I would like to add a criteria which changes every day -
> popularity of a document, specifically.  This appeared to be *the*
> canonical use case for external field files but I have 50M documents
> (and growing) so a *text* file doesn't fit the bill.
> 
> I also looked at using a !join but the limitations of !join, as I
> understand them, appear to mean I can't use it for my use case? aka I
> can't actually use the data from my traffic-stats core to sort/filter
> "text contains" "apples", in_season=1, sort by most traffic, sort by
> latest
> 
> The last option appears to be updating all of my documents every
> single day, possibly using atomic/partial updates, but even those have
> a growing list of gotchas: losing stored=false documents is a big one,
> caveats I don't quite understand related to copyFields, changes to the
> _version_ field (the _version_ field is also a non-indexed, non-stored
> single valued docValues field;), etc
> 
> Where else can I look?  The last time we attempted something like this
> we ended up rebuilding the index from scratch each day and shuffling
> it out, which was really pretty nasty.
> 
> Thanks,
> Matt


