Question about autoAddReplicas

2017-03-30 Thread Tseng, Danny
Hi, I created a collection with 2 shards and a replication factor of 1, and enabled autoAddReplicas. Then I killed shard2 with 'kill -9'. The overseer asked the other Solr node to create a new Solr core and point it to the dataDir of shard2. Unfortunately, the new core failed to come up because of

Re: format data at source or format data during indexing?

2017-03-30 Thread Derek Poh
Hi Alex, the business use case for the field is: - exact match - singular-plural stemming on each term in the field. E.g. a search for "dvd cases" must match "dvd case" and "dvds case". This is the current field type, and it satisfies the business use case. The one drawback of this is I need to add those

Re: Maintaining variable values between transition states

2017-03-30 Thread Shawn Heisey
On 3/30/2017 12:34 PM, Shashank Pedamallu wrote: > I have some configuration variables that I need to hold in Solr as it > switches between transient states on a transient core. What is the > best way to do this? These variables can change value during a running > environment. So, I need to have

Re: Maintaining variable values between transition states

2017-03-30 Thread Shashank Pedamallu
Hi Erick, Thanks for your response. So, by the way you say it, I understand that there is no way to persist variables between transient states. I just looked at ReplicationHandler class and it has the api to enable or disable replication on a core which is stored as an AtomicBoolean (defaulted

RE: Indexing speed reduced significantly with OCR

2017-03-30 Thread Phil Scadden
Yes, that would seem an accurate assessment of the problem. -Original Message- From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com] Sent: Thursday, 30 March 2017 4:53 p.m. To: solr-user@lucene.apache.org Subject: Re: Indexing speed reduced significantly with OCR Thanks for your reply.

Re: Maintaining variable values between transition states

2017-03-30 Thread Erick Erickson
Short form: There's no easy way to do that ATM. The whole synchronization process when working with transient cores (i.e. synchronizing on some of the internal structures) is pretty hairy and would require you to fork a version of Solr to change. Much of this is being worked out in SOLR-8906 where you

Maintaining variable values between transition states

2017-03-30 Thread Shashank Pedamallu
Hi All, I have some configuration variables that I need to hold in Solr as it switches between transient states on a transient core. What is the best way to do this? These variables can change value during a running environment. So, I need to have read and write access to the persistent store.

Re: Avoiding Transient state during a long running background process

2017-03-30 Thread Erick Erickson
Right, you're artificially forcing this loading/unloading, which is a good thing to stress! Every time the code accesses a core, the last access time should be updated. So somehow either you're accessing more than two cores in a round-robin fashion or you're somehow having other requests come in

Re: slow indexing when keys are verious

2017-03-30 Thread moscovig
With high entropy we see the same latency even when working with 1 shard. Assuming that even with 1 shard, Solr is still working hard to route the documents, what is the component that is responsible for the document routing? Is it the zookeeper? And how would you verify that that's the

Re: Avoiding Transient state during a long running background process

2017-03-30 Thread Shashank Pedamallu
Hi, Thanks Erick and Shawn for so much info! 1) Yes, I have deliberately set transientCacheSize to a very small value (2) on my dev laptop to test how Solr handles the switch and how it affects my backup process. 2) Yes, I'm not using Solr Cloud. Each individual Solr core is independent and we are

Re: slow indexing when keys are verious

2017-03-30 Thread Alexandre Rafalovitch
Did you check the number of documents that end up on each shard in these two scenarios? My guess would be that - perhaps - low entropy key puts most of the documents into one shard and high-entropy key causes a lot more routing traffic with delay coming from the network communication and/or

Re: slow indexing when keys are verious

2017-03-30 Thread moscovig
Hi, yes, it is SolrCloud; we saw the same behavior with 1, 2 and 4 shards. Each shard has 3 replicas. Each bulk contains 300 docs. We get approximately 800 docs inserted per second. ~6000 docs are being sent in an iteration by all loading threads. We have 20 threads, each sending bulks of 300

Re: SOLR scalability porblem

2017-03-30 Thread Erick Erickson
I'm inferring that at the end of the day, all your docs fit in a single index, correct? SolrCloud won't be a magic bullet, and I'd strongly advise if you _do_ go to SolrCloud to use SolrJ or similar to feed docs as DIH runs on a single server. However, all that aside if I can restate your

Re: Avoiding Transient state during a long running background process

2017-03-30 Thread Erick Erickson
bq: I thought that LotsOfCores didn't coexist with Cloud very well. It doesn't, you're right, I got off on a tangent there. The OP mentioned "Cloud" and my brain cross-wired. On Thu, Mar 30, 2017 at 6:32 AM, Shawn Heisey wrote: > On 3/29/2017 8:09 PM, Erick Erickson wrote:

Re: slow indexing when keys are verious

2017-03-30 Thread Alexandre Rafalovitch
Are you by any chance in the SolrCloud? And to confirm, the total number of documents is the same within any particular time period? Regards, Alex. http://www.solr-start.com/ - Resources for Solr users, new and experienced On 30 March 2017 at 10:50, moscovig wrote:

Re: Indexing speed reduced significantly with OCR

2017-03-30 Thread Walter Underwood
As I said before, this is a great application for pay-as-needed cloud servers. Netflix’s first use of Amazon EC2 was encoding movies for different screen sizes, data rates, codecs, and DRM. They would fire up a hundred or a thousand instances, feed movies to them, pick up the encodes, then

Re: Is there a way to retrieve the a term's position/offset in Solr

2017-03-30 Thread Bjarke Buur Mortensen
OK, that complicates things a bit. I would still try to go for a solution where you store the rich text in Solr, but make sure you tokenize it correctly. If the format is relatively simple, you could use either a regexp pattern tokenizer

Re: slow indexing when keys are verious

2017-03-30 Thread moscovig
Thanks Shawn. We do specify 3 30 false but I guess that still, the commitWithin 300 ms is a bad idea. We will definitely try playing with the configs you suggested. I still don't get the reason for a fast inserting when

Re: slow indexing when keys are verious

2017-03-30 Thread Shawn Heisey
On 3/30/2017 7:36 AM, moscovig wrote: > We are using solr 6.2.1 for server and solrj 6.2.0, with no explicit commits, > and - > > 3 > 30 > for autoCommit. > > Each request to solr contains 300 small documents with different keys., with > a commitWithin of 300 ms. I think the

slow indexing when keys are verious

2017-03-30 Thread moscovig
Hi, we are using solr 6.2.1 for server and solrj 6.2.0, with no explicit commits, and - 3 30 for autoCommit. Each request to solr contains 300 small documents with different keys, with a commitWithin of 300 ms. We have lots of requests coming in. The behavior is as follows:

Re: Avoiding Transient state during a long running background process

2017-03-30 Thread Shawn Heisey
On 3/29/2017 8:09 PM, Erick Erickson wrote: > bq: My guess is that it is decided by the load time, because this is > the option that would have the best performance. > > Not at all. The theory here is that this is to support the pattern > where some transient cores are used all the time and some

SOLR scalability porblem

2017-03-30 Thread santosh sidnal
Hi All, I have a problem with scalability on my project. We are running close to 100 cores, each holding ~25000 documents, with the index files totaling 7.5 GB. Also, we have the staging server where we build index files using data importer and using replication

RE: Indexing speed reduced significantly with OCR

2017-03-30 Thread Allison, Timothy B.
> Note that the OCRing is a separate task from Solr indexing, and is best done > on separate machines. +1 -Original Message- From: Rick Leir [mailto:rl...@leirtech.com] Sent: Thursday, March 30, 2017 7:37 AM To: solr-user@lucene.apache.org Subject: Re: Indexing speed reduced

Re: Indexing speed reduced significantly with OCR

2017-03-30 Thread Rick Leir
The workflow is: - OCR new documents - check quality and tune until you get good output text - keep the output text in the file system - index and re-index to Solr as necessary from the file system. Note that the OCRing is a separate task from Solr indexing, and is best done on separate

Re: format data at source or format data during indexing?

2017-03-30 Thread Alexandre Rafalovitch
What's your actual business use case? On 30 Mar 2017 1:53 AM, "Derek Poh" wrote: > Hi Erick > > So I could also not use the query analyzer stage to append the code to the > search keyword? > Have the front-end application append the code for every query it issue >

Re: Is there a way to retrieve the a term's position/offset in Solr

2017-03-30 Thread Rick Leir
Hi forest Do you have a html to richtext converter? You could use it on the highlighter's output. Otherwise you could count characters in the html. That might only be useful if your richtext font is fixed width. Cheers -- Rick On March 30, 2017 4:39:39 AM EDT, forest_soup

Re: Is there a way to retrieve the a term's position/offset in Solr

2017-03-30 Thread forest_soup
Unfortunately the rich text is not an html/xml/doc/pdf or any other popular rich text format. And we would like to show the highlighted text in the doc's own specific viewer. That's why I eagerly want the offset. The /tvrh (term vector component) and tv.offsets/tv.positions can give us such

Re: Is there a way to retrieve the a term's position/offset in Solr

2017-03-30 Thread Bjarke Buur Mortensen
OK, so the next thing to do would be to index and store the rich text ... is it HTML? Because then you can use HTMLStripCharFilterFactory in your analyzer, and still get the correct highlight back with hl.fragsize=0. I would think that you will have a hard time using the term positions, if what

Re: format data at source or format data during indexing?

2017-03-30 Thread Derek Poh
Hi Alex, thank you for pointing out the UpdateRequestProcessor option. On 3/30/2017 11:43 AM, Alexandre Rafalovitch wrote: I am not sure I can tell how to decide on one or another. However, I wanted to mention that you also have an option of doing it in the UpdateRequestProcessor chain. That's