Re: locks and high CPU

2015-10-22 Thread Rallavagu Kon
Erick, indexing is happening via Solr cloud server. This thread was from the leader. Some followers show symptoms of high CPU during this time. Do you think this is from locking? What is the thread that is holding the lock doing? Also, we are unable to reproduce this issue in our load test environment.

Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-22 Thread Scott Chu
Hi solr-user, I can't judge the cause from a quick glance at your definition, but here are some suggestions: 1. I took a look at Jieba. It uses a dictionary and seems to do a good job on CJK. I doubt this problem comes from those filters (note: I can understand you may use CJKWidthFilter to

Is it possible to specify only one-character term synonym for 2-gram tokenizer?

2015-10-22 Thread Scott Chu
Hi solr-user, I always use CJKTokenizer on a fair amount of Chinese news articles. Say in Chinese, character C1 has the same meaning as character C2 (e.g. 台=臺). Is it possible that I only add this line in synonym.txt: C1,C2 (and in a real example: 台, 臺) and by applying CJKTokenizer and

Re: Solr Pagination

2015-10-22 Thread Toke Eskildsen
On Wed, 2015-10-14 at 10:17 +0200, Jan Høydahl wrote: > I have not benchmarked various numbers of segments at different sizes > on different HW etc, so my hunch could very well be wrong for Salman’s case. > I don’t know how frequent the updates to his data are either. > > Have you done #segments

Re: `cat /dev/null > solr-8983-console.log` frees host's memory

2015-10-22 Thread Shalin Shekhar Mangar
Hi Tim, Should we remove the console appender by default? This is very trappy I guess. On Tue, Oct 20, 2015 at 11:39 PM, Timothy Potter wrote: > You should fix your log4j.properties file to not log to console ... > it's there for the initial getting started experience, but

Highlighting queries in parentheses

2015-10-22 Thread Michał Słomkowski
Hello, recently I've deployed Solr 5.2.1 and I've observed the following issue: My documents have two fields: id and text. Solr is configured to use FastVectorHighlighter (I've tried StandardHighlighter too, no difference). I've created the schema.xml, solrconfig.xml hasn't been changed in

Re: locks and high CPU

2015-10-22 Thread Erick Erickson
The details are in Tim's blog post and the linked JIRAs. Unfortunately, the only real solution I know of is to upgrade to at least Solr 5.2. Meanwhile, throttling the indexing rate will at least smooth out the issue. Not a great approach but all there is for 4.6. Best, Erick On Thu, Oct 22, 2015

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Erik Hatcher
Yes, it works (now, not sure when though). I just adjusted the TestContentStreamDataSource test case, see patch pasted below, that passes. Note that the solrconfig file has a mistake in that the attribute ‘key’ isn’t correct - should be ‘name’ (this was tested on trunk via IntelliJ, just FYI

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Alexandre Rafalovitch
You need to tell the second call which documents to update. Are you doing that? There may also be a wrinkle in the URP order, but let's get the first step working first. On 22 Oct 2015 12:59 pm, "Roxana Danger" wrote: > yes, it's working now... but I can not use

Re: Wildcard "?" ?

2015-10-22 Thread Bruno Mannina
Upayavira, Thanks a lot for this information. Regards, Bruno On 21/10/2015 19:24, Upayavira wrote: regexp will match the whole term. So, if you have stemming on, magnetic may well stem to magnet, and that is the term against which the regexp is executed. If you want to do the regexp
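
Upayavira's point about whole-term matching can be illustrated with a small sketch. It is not Solr code: `toy_stem` is a hypothetical, deliberately naive stand-in for a real stemmer, and Python's `re.fullmatch` is only an approximation of Lucene's regexp semantics (the two regexp syntaxes differ).

```python
import re


def toy_stem(word: str) -> str:
    """Hypothetical, naive stemmer for illustration only: strips a trailing 'ic'."""
    return word[:-2] if word.endswith("ic") else word


def regexp_matches_term(pattern: str, indexed_term: str) -> bool:
    """The pattern must match the entire indexed term, not a substring of it."""
    return re.fullmatch(pattern, indexed_term) is not None


# What ends up in the index after stemming is "magnet", not "magnetic".
indexed_term = toy_stem("magnetic")

# A pattern written against the original word finds nothing to match.
matches_original_form = regexp_matches_term("magnet.c", indexed_term)

# A pattern that covers the whole stemmed term does match.
matches_stemmed_form = regexp_matches_term("magn.t", indexed_term)

# A pattern that covers only a prefix of the term does not match.
matches_prefix_only = regexp_matches_term("magn", indexed_term)
```

In this sketch only `matches_stemmed_form` is true, which is why a regexp query written against the unstemmed word can come back empty on a stemmed field.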

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Mikhail Khludnev
Hello Roxana, I feel it's almost impossible. I can only suggest committing to make new terms visible. There is SolrCore.getRealtimeSearcher() but I never understood what it does. On Thu, Oct 22, 2015 at 1:20 PM, Roxana Danger < roxana.dan...@reedonline.co.uk> wrote: > Hello, > > I would like to

Re: How to get the join data by multiple cores?

2015-10-22 Thread Mikhail Khludnev
thread hijack: Erick, wdyt about writing a query-time analog of [child] https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents ? On Thu, Oct 22, 2015 at 6:32 PM, Erick Erickson wrote: > You will NOT get the stored fields from the child record >

Re: How to get the join data by multiple cores?

2015-10-22 Thread Erick Erickson
Mikhail: Brilliant! Assuming we can get the "from" and "to" parameters out of the query and, perhaps, the fromIndex (for cross-core) then it _should_ just be a matter of fetching the from doc and adding the fields. And since it's only operating on the returned documents it also shouldn't be very

Re: locks and high CPU

2015-10-22 Thread Rallavagu
Thanks Erick. Currently, migrating to 5.3 and it is taking a bit of time. Meanwhile, I looked at the JIRAs from the blog and the stack trace looks a bit different from what I see but not sure if they are related. Also, as per the stack trace I have included in my original email, it is the

Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

2015-10-22 Thread Scott Chu
Hi Emir, Very weird. I replied to your email from home many times yesterday, but the replies never showed up on the solr-user mailing list. I don't know why, so I'm replying again from the office. Hope this shows up. Thanks for your explanation. I'll look at PatternReplaceCharFilter as a workaround (As I

Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-22 Thread Scott Chu
Hi Edwin, Since you've tested all my suggestions and the problem is still there, I can't think of anything wrong with your configuration. Now I can only suspect two things: 1. You said the problem only happens on the "contents" field, so maybe there's something wrong with the contents of that

Unable to extract images content (OCR) from PDF files using Solr

2015-10-22 Thread Damien Picard
Hi, I'm using Solr 5.3.0 on Red Hat EL 7 and I'm trying to extract content from PDF, Word, LibreOffice, etc. docs using the ExtractingRequestHandler. Everything works fine, except when I want to extract content from embedded images in PDF/Word etc. documents: I send an extract request like this

getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Hello, I would like to create an updateRequestProcessorChain that should be executed after a DB DIH. I am extending the UpdateRequestProcessorFactory and the UpdateRequestProcessor classes. The method processAdd of my UpdateRequestProcessor should be able to update the documents with the indexed

Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

2015-10-22 Thread Emir Arnautovic
Hi Scott, Using PatternReplaceCharFilter is not the same as replacing raw data (replacing raw data is not a proper solution, as it does not solve the issue when searching with the "other" character). This is part of token standardization, no different than lower casing - it is a standard approach as well when

Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

2015-10-22 Thread Emir Arnautovic
Hi Scott, I don't have experience with Chinese, but SynonymFilter works on tokens, so if CJKTokenizer recognizes C1 and C2 as tokens, it should work. If not, then you can try configuring PatternReplaceCharFilter to replace C1 with C2 during indexing and searching and get a match. Thanks, Emir
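
The idea in Emir's reply can be sketched in a few lines. This is a toy model, not Solr's analysis chain: `bigrams` is a simplified stand-in for a 2-gram CJK tokenizer (it does not model how a lone character is handled), `normalize` stands in for a character filter applied before tokenization, and the mapping direction (臺 to 台) is an arbitrary choice for the example.

```python
def bigrams(text: str) -> list[str]:
    """Simplified 2-gram tokenizer: every pair of adjacent characters is a token."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def normalize(text: str, mapping: dict[str, str]) -> str:
    """Stand-in for a char filter: replace variant characters before tokenizing."""
    return text.translate(str.maketrans(mapping))


# Assumed mapping for illustration: fold the variant 臺 into 台.
MAPPING = {"臺": "台"}

# Without normalization, the single character never appears as a token,
# so a one-character synonym rule has nothing to match against.
raw_tokens = bigrams("臺北市")

# With the same normalization applied at index time and at query time,
# both sides produce identical bigrams.
indexed_tokens = bigrams(normalize("臺北市", MAPPING))
query_tokens = bigrams(normalize("台北", MAPPING))
query_hits = set(query_tokens) <= set(indexed_tokens)
```

Because the replacement happens on characters before any tokens exist, it sidesteps the question of whether the tokenizer ever emits C1 or C2 as standalone tokens.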

Re: Zookeeper Quorum leader election

2015-10-22 Thread Arcadius Ahouansou
The leader election issue we were having was solved by passing -Djava.net.preferIPv4Stack=true to the zookeeper startup script. It seems our Linux servers have IPv6 enabled but we have no IPv6 network. Hope this helps others. Arcadius. On 4 September 2015 at 04:57, Arcadius Ahouansou

Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

2015-10-22 Thread Scott Chu
Hi solr-user, Ya, I thought about replacing C1 with C2 in the underlying raw data. However, it's a huge data set (over 10M news articles) so I gave up on this strategy earlier. My current temporary solution is going back to a 1-gram tokenizer (i.e. StandardTokenizer) so I can only set 1 rule.

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Erik Hatcher
Roxana - What is the purpose of doing this? (that’ll help guide the best approach) It can be quite handy to get the terms from analysis into a field as stored values and to separate terms into separate fields and such. Here’s a presentation where I detailed an update script trick that

Select sibling data via XPathEntityProcessor

2015-10-22 Thread Routley, Alan
Hi, Given an xml structure: Subject 032-001946363 Subject 037-001946370 Author 040-001959713 Author 040-001959829

Some SolR nodes are hanging, bringing down the entire cluster

2015-10-22 Thread Stephane Lagraulet
Hello all, We experienced two major problems in two days on one of our data centers. Here is our setup: 15 nodes, 3 shards, one replica per node, around 50Gb of index per shard. We are running Solr 4.10.4 on Ubuntu servers using jdk 1.8.0u51. We have an ensemble of 5 zookeeper nodes to

Re: Select sibling data via XPathEntityProcessor

2015-10-22 Thread Alexandre Rafalovitch
I don't think DIH supports siblings. Have you thought of using an XSLT processor before sending XML to Solr? Or using it instead of DIH during the update (not a well-known part of Solr):

Re: `cat /dev/null > solr-8983-console.log` frees host's memory

2015-10-22 Thread Shawn Heisey
On 10/22/2015 12:24 AM, Shalin Shekhar Mangar wrote: > Should we remove the console appender by default? This is very trappy I guess. The only time we should need console logging is when Solr is run in the foreground, and in that case, it should not be saved to a file, just printed on the

Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-22 Thread Zheng Lin Edwin Yeo
Hi Scott, Thank you for your response and suggestions. With regard to your questions, here are the answers: 1. I took a look at Jieba. It uses a dictionary and seems to do a good job on CJK. I doubt this problem comes from those filters (note: I can understand you may use CJKWidthFilter to

Split shard onto new physical volumes

2015-10-22 Thread Nikolay Shuyskiy
Hello. We have a Solr 5.3.0 installation with ~4 TB index size, and the volume containing it is almost full. I hoped to utilize SolrCloud power to split index into two shards or Solr nodes, thus spreading index across several physical devices. But as I look closer, it turns out that

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Hi Alex, My idea behind this is to avoid two calls: first, the importer and after, the updater. As there is an update processor chain that can be used after the DIH, I thought it was possible to get a real-time updater. So, I am taking your advice and dividing the process into different steps. I

Re: Split shard onto new physical volumes

2015-10-22 Thread Shawn Heisey
On 10/22/2015 8:29 AM, Nikolay Shuyskiy wrote: > I imagined that I could, say, add two new nodes to SolrCloud, and split > shard so that two new shards ("halves" of the one being split) will be > created on those new nodes. > > Right now the only way to split shard in my situation I see is to

Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-22 Thread Zheng Lin Edwin Yeo
Hi Scott, Thank you for your response. 1. You said the problem only happens on the "contents" field, so maybe there's something wrong with the contents of that field. Does it contain anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions something about HTML stripping will

EdgeNGramFilterFactory for Chinese characters

2015-10-22 Thread Zheng Lin Edwin Yeo
Hi, I would like to check: is it good to use EdgeNGramFilterFactory for indexes that contain Chinese characters? Will it affect the accuracy of the search for Chinese words? I have rich-text documents that are in both English and Chinese, and currently I have EdgeNGramFilterFactory enabled during

Re: Solr fails to start with log file not found error

2015-10-22 Thread Shawn Heisey
On 10/22/2015 10:49 PM, awhosit wrote: > The non-working one is solr 5.2.1/SLES 12. > But I have working ones with solr 5.2.1/SLES 11 and solr 5.2.1/Ubuntu 14. > > The log left in solr-8983-console.log is as follows. > I'm using OpenJDK 1.7 as follows. > > java version "1.7.0_85" > OpenJDK Runtime

Get this committed

2015-10-22 Thread William Bell
I can confirm this is working in PROD at 100M hits a day. Can we commit it please? Begging here. https://issues.apache.org/jira/browse/SOLR-7993 -- Bill Bell billnb...@gmail.com cell 720-256-8076

Re: Solr fails to start with log file not found error

2015-10-22 Thread awhosit
Hi, I'm a newbie with solr, but I have the same issue. More precisely, only one machine can't start solr, with the message, "cannot open {solr.log} file for reading: No such file or directory." Obviously there is no such file, and even when I created an empty one, it didn't help. I've also tried - moving around the

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Dear Mikhail, Thank you very much for your advice. I have tried, but the realTimeSearcher didn't help... This may look very silly but: can a commit be called with RunUpdateProcessorFactory? Can I use it twice in an updateRequestProcessorChain? Thank you very much again, Roxana On 22 October 2015

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Alexandre Rafalovitch
You are doing things out of order. It's DIH, URP, then indexer. Any attempt to subvert that order for the record being indexed will end in problems. Have you considered doing a dual path? Index, then update. Of course, your fields all need to be stored for that. Also, perhaps you need to rethink

Re: Index Multiple entity in one collection core

2015-10-22 Thread Alexandre Rafalovitch
When you run a full-import, Solr will try to delete old documents before importing the new ones. If there are several top-level entities, they step on each other's feet. Use preImportDeleteQuery to avoid that (as per

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Hi Erik, Thanks for the links, but the analyzers are called correctly. The problem is that I need to get access to the whole set of terms through a searcher, but the request searcher cannot retrieve any terms because the commit method has not been called yet. My idea behind this is to avoid two

Re: Zookeeper Quorum leader election

2015-10-22 Thread Erick Erickson
Thanks for adding that to our collective knowledge store! On Thu, Oct 22, 2015 at 2:44 AM, Arcadius Ahouansou wrote: > The leader election issue we were having was solved by passing > > -Djava.net.preferIPv4Stack=true > > to zookeeper startup script > > It seems our Linux

Re: How to get the join data by multiple cores?

2015-10-22 Thread Erick Erickson
You will NOT get the stored fields from the child record with the join operation, it's called "pseudo join" for a good reason. It's usually a mistake to try to force Solr to perform just like a database. I would seriously consider flattening (denormalizing) the data if at all possible. Best,

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Jack Krupansky
It is still not clear what problem you are really trying to solve. This is what we call an XY problem - you are focusing on your intended solution but not describing the original, underlying problem, the application itself. IOW, there may be a much more appropriate solution for us to suggest if

OOM on solr cloud 5.2.1, does not trigger oom_solr.sh

2015-10-22 Thread Raja Pothuganti
Hi, Sometimes I see OOM happening on replicas, but it does not trigger the script oom_solr.sh, which was passed in as -XX:OnOutOfMemoryError=/actualLocation/solr/bin/oom_solr.sh 8091. These OOMs happened while DIH was importing data from the database. Is this a known issue? Is there any quick fix? Here are stack

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Alexandre Rafalovitch
Well, I guess I imagined three steps: 1) Run DIH 2) Get the tokenized representation for each document using facets or other approaches 3) Submit document partial-update request with additional custom processing through URP Your example seems to be skipping step 2, so the URP chain does not know
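
Step 3 of the outline above, the partial-update request, could look roughly like the sketch below. It only builds the JSON body of an atomic update using Solr's `set` modifier; the document id `job-1`, the field name `top_terms`, and the helper `build_atomic_update` are hypothetical names chosen for illustration, and sending the request (and routing it through a particular URP chain) is left out.

```python
import json


def build_atomic_update(doc_id: str, field: str, terms: list[str]) -> dict:
    """Build one atomic-update document that overwrites `field` on an existing doc."""
    return {"id": doc_id, field: {"set": list(terms)}}


# Terms gathered in step 2 (for example via facets) for one already-indexed document.
terms_from_step_2 = ["java", "solr"]

# The update handler accepts a JSON array of such documents in the request body.
payload = json.dumps(
    [build_atomic_update("job-1", "top_terms", terms_from_step_2)],
    ensure_ascii=False,
)
```

As noted elsewhere in this thread, this index-then-update path relies on the document's fields being stored, since a partial update rebuilds the document from its stored values.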

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Erik Hatcher
Setting “update.chain” in the DataImportHandler handler defined in solrconfig.xml should allow you to specify which update chain is used. Can you confirm that works, Shawn? This is from DataImportHandler.java: UpdateRequestProcessorChain processorChain =

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Shawn Heisey
On 10/22/2015 10:32 AM, Erik Hatcher wrote: > Setting “update.chain” in the DataImportHandler handler defined in > solrconfig.xml should allow you to specify which update chain is used. Can > you confirm that works, Shawn? I tried this a couple of years ago without luck. Does it work now?

Solrcloud (4.10) reports the end of soft commit before all shard replicas finished committing

2015-10-22 Thread vsolakhian
We are seeing strange behavior in our SolrCloud-related code after upgrading from Solr 4.4 to Solr 4.10 (as part of upgrading from Cloudera CDH 4.6 to Cloudera CDH 5.4.5). We have a SolrCloud collection with three replicas of one shard. Our code does batch indexing, then submits a soft commit

locks and high CPU

2015-10-22 Thread Rallavagu
Solr 4.6.1 cloud. Looking into a thread dump, 4-5 threads are causing CPU to go very high and causing issues. These are Tomcat's HTTP threads and they are locking. Can anybody help me understand what is going on here? I see that incoming connections are coming in for updates and they are being passed on to

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Hi Alexandre, The DIH is executed correctly and the tokenized representation is obtained correctly, but the URP chain is not executed with the call: http://localhost:8983/solr/reed_jobs/update/details?commit=true Isn't it the correct URL? Is there any parameter missing? Best, Roxana On 22

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Shawn Heisey
On 10/22/2015 10:09 AM, Roxana Danger wrote: > The DIH is executed correctly and the tokenized representation is obtained > correctly, but the URP chain is not executed with the call: > http://localhost:8983/solr/reed_jobs/update/details?commit=true > Isn't it the correct URL? is there any

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Yes, it arrives there... On 22 October 2015 at 17:32, Erik Hatcher wrote: > Setting “update.chain” in the DataImportHandler handler defined in > solrconfig.xml should allow you to specify which update chain is used. Can > you confirm that works, Shawn? > > This is from

Re: locks and high CPU

2015-10-22 Thread Erick Erickson
Prior to Solr 5.2, there were several inefficiencies when distributing updates to replicas, see: https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/. The symptom was that there was significantly higher CPU utilization on the followers compared to the leader. The

Re: Solr Full text search

2015-10-22 Thread Shawn Heisey
On 10/22/2015 10:37 AM, vitaly bulgakov wrote: > But it returns no results when the query has a term which is not in a > document. > Say searching for "building constructor" I get a result, but > searching for "good building constructor" returns no results because there > are no documents

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Yes, it's working now... but I cannot use the updateprocessor chain. I need to use the DIH first and then the URP, but I am not having any luck updating my docs with the URL: http://localhost:8983/solr/reed_jobs/update/jtdetails?commit=true Do you manage to use an updateProcessor chain after use