Re: locks and high CPU

2015-10-22 Thread Rallavagu Kon
Erick,

Indexing is happening via the Solr cloud server. This thread dump was taken from
the leader. Some followers show symptoms of high CPU during this time. Do you
think this is caused by locking? What is the thread that is holding the lock
doing? Also, we are unable to reproduce this issue in the load-test environment.
Any clues would help.

> On Oct 22, 2015, at 09:50, Erick Erickson  wrote:
> 
> Prior to Solr 5.2, there were several inefficiencies when distributing
> updates to replicas, see:
> https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/.
> 
> The symptom was that there was significantly higher CPU utilization on
> the followers
> compared to the leader.
> 
> The only real fix is to upgrade to 5.2+ assuming that's your issue.
> 
> How are you indexing? Using SolrJ with CloudSolrServer would help if
> you're not already using it.
> 
> Best,
> Erick
> 
>> On Thu, Oct 22, 2015 at 9:43 AM, Rallavagu  wrote:
>> Solr 4.6.1 cloud
>> 
>> Looking into a thread dump, 4-5 threads are causing CPU to go very high and
>> causing issues. These are Tomcat's HTTP threads, and they are locking. Can
>> anybody help me understand what is going on here? I see incoming connections
>> for updates being passed on to StreamingSolrServers and subsequently to
>> ConcurrentUpdateSolrServer, and they both hold locks. Thanks.
>> 
>> 
>> "http-bio-8080-exec-4394" id=8774 idx=0x988 tid=14548 prio=5 alive,
>> native_blocked, daemon
>>at __lll_lock_wait+34(:0)@0x38caa0e262
>>at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7fc29b9c9138
>>at trapiNormalHandler+484(traps_posix.c:220)@0x7fc29b9fd745
>>at _L_unlock_16+44(:0)@0x38caa0f710
>>at
>> java/util/concurrent/locks/ReentrantLock.lock(ReentrantLock.java:262)[optimized]
>>at
>> org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:391)[inlined]
>>at
>> org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
>>at
>> org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
>>at
>> org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
>>at
>> org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
>>^-- Holding lock:
>> org/apache/solr/update/StreamingSolrServers$1@0x496cf6e50[biased lock]
>>^-- Holding lock:
>> org/apache/solr/update/StreamingSolrServers@0x49d32adc8[biased lock]
>>at
>> org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
>>at
>> org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]
>>at org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
>>at
>> org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
>>at
>> org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
>>at
>> org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
>>at
>> org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
>>at
>> org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)[optimized]
>>at
>> org/apache/catalina/core/StandardWrapperValve.invoke(StandardWrapperValve.java:222)[optimized]
>>at
>> org/apache/catalina/core/StandardContextValve.invoke(StandardContextValve.java:123)[optimized]
>>at
>> org/apache/catalina/core/StandardHostValve.invoke(StandardHostValve.java:171)[optimized]
>>at
>> org/apache/catalina/valves/ErrorReportValve.invoke(ErrorReportValve.java:99)[optimized]
>>at
>> org/apache/catalina/valves/AccessLogValve.invoke(AccessLogValve.java:953)[optimized]
>>at
>> org/apache/catalina/core/StandardEngineValve.invoke(StandardEngineValve.java:118)[optimized]
>>at
>> org/apache/catalina/connector/CoyoteAdapter.service(CoyoteAdapter.java:408)[optimized]
>>at
>> org/apache/coyote/http11/AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)[optimized]
>>at
>> org/apache/coyote/AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)[optimized]
>>at
>> org/apache/tomcat/util/net/JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)[optimized]
>>^-- Holding lock:
>> org/apache/tomcat/util/net/SocketWrapper@0x496e58810[thin lock]
>>at
>> java/util/concurrent/ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)[inlined]
>>at
>> java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)[optimized]
>>at java/lang/Thread.run(Thread.java:682)[optimized]
>>at jrockit/vm/RNI.c2java(J)V(Native Method)
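For context on Erick's suggestion above: indexing through SolrJ's CloudSolrServer makes the client ZooKeeper-aware, so each update goes straight to the right shard leader. A minimal sketch against the SolrJ 4.x API; the ZooKeeper hosts and the collection and field names are placeholders:

  import java.io.IOException;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class CloudIndexer {
      public static void main(String[] args) throws IOException, SolrServerException {
          // Connect via ZooKeeper so the client sees the cluster state
          // and routes each document to its shard leader.
          CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
          server.setDefaultCollection("collection1"); // placeholder collection

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-1");
          doc.addField("title", "example title");
          server.add(doc);

          server.commit();
          server.shutdown();
      }
  }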


Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-22 Thread Scott Chu
Hi solr-user,

I can't judge the cause from a quick glance at your definitions, but here are
some suggestions:

1. I took a look at Jieba. It uses a dictionary and seems to do a good job
on CJK. I suspect this problem may come from the other filters (note: I can
understand you may use CJKWidthFilter to convert Japanese, but I don't
understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
commenting out those filters, leaving only Jieba and StopFilter, to see if the
problem disappears?

2. Does this problem occur only on Chinese search words? Does it happen on
English search words?

3. To use FastVectorHighlighter, you seem to need all three term* parameters
enabled in the field declaration, and I see only one is enabled. Please refer
to the answer to this Stack Overflow question:
http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only


Scott Chu,scott@udngroup.com
2015/10/22 
- Original Message - 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-20, 12:04:11
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Scott,

Here's my schema.xml for content and title, which uses text_chinese. The
problem only occurs in content, and not in title.

[The schema.xml excerpt was stripped of its XML markup by the mail archive;
see the reconstruction sketch below.]

Here's my solrconfig.xml on the highlighting portion:

[The solrconfig.xml excerpt was likewise stripped of its XML markup; the
surviving parameter values are folded into the reconstruction sketch below.]
Meanwhile, I'll take a look at the articles too.

Thank you.

Regards,
Edwin
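The XML markup in both excerpts above was stripped by the mail archive. A best-effort reconstruction sketch, inferred from the filter names discussed in this thread and from attribute fragments that survive in a later copy of this message; the Jieba tokenizer class path, the filter order, and the mapping of surviving values to parameter names (echoParams, rows, wt, and so on) are all assumptions:

  <field name="title"   type="text_chinese" indexed="true" stored="true"
         omitNorms="true" termVectors="true"/>
  <field name="content" type="text_chinese" indexed="true" stored="true"
         omitNorms="true" termVectors="true"/>

  <fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <!-- the Jieba factory class path (from jieba-analysis) is elided -->
      <tokenizer class="...JiebaTokenizerFactory" segMode="SEARCH"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
      <filter class="solr.StopFilterFactory"
              words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="...JiebaTokenizerFactory" segMode="SEARCH"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
      <filter class="solr.StopFilterFactory"
              words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    </analyzer>
  </fieldType>

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="wt">json</str>
      <str name="indent">true</str>
      <str name="df">text</str>
      <str name="fl">id, title, content_type, last_modified, url, score</str>
      <str name="hl">on</str>
      <str name="hl.fl">id, title, content, author, tag</str>
      <str name="hl.encoder">html</str>
      <int name="hl.fragsize">200</int>
      <!-- further surviving values (true, signature, true, 100) could not be
           mapped to parameter names with confidence and are omitted -->
    </lst>
  </requestHandler>

  <!-- inside the solr.HighlightComponent configuration -->
  <boundaryScanner name="breakIterator"
                   class="solr.highlight.BreakIteratorBoundaryScanner">
    <lst name="defaults">
      <str name="hl.bs.type">WORD</str>
      <str name="hl.bs.language">en</str>
      <str name="hl.bs.country">SG</str>
    </lst>
  </boundaryScanner>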


On 20 October 2015 at 11:32, Scott Chu  wrote:

> Hi Edwin,
>
> I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
> know) so I didn't experience this problem.
>
> I'd suggest you post your schema.xml so we can see how you define your
> content field and the field type it uses?
>
> In the mean time, refer to these articles, maybe the answer or workaround
> can be deducted from them.
>
> https://issues.apache.org/jira/browse/SOLR-3390
>
> http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
>
> http://qnalist.com/questions/667066/highlighting-marks-wrong-words
>
> Good luck!
>
>
>
>
> Scott Chu,scott@udngroup.com
> 2015/10/20
>
> - Original Message -
> *From: *Zheng Lin Edwin Yeo 
> *To: *solr-user 
> *Date: *2015-10-13, 17:04:29
> *Subject: *Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi,
>
> I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
>
> Solr. It works fine with the segmentation when I'm using
> the Analysis function on the Solr Admin UI.
>
> However, when I tried the highlighting in Solr, it does not highlight in the
> correct place. For example, when I search for 自然環境与企業本身,
> it highlights 認為自然環境与企業本身的.
>
> Even when I search for an English word like responsibility, it highlights
> *responsibilit*y.
>
> Basically, the highlighting is off by 1 character/space consistently.
>
> This problem only happens in the content field, and not in any other fields.
>
> Does anyone know what could be causing the issue?
>
> I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
>
>
> Regards,
> Edwin
>
>
>
>
>





Is it possible to specify a one-character term synonym for a 2-gram tokenizer?

2015-10-22 Thread Scott Chu
Hi solr-user,

I always use CJKTokenizer on a fair amount of Chinese news articles. Say that,
in Chinese, character C1 has the same meaning as character C2 (e.g. 台=臺). Is it
possible that I only add this line to synonym.txt:

C1,C2 (in a real example: 台,臺)

and, by applying CJKTokenizer and SynonymFilter, I only have to query "C1Cm..."
(where Cm is an arbitrary Chinese character) and Solr will return documents
that match either "C1Cm" or "C2Cm"?

Scott Chu,scott@udngroup.com
2015/10/22 
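Two hedged sketches of how this is usually wired up, anticipating the replies later in this digest. The field type names are placeholders, and StandardTokenizer plus CJKBigramFilter stands in for CJKTokenizer, which is not present in every Solr version:

  # synonyms.txt: the single-character rule from the question
  台,臺

  <!-- Option 1: SynonymFilter on 1-gram tokens (the fallback mentioned later
       in this digest). The rule fires because 台 is a standalone token, but
       the 2-gram behaviour is lost. -->
  <fieldType name="text_cjk_syn" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true"/>
    </analyzer>
  </fieldType>

  <!-- Option 2: map 台 to 臺 before tokenization, at index and query time,
       so every bigram containing 台 is normalized to its 臺 form.
       mapping-zh.txt contains the line:  "台" => "臺"  -->
  <fieldType name="text_cjk_map" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-zh.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
  </fieldType>

With Option 2, a query for "C1Cm" is normalized to "C2Cm" at analysis time, so documents containing either form match, which is the behaviour asked for above.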


Re: Solr Pagination

2015-10-22 Thread Toke Eskildsen
On Wed, 2015-10-14 at 10:17 +0200, Jan Høydahl wrote:
> I have not benchmarked various numbers of segments at different sizes
> on different HW etc., so my hunch could very well be wrong for Salman's case.
> I don't know how frequent the updates to his data are, either.
> 
> Have you done #segments benchmarking for your huge datasets?

Only informally. However, the guys at UKWA run a similar scale index and
have done multiple segment-count-oriented tests. They have not published
a report, but there are measurements & graphs at
https://github.com/ukwa/shine/tree/master/python/test-logs

- Toke Eskildsen, State and University Library, Denmark




Re: `cat /dev/null > solr-8983-console.log` frees host's memory

2015-10-22 Thread Shalin Shekhar Mangar
Hi Tim,

Should we remove the console appender by default? This seems very trappy.

On Tue, Oct 20, 2015 at 11:39 PM, Timothy Potter  wrote:
> You should fix your log4j.properties file to not log to console ...
> it's there for the initial getting-started experience, but you don't
> need to send log messages to 2 places.
>
> On Tue, Oct 20, 2015 at 10:42 AM, Shawn Heisey  wrote:
>> On 10/20/2015 9:19 AM, Eric Torti wrote:
>>> I had a 52GB solr-8983-console.log on my Solr 5.2.1 Amazon Linux
>>> 64-bit box and decided to `cat /dev/null > solr-8983-console.log` to
>>> free space.
>>>
>>> The weird thing is that when I checked Sematext I noticed the OS had
>>> freed a lot of memory at the same exact instant I did that.
>>
>> On that memory graph, the legend doesn't indicate which of the graph
>> colors represent each of the four usage types at the top -- they all
>> have blue checkboxes, so I can't tell for sure what changed.
>>
>> If the number that dropped is "cached" (which I think is likely) then
>> everything is working exactly as it should.  The OS had simply cached a
>> large chunk of the logfile, exactly as it is designed to do, and once
>> the file was truncated, it stopped reserving that memory and made it
>> available.
>>
>> https://en.wikipedia.org/wiki/Page_cache
>>
>> Thanks,
>> Shawn
>>



-- 
Regards,
Shalin Shekhar Mangar.
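For reference, the change Timothy describes amounts to dropping the CONSOLE appender from server/resources/log4j.properties. A sketch based on the stock Solr 5.x file; treat the exact contents as version-dependent:

  # Before (stock Solr 5.x): everything is logged twice, and bin/solr
  # redirects the console stream into solr-8983-console.log
  log4j.rootLogger=INFO, file, CONSOLE

  # After: keep only the rolling file appender, so the console log stays empty
  log4j.rootLogger=INFO, file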


Highlighting queries in parentheses

2015-10-22 Thread Michał Słomkowski

Hello,

I recently deployed Solr 5.2.1 and observed the following issue:

My documents have two fields: id and text. Solr is configured to use
FastVectorHighlighter (I've tried the standard highlighter too, with no
difference). I created the schema.xml; solrconfig.xml hasn't been changed in
any way.

I have the following highlighting query: text:((foo AND bar) OR eggs).
Let's say a document contains only bar and eggs. Currently both of them are
highlighted. However, the desired behaviour is to highlight only eggs, since
(foo AND bar) is not satisfied.


The query I send has following parameters:

'fl': 'id',
'hl': 'true',
'hl.requireFieldMatch': 'true',
'hl.fragListBuilder': 'single',
'hl.fragsize': '0',
'hl.fl': 'text',
'hl.mergeContiguous': 'true',
'hl.useFastVectorHighlighter': 'true',
'hl.q': 'text:((foo AND bar) OR eggs)'

I'd like to know what I should do to make it work as expected.





Re: locks and high CPU

2015-10-22 Thread Erick Erickson
The details are in Tim's blog post and the linked JIRAs.

Unfortunately, the only real solution I know of is to upgrade
to at least Solr 5.2. Meanwhile, throttling the indexing rate
will at least smooth out the issue. Not a great approach but
all there is for 4.6.

Best,
Erick

On Thu, Oct 22, 2015 at 10:48 AM, Rallavagu Kon  wrote:
> Erick,
>
> Indexing is happening via the Solr cloud server. This thread dump was taken from
> the leader. Some followers show symptoms of high CPU during this time. Do you
> think this is caused by locking? What is the thread that is holding the lock
> doing? Also, we are unable to reproduce this issue in the load-test environment.
> Any clues would help.
>
>> On Oct 22, 2015, at 09:50, Erick Erickson  wrote:
>>
>> Prior to Solr 5.2, there were several inefficiencies when distributing
>> updates to replicas, see:
>> https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/.
>>
>> The symptom was that there was significantly higher CPU utilization on
>> the followers
>> compared to the leader.
>>
>> The only real fix is to upgrade to 5.2+ assuming that's your issue.
>>
>> How are you indexing? Using SolrJ with CloudSolrServer would help if
>> you're not already using it.
>>
>> Best,
>> Erick
>>
>>> On Thu, Oct 22, 2015 at 9:43 AM, Rallavagu  wrote:
>>> Solr 4.6.1 cloud
>>>
>>> Looking into a thread dump, 4-5 threads are causing CPU to go very high and
>>> causing issues. These are Tomcat's HTTP threads, and they are locking. Can
>>> anybody help me understand what is going on here? I see incoming connections
>>> for updates being passed on to StreamingSolrServers and subsequently to
>>> ConcurrentUpdateSolrServer, and they both hold locks. Thanks.
>>>
>>>
>>> "http-bio-8080-exec-4394" id=8774 idx=0x988 tid=14548 prio=5 alive,
>>> native_blocked, daemon
>>>at __lll_lock_wait+34(:0)@0x38caa0e262
>>>at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7fc29b9c9138
>>>at trapiNormalHandler+484(traps_posix.c:220)@0x7fc29b9fd745
>>>at _L_unlock_16+44(:0)@0x38caa0f710
>>>at
>>> java/util/concurrent/locks/ReentrantLock.lock(ReentrantLock.java:262)[optimized]
>>>at
>>> org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:391)[inlined]
>>>at
>>> org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
>>>at
>>> org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
>>>at
>>> org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
>>>at
>>> org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
>>>^-- Holding lock:
>>> org/apache/solr/update/StreamingSolrServers$1@0x496cf6e50[biased lock]
>>>^-- Holding lock:
>>> org/apache/solr/update/StreamingSolrServers@0x49d32adc8[biased lock]
>>>at
>>> org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
>>>at
>>> org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]
>>>at org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
>>>at
>>> org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
>>>at
>>> org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
>>>at
>>> org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
>>>at
>>> org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
>>>at
>>> org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)[optimized]
>>>at
>>> org/apache/catalina/core/StandardWrapperValve.invoke(StandardWrapperValve.java:222)[optimized]
>>>at
>>> org/apache/catalina/core/StandardContextValve.invoke(StandardContextValve.java:123)[optimized]
>>>at
>>> org/apache/catalina/core/StandardHostValve.invoke(StandardHostValve.java:171)[optimized]
>>>at
>>> org/apache/catalina/valves/ErrorReportValve.invoke(ErrorReportValve.java:99)[optimized]
>>>at
>>> org/apache/catalina/valves/AccessLogValve.invoke(AccessLogValve.java:953)[optimized]
>>>at
>>> org/apache/catalina/core/StandardEngineValve.invoke(StandardEngineValve.java:118)[optimized]
>>>at
>>> org/apache/catalina/connector/CoyoteAdapter.service(CoyoteAdapter.java:408)[optimized]
>>>at
>>> org/apache/coyote/http11/AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)[optimized]
>>>at
>>> org/apache/coyote/AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)[optimized]
>>>at
>>> org/apache/tomcat/util/net/JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)[optimized]
>>>^-- Holding lock:
>>> 

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Erik Hatcher
Yes, it works (now; I'm not sure since when, though). I just adjusted the
TestContentStreamDataSource test case; see the patch pasted below, which passes.
Note that the solrconfig file has a mistake in that the attribute 'key' isn't
correct; it should be 'name'.

(This was tested on trunk via IntelliJ, just FYI in case that matters.)


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com




Index: src/test-files/dih/solr/collection1/conf/contentstream-solrconfig.xml
===
--- src/test-files/dih/solr/collection1/conf/contentstream-solrconfig.xml   
(revision 1700690)
+++ src/test-files/dih/solr/collection1/conf/contentstream-solrconfig.xml   
(working copy)
@@ -242,6 +242,7 @@
   
 
   data-config.xml
+  contentstream
 
 
   
@@ -295,11 +296,15 @@
 *:*
   
 
-  
+  
 
 
 
   
 
+  
+
+  
+
 
 
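The XML tags in the patch above were stripped by the mail archive. What it wires up, per the discussion, is an update.chain default on the DIH request handler pointing at a named chain. A hedged sketch of that solrconfig.xml arrangement; the chain name "contentstream" comes from the patch, while the processor list is a placeholder:

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <!-- route documents imported by DIH through the chain below -->
      <str name="update.chain">contentstream</str>
    </lst>
  </requestHandler>

  <updateRequestProcessorChain name="contentstream">
    <!-- custom processors would go before these two -->
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>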


> On Oct 22, 2015, at 12:42 PM, Shawn Heisey  wrote:
> 
> On 10/22/2015 10:32 AM, Erik Hatcher wrote:
>> Setting “update.chain” in the DataImportHandler handler defined in 
>> solrconfig.xml should allow you to specify which update chain is used.  Can 
>> you confirm that works, Shawn?
> 
> I tried this a couple of years ago without luck.  Does it work now?
> 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201308.mbox/%3c6c93c1a4-63ac-4cad-9f5b-c74f497c6...@gmail.com%3E
> 
> In the first email of the thread, I indicated I had tried 4.4 and
> 4.5-SNAPSHOT.
> 
> Thanks,
> Shawn
> 



Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Alexandre Rafalovitch
You need to tell the second call which documents to update. Are you doing
that?

There may also be a wrinkle in the URP order, but let's get the first step
working first.
On 22 Oct 2015 12:59 pm, "Roxana Danger" 
wrote:

> Yes, it's working now... but I cannot use the update processor chain. I
> need to run the DIH first and then the URP, but I am not having any luck
> updating my docs with the URL:
> http://localhost:8983/solr/reed_jobs/update/jtdetails?commit=true
>
> Did you manage to use an updateProcessor chain after the DIH without
> using the update.chain parameter?
>
> Cheers,
> Roxana
>
>
> On 22 October 2015 at 17:42, Shawn Heisey  wrote:
>
> > On 10/22/2015 10:32 AM, Erik Hatcher wrote:
> > > Setting “update.chain” in the DataImportHandler handler defined in
> > solrconfig.xml should allow you to specify which update chain is used.
> Can
> > you confirm that works, Shawn?
> >
> > I tried this a couple of years ago without luck.  Does it work now?
> >
> >
> >
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201308.mbox/%3c6c93c1a4-63ac-4cad-9f5b-c74f497c6...@gmail.com%3E
> >
> > In the first email of the thread, I indicated I had tried 4.4 and
> > 4.5-SNAPSHOT.
> >
> > Thanks,
> > Shawn
> >
> >
>
>


Re: Wildcard "?" ?

2015-10-22 Thread Bruno Mannina

Upayavira,

Thanks a lot for this information

Regards,
Bruno

Le 21/10/2015 19:24, Upayavira a écrit :

regexp will match the whole term. So, if you have stemming on, magnetic
may well stem to magnet, and that is the term against which the regexp
is executed.

If you want to do the regexp against the whole field, then you need to
do it against a string version of that field.

The process of using a regexp (and a wildcard for that matter) is:
  * search through the list of terms in your field for terms that match
  your regexp (uses an FST for speed)
  * search for documents that contain those resulting terms

Upayavira

On Wed, Oct 21, 2015, at 12:08 PM, Bruno Mannina wrote:

title:/magnet.?/ doesn't work for me, because Solr answers:

title = "Magnetic folding system"

but thanks to give me the idea to use regexp !!!

Le 21/10/2015 18:46, Upayavira a écrit :

No, you cannot tell Solr to handle wildcards differently. However, you
can use regular expressions for searching:

title:/magnet.?/ should do it.

Upayavira

On Wed, Oct 21, 2015, at 11:35 AM, Bruno Mannina wrote:

Dear Solr-user,

I'm surprised to see that in my Solr 5.0 the wildcard ? always matches
exactly 1 character.

My request is:

title:magnet? AND tire?

Solr finds only titles with a character after magnet and tire, but doesn't
find titles containing only magnet AND tire.

Do you know how I can tell Solr that the ? wildcard means [0, 1]
characters rather than exactly [1] character?
Is it possible?


Thanks a lot !

my field in my schema is defined like that:

Field: title

Field-Type: org.apache.solr.schema.TextField
PI Gap:     100

Flags:       Indexed   Tokenized   Stored   Multivalued
Properties      y          y          y          y
Schema          y          y          y          y
Index           y          y          y

Index Analyzer: org.apache.solr.analysis.TokenizerChain
Query Analyzer: org.apache.solr.analysis.TokenizerChain


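Picking up Upayavira's point about running the regexp against a string version of the field, a minimal schema sketch; the field and type names are placeholders:

  <!-- analyzed field for normal full-text search -->
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <!-- untokenized copy: the whole field value is indexed as a single term,
       so a regexp can match it end to end -->
  <field name="title_str" type="string" indexed="true" stored="false"/>
  <copyField source="title" dest="title_str"/>

A query such as title_str:/Magnet.*/ then matches against the complete, unstemmed value rather than against individual stemmed terms.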



Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Mikhail Khludnev
Hello Roxana,

I feel it's almost impossible. I can only suggest committing to make the new
terms visible.
There is SolrCore.getRealtimeSearcher(), but I have never understood what it
does.

On Thu, Oct 22, 2015 at 1:20 PM, Roxana Danger <
roxana.dan...@reedonline.co.uk> wrote:

> Hello,
>
> I would like to create an updateRequestProcessorChain that should to be
> executed after a DB DIH. I am extending UpdateRequestProcessorFactory and
> the UpdateRequestProcessor classes. The method processAdd of my
> UpdateRequestProcessor should be able to update the documents with  the
> indexed terms associated to a field. Notice that these terms should have
> been extracted with an analyzer before my updateRequestProcessorChain
> processor begins to execute.
>
> The problem I am getting is that at the point where processAdd is executed
> the field containing the terms has not been filled. To retrieve the terms I
> am using the SolrIndexSearcher provided during the request
> (req.getSearcher()). However, it seems that this searcher uses only the
> data physically stored and does not consider any of the imported data.
>
> Any idea on how can I access to searcher with all indexed/cached data when
> the processAdd method is executed?
>
> Thank you very much in advance.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: How to get the join data by multiple cores?

2015-10-22 Thread Mikhail Khludnev
thread hijack:
Erick, wdyt about writing query-time analog of [child]
https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents
?


On Thu, Oct 22, 2015 at 6:32 PM, Erick Erickson 
wrote:

> You will NOT get the stored fields from the child record
> with the join operation, it's called "pseudo join" for a
> good reason.
>
> It's usually a mistake to try to force Solr to performa just
> like a database. I would seriously consider flattening
> (denormalizing) the data if at all possible.
>
> Best,
> Erick
>
> On Wed, Oct 21, 2015 at 10:36 PM, cai xingliang 
> wrote:
> > {!join fromIndex=parent from=id to=parent_id}tag:hoge
> >
> > That should work.
> > On Oct 22, 2015 12:35 PM, "Shuhei Suzuki"  wrote:
> >
> >> hello,
> >> What can I do to issue a query such as the following in Solr?
> >>
> >>  SELECT
> >>    child.*, parent.*
> >>  FROM child
> >>  JOIN parent ON child.parent_id = parent.id
> >>  WHERE parent.tag = 'hoge'
> >>
> >> child and parent are in a many-to-one relationship (many children to one
> >> parent).
> >> I tried this, but it does not work:
> >>
> >>  /select/?q={!join from=parent_id to=id fromIndex=parent}id:1+tag:hoge
> >>
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/How-to-get-the-join-data-by-multiple-cores-tp4235799.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
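To make the suggested pseudo-join concrete, a sketch of the request; the core names come from the thread, and, as Erick notes above, the response contains fields of the queried core only:

  # select children whose parent has tag:hoge; returns child fields only,
  # never parent.*
  http://localhost:8983/solr/child/select?q={!join fromIndex=parent from=id to=parent_id}tag:hoge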





Re: How to get the join data by multiple cores?

2015-10-22 Thread Erick Erickson
Mikhail:

Brilliant! Assuming we can get the "from" and "to" parameters out of
the query and, perhaps, the fromIndex (for cross-core), then it
_should_ just be a matter of fetching the "from" doc and adding the
fields. And since it's only operating on the returned documents, it
also shouldn't be very expensive in the case of the "usual" 10-20
document result sets.

I can see it slowing things down very considerably for large result
sets, but those can be slow currently anyway.

Not sure how to specify the fields that should come from the "from"
document, but that's a tractable problem. Perhaps a different (local?)
param (fl_from or some such?).

Sounds like a JIRA to me...

On Thu, Oct 22, 2015 at 1:12 PM, Mikhail Khludnev
 wrote:
> thread hijack:
> Erick, wdyt about writing query-time analog of [child]
> https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents
> ?
>
>
> On Thu, Oct 22, 2015 at 6:32 PM, Erick Erickson 
> wrote:
>>
>> You will NOT get the stored fields from the child record
>> with the join operation, it's called "pseudo join" for a
>> good reason.
>>
>> It's usually a mistake to try to force Solr to performa just
>> like a database. I would seriously consider flattening
>> (denormalizing) the data if at all possible.
>>
>> Best,
>> Erick
>>
>> On Wed, Oct 21, 2015 at 10:36 PM, cai xingliang 
>> wrote:
>> > {!join fromIndex=parent from=id to=parent_id}tag:hoge
>> >
>> > That should work.
>> > On Oct 22, 2015 12:35 PM, "Shuhei Suzuki"  wrote:
>> >
>> >> hello,
>> >> What can I do to issue a query such as the following in Solr?
>> >>
>> >>  SELECT
>> >>    child.*, parent.*
>> >>  FROM child
>> >>  JOIN parent ON child.parent_id = parent.id
>> >>  WHERE parent.tag = 'hoge'
>> >>
>> >> child and parent are in a many-to-one relationship (many children to one
>> >> parent).
>> >> I tried this, but it does not work:
>> >>
>> >>  /select/?q={!join from=parent_id to=id fromIndex=parent}id:1+tag:hoge
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> >> http://lucene.472066.n3.nabble.com/How-to-get-the-join-data-by-multiple-cores-tp4235799.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
>


Re: locks and high CPU

2015-10-22 Thread Rallavagu
Thanks Erick. We are currently migrating to 5.3, and it is taking a bit of
time. Meanwhile, I looked at the JIRAs from the blog; the stack traces there
look a bit different from what I see, but I am not sure whether they are
related. Also, per the stack trace I included in my original email, it is
the Tomcat thread that is locking, not the recovery thread that would be
responsible for writing updates to followers. I agree that we might throttle
updates, but what is annoying is being unable to reproduce the issue in a
controlled load-test environment.

Just to understand better, what is the Tomcat thread doing in this case?

Thanks

On 10/22/15 12:53 PM, Erick Erickson wrote:

The details are in Tim's blog post and the linked JIRAs.

Unfortunately, the only real solution I know of is to upgrade
to at least Solr 5.2. Meanwhile, throttling the indexing rate
will at least smooth out the issue. Not a great approach but
all there is for 4.6.

Best,
Erick

On Thu, Oct 22, 2015 at 10:48 AM, Rallavagu Kon  wrote:

Erick,

Indexing is happening via the Solr cloud server. This thread dump was taken from
the leader. Some followers show symptoms of high CPU during this time. Do you
think this is caused by locking? What is the thread that is holding the lock
doing? Also, we are unable to reproduce this issue in the load-test environment.
Any clues would help.


On Oct 22, 2015, at 09:50, Erick Erickson  wrote:

Prior to Solr 5.2, there were several inefficiencies when distributing
updates to replicas, see:
https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/.

The symptom was that there was significantly higher CPU utilization on
the followers
compared to the leader.

The only real fix is to upgrade to 5.2+ assuming that's your issue.

How are you indexing? Using SolrJ with CloudSolrServer would help if
you're not already using it.

Best,
Erick


On Thu, Oct 22, 2015 at 9:43 AM, Rallavagu  wrote:
Solr 4.6.1 cloud

Looking into a thread dump, 4-5 threads are causing CPU to go very high and
causing issues. These are Tomcat's HTTP threads, and they are locking. Can
anybody help me understand what is going on here? I see incoming connections
for updates being passed on to StreamingSolrServers and subsequently to
ConcurrentUpdateSolrServer, and they both hold locks. Thanks.


"http-bio-8080-exec-4394" id=8774 idx=0x988 tid=14548 prio=5 alive,
native_blocked, daemon
at __lll_lock_wait+34(:0)@0x38caa0e262
at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7fc29b9c9138
at trapiNormalHandler+484(traps_posix.c:220)@0x7fc29b9fd745
at _L_unlock_16+44(:0)@0x38caa0f710
at
java/util/concurrent/locks/ReentrantLock.lock(ReentrantLock.java:262)[optimized]
at
org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:391)[inlined]
at
org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
at
org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
at
org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
at
org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
^-- Holding lock:
org/apache/solr/update/StreamingSolrServers$1@0x496cf6e50[biased lock]
^-- Holding lock:
org/apache/solr/update/StreamingSolrServers@0x49d32adc8[biased lock]
at
org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
at
org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]
at org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
at
org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
at
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
at
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
at
org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
at
org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)[optimized]
at
org/apache/catalina/core/StandardWrapperValve.invoke(StandardWrapperValve.java:222)[optimized]
at
org/apache/catalina/core/StandardContextValve.invoke(StandardContextValve.java:123)[optimized]
at
org/apache/catalina/core/StandardHostValve.invoke(StandardHostValve.java:171)[optimized]
at
org/apache/catalina/valves/ErrorReportValve.invoke(ErrorReportValve.java:99)[optimized]
at
org/apache/catalina/valves/AccessLogValve.invoke(AccessLogValve.java:953)[optimized]
at
org/apache/catalina/core/StandardEngineValve.invoke(StandardEngineValve.java:118)[optimized]
at
org/apache/catalina/connector/CoyoteAdapter.service(CoyoteAdapter.java:408)[optimized]
at

Re: Is it possible to specify a one-character term synonym for a 2-gram tokenizer?

2015-10-22 Thread Scott Chu
Hi Emir,

Very weirdly, I replied to your email from home many times yesterday, but the
replies never showed up on the solr-user list. I don't know why, so I am
replying again from the office; I hope this one shows up.

Thanks for your explanation. I'll look at PatternReplaceCharFilter as a
workaround. (As I understand it, character filters process the input stream
before the tokenizer, so in a way the indexed data no longer has the original
C1 if I do the replacement.) What I deal with are published news articles, and
I don't know how the authors of these articles will feel when they see C1 in
their articles become C2, since some terms containing C1 are proper nouns or
terminologies. I'll talk to them to see if this is OK. Thanks anyway.

Scott Chu,scott@udngroup.com
2015/10/23 
- Original Message - 
From: Emir Arnautovic 
To: solr-user 
Date: 2015-10-22, 18:20:38
Subject: Re: Is it possible to specify a one-character term synonym for a 2-gram tokenizer?


Hi Scott,
Using PatternReplaceCharFilter is not same as replacing raw data 
(replacing raw data is not proper solution as it does not solve issue 
when searching with "other" character). This is part of token 
standardization, no different than lower casing - it is standard 
approach as well when it comes to Latin characters:


Quick search of "MappingCharFilterFactory chinese" shows it is used - 
you should check if suitable for your case.

Thanks,
Emir

On 22.10.2015 11:48, Scott Chu wrote:
> Hi solr-user,
> Ya, I thought about replacing C1 with C2 in the underground raw data. 
> However, it's a huge data set (over 10M news articles) so I give up 
> this strategy eariler. My current temporary solution is going back to 
> use 1-gram tokenizer ((i.e.StandardTokenizer) so I can only set 1 
> rule. But it is kinda ugly, especially when applying highlight, e.g. 
> search "C1C2" Solr returns highlight snippet such as 
> "...C1C2...".
> Scott Chu,scott@udngroup.com 
> 2015/10/22
>
> - Original Message -
> *From: *Emir Arnautovic 
> *To: *solr-user 
> *Date: *2015-10-22, 17:08:26
> *Subject: *Re: Is it possible to specigfy only one-character term
> synonym for2-gram tokenizer?
>
> Hi Scott,
> I don't have experience with Chinese, but SynonymFilter works on
> tokens,
> so if CJKTokenizer recognizes C1 and Cm as tokens, it should work. If
> not, than you can try configuring PatternReplaceCharFilter to
> replace C1
> to C2 during indexing and searching and get a match.
>
> Thanks,
> Emir
>
> On 22.10.2015 10:53, Scott Chu wrote:
> > Hi solr-user,
> > I always uses CJKTokenizer on appropriate amount of Chinese news
> > articles. Say in Chinese, character C1 has same meaning as
> > character C2 (e.g 台=臺), Is it possible that I only add this
> line in
> > synonym.txt:
> > C1,C2 (and in true exmaple: 台, 臺)
> > and by applying CJKTokenizer and SynonymFilter, I only have to
> query
> > "C1Cm..." (say Cm is arbitrary Chinese character) and Solr will
> > return documents that matche whether "C1Cm" or "C2Cm"?
> > Scott Chu,scott@udngroup.com
>   >
> > 2015/10/22
> >
>
> -- 
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
>

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/






Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-22 Thread Scott Chu
Hi Edwin,

Since you've tested all my suggestions and the problem is still there, I can't
think of anything wrong with your configuration. Now I can only suspect two
things:

1. You said the problem only happens on the "content" field, so maybe there is
something wrong with the contents of that field. Does it contain anything
special, e.g. HTML tags or symbols? I recall SOLR-42 mentions that HTML
stripping can cause highlight problems. Maybe you can try purifying that field
until it is close to pure text and see if the highlighting comes out OK.

2. Maybe something is incompatible between JiebaTokenizer and the Solr
highlighter. You could switch to other tokenizers, e.g. Standard, CJK, or
SmartChinese (I don't use this one since I deal with Traditional Chinese, but
I see you are dealing with Simplified Chinese), or the third-party MMSeg, and
see whether the problem goes away. However, when I googled for similar
problems, I saw you asked the same question in August at huaban/jieba-analysis,
and somebody said he also uses JiebaTokenizer but does not have your problem.
So I see this as the less likely suspect.

The theory behind your problem would be that something in the indexing process
produces wrong position info for that field, so when Solr does the
highlighting, it retrieves the wrong positions and marks the wrong spans of
the target terms.

Scott Chu,scott@udngroup.com
2015/10/23 
- Original Message - 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-22, 22:22:14
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Scott,

Thank you for your response and suggestions.

In response to your questions, here are the answers:

1. I took a look at Jieba. It uses a dictionary and seems to do a good job on
CJK. I suspect this problem may come from the other filters (note: I can
understand you may use CJKWidthFilter to convert Japanese, but I don't
understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
commenting out those filters, leaving only Jieba and StopFilter, to see if this
problem disappears?
*A) Yes, I have tried commenting out the other filters, leaving only Jieba and
StopFilter. The problem is still there.*

2. Does this problem occur only on Chinese search words? Does it happen on
English search words?
*A) Yes, the same problem occurs on English words. For example, when I
search for "word", it will highlight in this way:  word*

3. To use FastVectorHighlighter, you seem to need all three term* parameters
enabled in the field declaration, and I see only one is enabled. Please refer
to the answer to this Stack Overflow question:
http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
*A) I have tried enabling all 3 term* parameters for the FastVectorHighlighter
too, but the same problem persists.*


Regards,
Edwin


On 22 October 2015 at 16:25, Scott Chu  wrote:

> Hi solr-user,
>
> I can't judge the cause from a quick glance at your definitions, but here are
> some suggestions:
>
> 1. I took a look at Jieba. It uses a dictionary and seems to do a good job
> on CJK. I suspect this problem may come from the other filters (note: I can
> understand you may use CJKWidthFilter to convert Japanese, but I don't
> understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
> commenting out those filters, leaving only Jieba and StopFilter, to see if
> the problem disappears?
>
> 2. Does this problem occur only on Chinese search words? Does it happen on
> English search words?
>
> 3. To use FastVectorHighlighter, you seem to need all three term* parameters
> enabled in the field declaration, and I see only one is enabled. Please refer
> to the answer to this Stack Overflow question:
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
>
>
> Scott Chu,scott@udngroup.com
> 2015/10/22
>
> - Original Message -
> *From: *Zheng Lin Edwin Yeo 
> *To: *solr-user 
> *Date: *2015-10-20, 12:04:11
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Here's my schema.xml for content and title, which uses text_chinese. The

> problem only occurs in content, and not in title.
>
> [The schema.xml excerpt was stripped of its XML markup by the mail archive;
> surviving fragments show the text_chinese field type with the Jieba tokenizer
> (segMode="SEARCH"), a stopword list, and an edge n-gram filter with
> maxGramSize="15". See the reconstruction sketch in the earlier copy of this
> thread above.]
>
> Here's my solrconfig.xml on the highlighting portion:
>
> [The solrconfig.xml highlighting excerpt was likewise stripped; the surviving
> values, including the BreakIteratorBoundaryScanner with WORD/en/SG, match the
> earlier copy of this message.]
>
>
> Meanwhile, I'll take a look at the 

Unable to extract images content (OCR) from PDF files using Solr

2015-10-22 Thread Damien Picard
Hi,

I'm using Solr 5.3.0 on a Red Hat EL 7 and I try to extract content from
PDF, Word, LibreOffice, etc. docs using the ExtractingRequestHandler.

Everything works fine, except when I want to extract content from embedding
images in PDF/Word etc. documents :

I send an extract request like this :
POST /update/extract?literal.id
=ocrpdf8=attr_content=attr_

In attr_content, I get :
\n \n date 2015-08-28T13:23:03Z \n
pdf:PDFVersion 1.4 \n
xmp:CreatorTool PDFCreator Version 1.2.3 \n
 stream_content_type application/pdf \n
 Keywords \n
 subject \n
 dc:creator S050735 \n
 dcterms:created 2015-08-28T13:23:03Z \n
 Last-Modified 2015-08-28T13:23:03Z \n
 dcterms:modified 2015-08-28T13:23:03Z \n
 dc:format application/pdf; version=1.4 \n
 Last-Save-Date 2015-08-28T13:23:03Z \n
 stream_name imagepdf.pdf \n
 meta:save-date 2015-08-28T13:23:03Z \n
 pdf:encrypted false \n
 dc:title imagepdf \n
 modified 2015-08-28T13:23:03Z \n
 cp:subject \n
 Content-Type application/pdf \n
 stream_size 423660 \n
 X-Parsed-By org.apache.tika.parser.DefaultParser \n
 X-Parsed-By org.apache.tika.parser.pdf.PDFParser \n
 creator S050735 \n
 meta:author S050735 \n
 dc:subject \n
 meta:creation-date 2015-08-28T13:23:03Z \n
 stream_source_info the-file \n
 created Fri Aug 28 13:23:03 UTC 2015 \n
 xmpTPg:NPages 1 \n
 Creation-Date 2015-08-28T13:23:03Z \n
 meta:keyword \n
 Author S050735 \n
 producer GPL Ghostscript 9.04 \n
 imagepdf \n
 \n
 page \n
 Page 1 sur 1\n \n
 28/08/2015
http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4...
\n \n embedded:image0.jpg image0.jpg embedded:image1.jpg image1.jpg
embedded:image2.jpg image2.jpg \n

So Tika works fine, but it doesn't apply OCR content extraction to the
embedded images.

When I post an image (JPG) on /update/extract, I get its content indexed
throught Tesseract OCR (attr_content) field :
\n \n stream_size 55422 \n
 X-Parsed-By org.apache.tika.parser.DefaultParser \n
 X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n
 stream_content_type image/jpeg \n
 stream_name OM_1.jpg \n
 stream_source_info the-file \n
 Content-Type image/jpeg \n \n \n
 ‘ '\"I“ \" \"' ./\nlrast. Shortly before the classes started I was
visiting a.\ncertain public school, a school set in a typically
English\ncountryside, which on the June clay of my visit was wonder-\nfully
beauliful. The Head Master—-no less typical than his\nschool and the
country-side—pointed out the charms of\nboth, and his pride came out in the
final remark which he made\nbeforehe left me. He explained that he had a
class to take\nin'I'heocritus. Then (with a. buoyant gesture); “ Can
you\n\n, conceive anything more delightful than a class in
Theocritus,\n\non such a day and in such a place?\"\n\n \n \n \n
stream_size 55422 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n
X-Parsed-By org.apache.tika.parser.ocr.TesseractOCRParser \n X-Parsed-By
org.apache.tika.parser.jpeg.JpegParser \n stream_content_type image/jpeg \n
Resolution Units inch \n stream_source_info the-file \n Compression Type
Progressive, Huffman \n Data Precision 8 bits \n Number of Components 3 \n
tiff:ImageLength 286 \n Component 2 Cb component: Quantization table 1,
Sampling factors 1 horiz/1 vert \n Component 1 Y component: Quantization
table 0, Sampling factors 2 horiz/2 vert \n Image Height 286 pixels \n X
Resolution 72 dots \n Image Width 690 pixels \n stream_name OM_1.jpg \n
Component 3 Cr component: Quantization table 1, Sampling factors 1 horiz/1
vert \n tiff:BitsPerSample 8 \n tiff:ImageWidth 690 \n Content-Type
image/jpeg \n Y Resolution 72 dots

I see in the Tika JIRA that I have to enable extractInlineImages in
org/apache/tika/parser/pdf/PDFParser.properties to force image extraction
from PDFs. So I did that, and I packaged a tika-app-1.7.jar that contains the
tika-parsers-1.7.jar with this file modified to set this property to true.
Then I tested my Tika JAR using the CLI:

# java -jar tika-app-1.7.jar -t /data/docs/imagepdf.pdf

In this case, I get the images' content:


Page 1 sur 1

28/08/2015
http://confluence/download/attachments/158471300/image2015-3-3+18%3A10%3A4.
..

Simple Evan!
Use Case
Sdsedulet

So I replaced solr/contrib/extraction/lib/tika-parsers-1.7.jar with my
modified one, but the images are still not extracted from my PDF.

Does anybody know what I'm doing wrong?

Thank you.

-- 
Damien Picard
Expert GWT

Mob : 06 11 51 47 78
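For reference, the Tika 1.7 switch toggled here lives inside tika-parsers. A sketch of the relevant line; the rest of the file is omitted, and the key/value format follows Tika's properties files:

  # org/apache/tika/parser/pdf/PDFParser.properties (inside tika-parsers-1.7.jar)
  # false by default; true makes PDFParser emit embedded images as attachments
  # so that an OCR-capable parser such as TesseractOCRParser can process them
  extractInlineImages true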


getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Hello,

I would like to create an updateRequestProcessorChain that should be
executed after a DB DIH import. I am extending the UpdateRequestProcessorFactory
and UpdateRequestProcessor classes. The processAdd method of my
UpdateRequestProcessor should be able to update the documents with the
indexed terms associated with a field. Notice that these terms should have
been extracted with an analyzer before my updateRequestProcessorChain
processor begins to execute.

The problem I am getting is that at the point where processAdd is executed,
the field containing the terms has not been filled. To retrieve the terms I
am using the SolrIndexSearcher provided with the request
(req.getSearcher()). However, it seems that this searcher sees only the
data physically stored and does not consider any of the newly imported data.

Any idea how I can access a searcher with all indexed/cached data when
the processAdd method is executed?

Thank you very much in advance.


Re: Is it possible to specify a one-character term synonym for a 2-gram tokenizer?

2015-10-22 Thread Emir Arnautovic

Hi Scott,
Using PatternReplaceCharFilter is not the same as replacing the raw data
(replacing the raw data is not a proper solution, as it does not solve the
issue when searching with the "other" character). This is part of token
standardization, no different from lowercasing; it is a standard approach for
Latin characters as well:

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

A quick search for "MappingCharFilterFactory chinese" shows it is used;
you should check whether it is suitable for your case.


Thanks,
Emir

On 22.10.2015 11:48, Scott Chu wrote:

Hi solr-user,
Ya, I thought about replacing C1 with C2 in the underground raw data. 
However, it's a huge data set (over 10M news articles) so I give up 
this strategy eariler. My current temporary solution is going back to 
use 1-gram tokenizer ((i.e.StandardTokenizer) so I can only set 1 
rule. But it is kinda ugly, especially when applying highlight, e.g. 
search "C1C2" Solr returns highlight snippet such as 
"...C1C2...".

Scott Chu,scott@udngroup.com 
2015/10/22

- Original Message -
*From: *Emir Arnautovic 
*To: *solr-user 
*Date: *2015-10-22, 17:08:26
*Subject: *Re: Is it possible to specigfy only one-character term
synonym for2-gram tokenizer?

Hi Scott,
I don't have experience with Chinese, but SynonymFilter works on
tokens,
so if CJKTokenizer recognizes C1 and Cm as tokens, it should work. If
not, than you can try configuring PatternReplaceCharFilter to
replace C1
to C2 during indexing and searching and get a match.

Thanks,
Emir

On 22.10.2015 10:53, Scott Chu wrote:
> Hi solr-user,
> I always uses CJKTokenizer on appropriate amount of Chinese news
> articles. Say in Chinese, character C1 has same meaning as
> character C2 (e.g 台=臺), Is it possible that I only add this
line in
> synonym.txt:
> C1,C2 (and in true exmaple: 台, 臺)
> and by applying CJKTokenizer and SynonymFilter, I only have to
query
> "C1Cm..." (say Cm is arbitrary Chinese character) and Solr will
> return documents that matche whether "C1Cm" or "C2Cm"?
> Scott Chu,scott@udngroup.com
 >
> 2015/10/22
>

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management

Solr & Elasticsearch Support * http://sematext.com/







--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Is it possible to specify a one-character term synonym for a 2-gram tokenizer?

2015-10-22 Thread Emir Arnautovic

Hi Scott,
I don't have experience with Chinese, but SynonymFilter works on tokens,
so if CJKTokenizer recognizes C1 and Cm as tokens, it should work. If
not, then you can try configuring PatternReplaceCharFilter to replace C1
with C2 during indexing and searching, and get a match.


Thanks,
Emir

On 22.10.2015 10:53, Scott Chu wrote:

Hi solr-user,
I always uses CJKTokenizer on appropriate amount of Chinese news 
articles. Say in Chinese, character C1 has same meaning as 
character C2 (e.g 台=臺), Is it possible that I only add this line in 
synonym.txt:

C1,C2 (and in true exmaple: 台, 臺)
and by applying CJKTokenizer and SynonymFilter, I only have to query 
"C1Cm..."  (say Cm is arbitrary Chinese character) and Solr will 
return documents that matche whether "C1Cm" or "C2Cm"?

Scott Chu,scott@udngroup.com 
2015/10/22



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Zookeeper Quorum leader election

2015-10-22 Thread Arcadius Ahouansou
The leader election issue we were having was solved by passing

-Djava.net.preferIPv4Stack=true

to the ZooKeeper startup script

It seems our Linux servers have IPv6 enabled but we have no IPv6 network.

Hope this helps others.

Arcadius.


On 4 September 2015 at 04:57, Arcadius Ahouansou 
wrote:

>
> We have a quorum of 3 ZK nodes zk1, zk2 and zk3.
> All nodes are identicals.
>
> After multiple restart of the ZK nodes, always keeping the majority of 2,
> we have noticed that the node zk1 has never become the leader.
> Only zk2 and zk3 become leader.
>
> 1) Is there any known reason or possible misconfiguration that may cause
> zk1 to never become a leader? (Note that we do not have any ZK group
> set up.)
>
> 2) Is there any command or way to force zk1 to become the leader of the quorum?
>
> Note that all connected SolrJ clients have zkhost set to
> zk1:port,zk2:port,zk3:port
>
> Thanks.
>
>
> --
> Arcadius Ahouansou
> Menelic Ltd | Information is Power
> M: 07908761999
> W: www.menelic.com
> ---
>



-- 
Arcadius Ahouansou
Menelic Ltd | Information is Power
M: 07908761999
W: www.menelic.com
---
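For anyone applying the same fix, one common place to put the flag is ZooKeeper's conf/java.env, which zkServer.sh sources at startup; the exact file location can vary by packaging:

  # conf/java.env -- read by zkServer.sh on startup
  # force the JVM onto the IPv4 stack on dual-stack hosts with no IPv6 network
  JVMFLAGS="$JVMFLAGS -Djava.net.preferIPv4Stack=true"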


Re: Is it possible to specify a one-character term synonym for a 2-gram tokenizer?

2015-10-22 Thread Scott Chu
Hi solr-user,

Ya, I thought about replacing C1 with C2 in the underlying raw data. However,
it's a huge data set (over 10M news articles), so I gave up on this strategy
earlier. My current temporary solution is going back to a 1-gram tokenizer
(i.e. StandardTokenizer) so I only have to set one rule. But it is kind of
ugly, especially when applying highlighting, e.g. searching for "C1C2" returns
a highlight snippet such as "...C1C2...".

Scott Chu,scott@udngroup.com
2015/10/22 
- Original Message - 
From: Emir Arnautovic 
To: solr-user 
Date: 2015-10-22, 17:08:26
Subject: Re: Is it possible to specify a one-character term synonym for a 2-gram tokenizer?


Hi Scott,
I don't have experience with Chinese, but SynonymFilter works on tokens, 
so if CJKTokenizer recognizes C1 and Cm as tokens, it should work. If 
not, than you can try configuring PatternReplaceCharFilter to replace C1 
to C2 during indexing and searching and get a match.

Thanks,
Emir

On 22.10.2015 10:53, Scott Chu wrote:
> Hi solr-user,
> I always uses CJKTokenizer on appropriate amount of Chinese news 
> articles. Say in Chinese, character C1 has same meaning as 
> character C2 (e.g 台=臺), Is it possible that I only add this line in 
> synonym.txt:
> C1,C2 (and in true exmaple: 台, 臺)
> and by applying CJKTokenizer and SynonymFilter, I only have to query 
> "C1Cm..." (say Cm is arbitrary Chinese character) and Solr will 
> return documents that matche whether "C1Cm" or "C2Cm"?
> Scott Chu,scott@udngroup.com 
> 2015/10/22
>

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/






Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Erik Hatcher
Roxana -

What is the purpose of doing this? (That will help guide the best approach.)

It can be quite handy to get the terms from analysis into a field as stored
values, and to separate terms into separate fields and such. Here's a
presentation where I detailed an update-script trick that accomplishes this:
http://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks

Within Solr, the example/files area has this very trick implemented, to pull
out URLs and e-mail addresses from full text into separate specific fields. See
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_5x/solr/example/files/conf/update-script.js
("var analyzer = "... and below)

Does that trick accomplish what you need? If not, please detail what you're
after and we'll try to help.

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com 




> On Oct 22, 2015, at 6:20 AM, Roxana Danger  
> wrote:
> 
> Hello,
> 
> I would like to create an updateRequestProcessorChain that should to be
> executed after a DB DIH. I am extending UpdateRequestProcessorFactory and
> the UpdateRequestProcessor classes. The method processAdd of my
> UpdateRequestProcessor should be able to update the documents with  the
> indexed terms associated to a field. Notice that these terms should have
> been extracted with an analyzer before my updateRequestProcessorChain
> processor begins to execute.
> 
> The problem I am getting is that at the point where processAdd is executed
> the field containing the terms has not been filled. To retrieve the terms I
> am using the SolrIndexSearcher provided during the request
> (req.getSearcher()). However, it seems that this searcher uses only the
> data physically stored and does not consider any of the imported data.
> 
> Any idea on how can I access to searcher with all indexed/cached data when
> the processAdd method is executed?
> 
> Thank you very much in advance.
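A hedged sketch of the core of the update-script trick Erik references, loosely modeled on the linked example/files script. It assumes the script runs under a StatelessScriptUpdateProcessorFactory on Java 8's Nashorn engine; the field names and the text_general field type are placeholders:

  // update-script.js
  var CharTermAttribute = Java.type(
      "org.apache.lucene.analysis.tokenattributes.CharTermAttribute");

  function processAdd(cmd) {
    var doc = cmd.solrDoc;
    var text = doc.getFieldValue("content");   // placeholder source field
    if (text == null) return;

    // Run a schema analyzer over the raw value inside the URP itself,
    // instead of trying to read not-yet-committed terms from a searcher.
    var analyzer = req.getCore().getLatestSchema()
        .getFieldTypeByName("text_general").getIndexAnalyzer();
    var stream = analyzer.tokenStream(null, text);
    var termAtt = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      doc.addField("content_terms", termAtt.toString()); // placeholder target field
    }
    stream.end();
    stream.close();
  }

  function processDelete(cmd) { }
  function processCommit(cmd) { }
  function finish() { }

This sidesteps the req.getSearcher() problem described in this thread: the terms are computed from the document in hand, not read back from the index.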



Select sibling data via XPathEntityProcessor

2015-10-22 Thread Routley, Alan
Hi,

Given an xml structure:




Subject
032-001946363


Subject
037-001946370


Author
040-001959713


Author
040-001959829


Subject
032-001961797


Author
040-001961798




I'm trying to use the XPathEntityProcessor to put all the Subject IDs into one
multiValued field and the Author IDs into another.

I was hoping I could use fields with the following XPath, but it does not
seem to be supported.



Some SolR nodes are hanging, bringing down the entire cluster

2015-10-22 Thread Stephane Lagraulet
Hello all,

We experienced a two major problems in two days on one of our data centers.
Here is our setup: 15 nodes, 3 shards, one replica per node, around 50Gb of 
index per shard.
We are running Solr 4.10.4 on Ubuntu servers using jdk 1.8.0u51.
We have an ensemble of 5 zookeeper nodes to coordinate the cluster.

We usually have an update rate of around 500 up/s coming from solrj clients.

Suddenly, for an unknown reason, one of the shard leaders was not able to connect 
to any of its slaves and initiated a recovery on all its slaves.
At this point we were not able to perform any queries on the entire cluster.
On our 15 nodes some nodes were responding, but most of the nodes were not 
answering at all (on all shards).
Their CPU was low, so I used VisualVM to see what was going on.
It appeared that the hung nodes were using around 600 threads, most of them 
being "httpShardExecutor" threads: around 100 running and a lot in park mode.
We restarted one of these nodes, and as soon as it started it created these 600 
threads.
We finally managed to get back our cluster by stopping all the incoming traffic 
and restarting the master node of the affected shard, and everything was back in 
a few minutes.
I was wondering if we hit 
SOLR-7109 but I'm not sure 
about this.

Any help would be appreciated.
Stephan


Re: Select sibling data via XPathEntityProcessor

2015-10-22 Thread Alexandre Rafalovitch
I don't think DIH supports siblings. Have you thought of using an XSLT
processor before sending XML to Solr? Or of using it instead of DIH
during the update (not a well-known part of Solr):
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-UsingXSLTtoTransformXMLIndexUpdates

With XSLT, you could conform your format directly to the Solr XML
update format and not bother with field mapping.
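A sketch of that second route (the element names record, role and id are
assumptions, since the original XML markup did not survive the archive): save
something like this under conf/xslt/siblings.xsl and post the file with
/update?tr=siblings.xsl&commit=true:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <add>
          <doc>
            <!-- repeat a field element per value: that is how multiValued
                 fields are expressed in the Solr XML update format -->
            <xsl:for-each select="//record[role='Subject']">
              <field name="subject_ids"><xsl:value-of select="id"/></field>
            </xsl:for-each>
            <xsl:for-each select="//record[role='Author']">
              <field name="author_ids"><xsl:value-of select="id"/></field>
            </xsl:for-each>
          </doc>
        </add>
      </xsl:template>
    </xsl:stylesheet>

(The uniqueKey field is omitted here; a real transform would also emit it.)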

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 22 October 2015 at 10:17, Routley, Alan  wrote:
> Hi,
>
> Given an xml structure:
>
> 
> 
> 
> Subject  032-001946363
> Subject  037-001946370
> Author   040-001959713
> Author   040-001959829
> Subject  032-001961797
> Author   040-001961798
> 
> 
> 
>
> I’m trying to use the XPathEntityProcessor to put all the Subject Id’s into 
> one multiValued field and the Author Id’s into another.
>
> I was hoping I could use fields with the following, but the XPath does not 
> seem to be supported.
>


Re: `cat /dev/null > solr-8983-console.log` frees host's memory

2015-10-22 Thread Shawn Heisey
On 10/22/2015 12:24 AM, Shalin Shekhar Mangar wrote:
> Should we remove the console appender by default? This is very trappy I guess.

The only time we should need console logging is when Solr is run in the
foreground, and in that case, it should not be saved to a file, just
printed on the console.  The log4j console output contains the same
information as the rotated logfile, so even when Solr is run in the
background, I think it's completely unnecessary to save it.
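A minimal log4j.properties sketch of that arrangement (appender names and
pattern follow the stock Solr config, but treat the details as assumptions):

    log4j.rootLogger=INFO, file
    # no CONSOLE appender; point -Dlog4j.configuration at a config that
    # logs to the console when running Solr in the foreground instead
    log4j.appender.file=org.apache.log4j.RollingFileAppender
    log4j.appender.file.File=${solr.log}/solr.log
    log4j.appender.file.MaxFileSize=4MB
    log4j.appender.file.MaxBackupIndex=9
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m%n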

We do need to save console output in the startup script when running in
the background, because startup problems and some kinds of developer
debug output are likely to be reported there.

The start script should be able to use a different log4j config file
that logs to the console when running in the foreground.  Looking at the
bash script, that will probably require a little bit of rework on the
script flow, but shouldn't be particularly difficult.  I have not looked
at the Windows script.

When I wrote my own init scripts a few years back, I was using log4j
before the official switch in 4.3.0, and I was only sending log4j output
to a file, not the console.

My init script sent stdout and stderr to different files -- "logs/out"
and "logs/err".  I like this arrangement, but I wonder if that's too
much complexity for the general case.

I opened SOLR-8186 for the enhancement idea.

Thanks,
Shawn



Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-22 Thread Zheng Lin Edwin Yeo
Hi Scott,

Thank you for your response and suggestions.

In response to your questions, here are the answers:

1. I took a look at Jieba. It uses a dictionary and it seems to do a good
job on CJK. I suspect this problem may come from those filters (note: I can
understand you may use CJKWidthFilter to convert Japanese, but I don't
understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
commenting out those filters, say leaving only Jieba and StopFilter, and seeing
if this problem disappears?
*A) Yes, I have tried commenting out the other filters and only left with
Jieba and StopFilter. The problem is still there.*

2.Does this problem occur only on Chinese search words? Does it happen on
English search words?
*A) Yes, the same problem occurs on English words. For example, when I
search for "word", it will highlight in this way:  word*

3.To use FastVectorHighlighter, you seem to have to enable 3 term*
parameters in field declaration? I see only one is enabled. Please refer to
the answer in this stackoverflow question:
http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
*A) I have tried to enable all 3 terms in the FastVectorHighlighter too,
but the same problem persists as well.*


Regards,
Edwin


On 22 October 2015 at 16:25, Scott Chu  wrote:

> Hi solr-user,
>
> Can't judge the cause on fast glimpse of your definition but some
> suggestions I can give:
>
> 1. I took a look at Jieba. It uses a dictionary and it seems to do a good
> job on CJK. I suspect this problem may come from those filters (note: I can
> understand you may use CJKWidthFilter to convert Japanese, but I don't
> understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
> commenting out those filters, say leaving only Jieba and StopFilter, and seeing
> if this problem disappears?
>
> 2.Does this problem occur only on Chinese search words? Does it happen on
> English search words?
>
> 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
> parameters in field declaration? I see only one is enabled. Please refer to
> the answer in this stackoverflow question:
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
>
>
> Scott Chu,scott@udngroup.com
> 2015/10/22
>
> - Original Message -
> *From: *Zheng Lin Edwin Yeo 
> *To: *solr-user 
> *Date: *2015-10-20, 12:04:11
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Here's my schema.xml for content and title, which uses text_chinese. The
> problem only occurs in content, and not in title.
>
> <field name="content" type="text_chinese" indexed="true" stored="true"
> omitNorms="true" termVectors="true"/>
> <field name="title" type="text_chinese" indexed="true" stored="true"
> omitNorms="true" termVectors="true"/>
>
> <fieldType name="text_chinese" class="solr.TextField"
>     positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="...JiebaTokenizerFactory" segMode="SEARCH"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.CJKBigramFilterFactory"/>
>     <filter class="solr.StopFilterFactory"
>         words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="15"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="...JiebaTokenizerFactory" segMode="SEARCH"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.CJKBigramFilterFactory"/>
>     <filter class="solr.StopFilterFactory"
>         words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>   </analyzer>
> </fieldType>
>
>
>
> Here's my solrconfig.xml on the highlighting portion:
>
>   
>   
>explicit
>10
>json
>true
>   text
>   id, title, content_type, last_modified, url, score 
>
>   on
>id, title, content, author, tag
>   true
>true
>html
>   200
> true
> signature
> true
> 100
>   
>   
>
>  class="solr.highlight.BreakIteratorBoundaryScanner">
>  
> WORD
> en
> SG
>  
> 
>
>
> Meanwhile, I'll take a look at the articles too.
>
> Thank you.
>
> Regards,
> Edwin
>
>
> On 20 October 2015 at 11:32, Scott Chu <scott@udngroup.com> wrote:
>
> > Hi Edwin,
> >
> > I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
> > know) so I didn't experience this problem.
> >
> > I'd suggest you post your schema.xml so we can see how you define your
> > content field and the field type it uses?
> >
> > In the mean time, refer to these articles, maybe the answer or workaround
> > can be deducted from them.
> >
> > https://issues.apache.org/jira/browse/SOLR-3390
> >
> > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
> >
> > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> >
> > Good luck!
> >
> >
> >
> >
> > Scott Chu, scott@udngroup.com
> > 2015/10/20
> >
> > - Original Message -
> > *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > *To: *solr-user <solr-user@lucene.apache.org>
> > *Date: *2015-10-13, 17:04:29
> > *Subject: *Highlighting content field problem when using
> > JiebaTokenizerFactory
> >
> > Hi,
> >
> > I'm trying to use the JiebaTokenizerFactory to index Chinese characters
> in
> >
> > Solr. It works fine with the segmentation when I'm using
> > the Analysis function on the Solr Admin UI.
> >
> > However, when I tried to do the 

Split shard onto new physical volumes

2015-10-22 Thread Nikolay Shuyskiy

Hello.

We have a Solr 5.3.0 installation with ~4 TB index size, and the volume  
containing it is almost full. I hoped to utilize SolrCloud power to split  
index into two shards or Solr nodes, thus spreading index across several  
physical devices. But as I look closer, it turns out that splitting shard  
will create two new shards *on the same node* (and on the same storage  
volume), so it's not possible for more-than-a-half-full volume.


I imagined that I could, say, add two new nodes to SolrCloud, and split  
shard so that two new shards ("halves" of the one being split) will be  
created on those new nodes.


Right now the only way to split shard in my situation I see is to create  
two directories (shard_1_0 and shard_1_1) and mount new volumes onto them  
*before* calling SPLITSHARD. Then I would be able to split shards, and  
after adding two new nodes, these new shards will be replicated, and I'll  
be able to clean up all the data on the first node.


Please advise me on this, I hope I've missed something that would ease  
that kind of scaling.


--
Yrs sincerely,
 Nikolay Shuyskiy


Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Hi Alex,

My idea behind this is to avoid two calls: first the importer and then the
updater. As there is an update processor chain that can be used after the
DIH, I thought it was possible to get a real-time updater.

So, I am taking your advice and dividing the process into different steps. I
have the following configuration:


  
  
  
  



  
<updateRequestProcessorChain name="retrieveDetails">
  <processor class="..."/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update/details" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">retrieveDetails</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

So, after import (notice it does not contain the update.chain), I have
tried to run the update with the following request:
http://localhost:8983/solr/reed_jobs/update/details?commit=true
but it returns immediately with status 0 and does not execute the update...
How should the update be called to reindex/update all the imported docs
with my chain?


Best regards,
Roxana


On 22 October 2015 at 14:14, Alexandre Rafalovitch 
wrote:

> You are doing things out of order. It's DIH, URP, then indexer. Any
> attempt to subvert that order for the record being indexed will end in
> problems.
>
> Have you considered doing a dual path? Index, then update. Of course,
> your fields all need to be stored for that.
>
> Also, perhaps you need to rethink the problem on a higher level. If
> all you need to do is to extract tokenized content of a field during
> search, you can do that in several ways, such as faceting on that
> field, or - I believe - using terms end-point.
>
> Regards,
>   Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 22 October 2015 at 06:20, Roxana Danger
>  wrote:
> > Hello,
> >
> > I would like to create an updateRequestProcessorChain that should to be
> > executed after a DB DIH. I am extending UpdateRequestProcessorFactory and
> > the UpdateRequestProcessor classes. The method processAdd of my
> > UpdateRequestProcessor should be able to update the documents with  the
> > indexed terms associated to a field. Notice that these terms should have
> > been extracted with an analyzer before my updateRequestProcessorChain
> > processor begins to execute.
> >
> > The problem I am getting is that at the point where processAdd is
> executed
> > the field containing the terms has not been filled. To retrieve the
> terms I
> > am using the SolrIndexSearcher provided during the request
> > (req.getSearcher()). However, it seems that this searcher uses only the
> > data physically stored and does not consider any of the imported data.
> >
> > Any idea on how can I access to searcher with all indexed/cached data
> when
> > the processAdd method is executed?
> >
> > Thank you very much in advance.
>



-- 
Roxana Danger | Data Scientist, reed.co.uk



Re: Split shard onto new physical volumes

2015-10-22 Thread Shawn Heisey
On 10/22/2015 8:29 AM, Nikolay Shuyskiy wrote:
> I imagined that I could, say, add two new nodes to SolrCloud, and split
> shard so that two new shards ("halves" of the one being split) will be
> created on those new nodes.
> 
> Right now the only way to split shard in my situation I see is to create
> two directories (shard_1_0 and shard_1_1) and mount new volumes onto
> them *before* calling SPLITSHARD. Then I would be able to split shards,
> and after adding two new nodes, these new shards will be replicated, and
> I'll be able to clean up all the data on the first node.

The reason that they must be on the same node is because index splitting
is a *Lucene* operation, and Lucene has no knowledge of Solr nodes, only
the one index on the one machine.

Depending on the overall cloud distribution, one option *might* be to
add a replica of the shard you want to split to one or more new nodes
with plenty of disk space, and after it is replicated, delete it from
any nodes where the disk is nearly full.  Then do the split operation,
and once it's done, use ADDREPLICA/DELETEREPLICA to arrange everything
the way you want it.
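Concretely, that sequence might look like this against the Collections API
(the collection, shard, node, and replica names here are assumptions):

    # put a copy of the shard on a node with plenty of disk
    /admin/collections?action=ADDREPLICA&collection=coll1&shard=shard1&node=newhost:8983_solr
    # once it is active, drop the copy on the nearly-full node
    /admin/collections?action=DELETEREPLICA&collection=coll1&shard=shard1&replica=core_node1
    # then split on the roomier node
    /admin/collections?action=SPLITSHARD&collection=coll1&shard=shard1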

Thanks,
Shawn



Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-22 Thread Zheng Lin Edwin Yeo
Hi Scott,

Thank you for your response.

1. You said the problem only happens on the "contents" field, so maybe
there's something wrong with the contents of that field. Does it contain
anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions that
HTML stripping can cause highlight problems. Maybe you can try purifying that
field down to close-to-pure text and see if the highlighting comes out
ok.
*A) I check that the SOLR-42 is mentioning about the
HTMLStripWhiteSpaceTokenizerFactory, which I'm not using. I believe that
tokenizer is already deprecated too. I've tried with all kinds of content
for rich-text documents, and all of them have the same problem.*

2. Maybe something incompatible between JiebaTokenizer and the Solr
highlighter. You could switch to other tokenizers, e.g. Standard, CJK,
SmartChinese (I don't use this since I am dealing with Traditional Chinese
but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg,
and see if the problem goes away. However, when googling similar problems, I
saw you asked the same question in August at Huaban/Jieba-analysis, and
somebody said he also uses JiebaTokenizer but doesn't have your problem. So I
see this as a less likely suspect.
*A) I was thinking about the incompatibility issue too, as I previously
thought that JiebaTokenizer was optimised for Solr 4.x, so it might have
issues in 5.x. But the person from Huaban/Jieba-analysis said that he doesn't
have this problem in Solr 5.1. I also faced the same problem in Solr 5.1, and
although I'm using Solr 5.3.0 now, the same problem persists. *

I'm looking at the indexing process too, to see if there's any problem
there. But I just can't figure out why it only happens with JiebaTokenizer,
and only for the content field.


Regards,
Edwin


On 23 October 2015 at 09:41, Scott Chu  wrote:

> Hi Edwin,
>
> Since you've tested all my suggestions and the problem is still there, I
> can't think of anything wrong with your configuration. Now I can only
> suspect two things:
>
> 1. You said the problem only happens on the "contents" field, so maybe
> there's something wrong with the contents of that field. Does it contain
> anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions that
> HTML stripping can cause highlight problems. Maybe you can try purifying that
> field down to close-to-pure text and see if the highlighting comes out ok.
>
> 2. Maybe something incompatible between JiebaTokenizer and the Solr
> highlighter. You could switch to other tokenizers, e.g. Standard, CJK,
> SmartChinese (I don't use this since I am dealing with Traditional Chinese
> but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg,
> and see if the problem goes away. However, when googling similar problems, I
> saw you asked the same question in August at Huaban/Jieba-analysis, and
> somebody said he also uses JiebaTokenizer but doesn't have your problem. So I
> see this as a less likely suspect.
>
> The theory of your problem could be that something in the indexing process
> causes wrong position info for that field, and when Solr does the
> highlighting, it retrieves the wrong position info and marks the wrong
> positions of the highlight target terms.
>
> Scott Chu,scott@udngroup.com
> 2015/10/23
>
> - Original Message -
> *From: *Zheng Lin Edwin Yeo 
> *To: *solr-user 
> *Date: *2015-10-22, 22:22:14
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Thank you for your response and suggestions.
>
> In response to your questions, here are the answers:
>
> 1. I took a look at Jieba. It uses a dictionary and it seems to do a good
> job on CJK. I suspect this problem may come from those filters (note: I can
> understand you may use CJKWidthFilter to convert Japanese, but I don't
> understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
> commenting out those filters, say leaving only Jieba and StopFilter, and seeing
> if this problem disappears?
> *A) Yes, I have tried commenting out the other filters and only left with
> Jieba and StopFilter. The problem is still there.*
>
> 2.Does this problem occur only on Chinese search words? Does it happen on
> English search words?
> *A) Yes, the same problem occurs on English words. For example, when I
> search for "word", it will highlight in this way:  word*
>
> 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
> parameters in field declaration? I see only one is enabled. Please refer to
> the answer in this stackoverflow question:
>
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> *A) I have tried to enable all 3 terms in the FastVectorHighlighter too,
> but the same problem persists as well.*
>
>
> Regards,
> Edwin
>
>
> On 22 October 2015 at 16:25, Scott Chu <scott@udngroup.com> wrote:
>
> > Hi solr-user,
> >
> > Can't judge the cause on 

EdgeNGramFilterFactory for Chinese characters

2015-10-22 Thread Zheng Lin Edwin Yeo
Hi,

Would like to check, is it good to use EdgeNGramFilterFactory for indexes
that contains Chinese characters?
Will it affect the accuracy of the search for Chinese words?

I have rich-text documents that are in both English and Chinese, and
currently I have EdgeNGramFilterFactory enabled during indexing, as I need
it for partial matching for English words. But this means it will also
break up each of the Chinese characters into different tokens.

I'm using the HMMChineseTokenizerFactory for my tokenizer.

Thank you.

Regards,
Edwin


Re: Solr fails to start with log file not found error

2015-10-22 Thread Shawn Heisey
On 10/22/2015 10:49 PM, awhosit wrote:
> Not working one is solr 5.2.1/SLES 12.
> But I have working one with solr 5.2.1/SLES 11 and solr 5.2.1/Ubuntu 14.
> 
> From the log left in sol-8983-console.log is as follow.
> I'm using OpenJDK 1.7 as follow.
> 
> java version "1.7.0_85"
> OpenJDK Runtime Environment (IcedTea 2.6.1) (suse-18.2-x86_64)
> OpenJDK 64-Bit Server VM (build 24.85-b03, mixed mode)
> 
> 
> Exception in thread "main" java.lang.UnsupportedClassVersionError:
> org/eclipse/jetty/start/Main : Unsupported major.minor version 51.0

This error message means you're trying to start Solr with Java 6, not
Java 7.  You might have OpenJDK 7 on your computer, but you also have an
older version, and the older version is the one that is being used.

I understand that suse uses RPM, so run this command to see if Java 6 is
installed with a package:

rpm -qa | egrep "(java|jdk)"

If it is, then you should be able to uninstall it and let version 7 have
center stage.
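On SLES that would be something like the following (the exact package name is
an assumption; take it from the rpm -qa output above):

    rpm -e java-1_6_0-openjdk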

On a CentOS 6 system, I get this output by running that command.  I've
got Java 8 installed on this system:

[root@mcp ~]# rpm -qa | egrep "(java|jdk)"
java-1.8.0-openjdk-1.8.0.51-3.b16.el6_7.x86_64
tzdata-java-2015f-1.el6.noarch
java_cup-0.10k-5.el6.x86_64
java-1.5.0-gcj-devel-1.5.0.0-29.1.el6.x86_64
java-1.5.0-gcj-1.5.0.0-29.1.el6.x86_64
java-1.8.0-openjdk-devel-1.8.0.51-3.b16.el6_7.x86_64
gcc-java-4.4.7-16.el6.x86_64
java-1.8.0-openjdk-headless-1.8.0.51-3.b16.el6_7.x86_64
[root@mcp ~]# java -version
openjdk version "1.8.0_51"
OpenJDK Runtime Environment (build 1.8.0_51-b16)
OpenJDK 64-Bit Server VM (build 25.51-b03, mixed mode)

Thanks,
Shawn



Get this committed

2015-10-22 Thread William Bell
I can confirm this is working in PROD at 100M hits a day.

Can we commit it please? Begging here.

https://issues.apache.org/jira/browse/SOLR-7993

-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Solr fails to start with log file not found error

2015-10-22 Thread awhosit
Hi, I'm a newbie on Solr, but I have the same issue.

More precisely, only one machine can't start Solr, with the message "cannot
open {solr.log} file for reading: No such file or directory." Obviously
there is no such file, and even when I created an empty one, it didn't help.

I've also tried moving the directory here and there, changing the
owner/group (root/myself/solr:solr), using install_solr_service.sh or just
unzipping, etc., but this machine still won't start.

Not working one is solr 5.2.1/SLES 12. 
But I have working one with solr 5.2.1/SLES 11 and solr 5.2.1/Ubuntu 14.

The log left in solr-8983-console.log is as follows.
I'm using OpenJDK 1.7, as follows.

java version "1.7.0_85"
OpenJDK Runtime Environment (IcedTea 2.6.1) (suse-18.2-x86_64)
OpenJDK 64-Bit Server VM (build 24.85-b03, mixed mode)


Exception in thread "main" java.lang.UnsupportedClassVersionError:
org/eclipse/jetty/start/Main : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(Unknown Source)
at java.lang.ClassLoader.defineClass(Unknown Source)
at java.security.SecureClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.access$000(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
Could not find the main class: org.eclipse.jetty.start.Main. Program will
exit.







Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Dear Mikhail,
Thank you very much for your advice. I have tried it, but the realTimeSearcher
didn't help...
This may look very silly but: can a commit be called with
RunUpdateProcessorFactory? Can I use it twice in an
updateRequestProcessorChain?
Thank you very much again,
Roxana



On 22 October 2015 at 13:08, Mikhail Khludnev 
wrote:

> Hello Roxana,
>
> I feel it's almost impossible. I can only suggest to commit to make new
> terms visible.
> There is SolrCore.getRealtimeSearcher() but I never understood what it
> does.
>
> On Thu, Oct 22, 2015 at 1:20 PM, Roxana Danger <
> roxana.dan...@reedonline.co.uk> wrote:
>
> > Hello,
> >
> > I would like to create an updateRequestProcessorChain that should be
> > executed after a DB DIH. I am extending UpdateRequestProcessorFactory and
> > the UpdateRequestProcessor classes. The method processAdd of my
> > UpdateRequestProcessor should be able to update the documents with  the
> > indexed terms associated to a field. Notice that these terms should have
> > been extracted with an analyzer before my updateRequestProcessorChain
> > processor begins to execute.
> >
> > The problem I am getting is that at the point where processAdd is
> executed
> > the field containing the terms has not been filled. To retrieve the
> terms I
> > am using the SolrIndexSearcher provided during the request
> > (req.getSearcher()). However, it seems that this searcher uses only the
> > data physically stored and does not consider any of the imported data.
> >
> > Any idea on how can I access to searcher with all indexed/cached data
> when
> > the processAdd method is executed?
> >
> > Thank you very much in advance.
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>



-- 
Roxana Danger | Data Scientist, reed.co.uk



Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Alexandre Rafalovitch
You are doing things out of order. It's DIH, URP, then indexer. Any
attempt to subvert that order for the record being indexed will end in
problems.

Have you considered doing a dual path? Index, then update. Of course,
your fields all need to be stored for that.

Also, perhaps you need to rethink the problem on a higher level. If
all you need to do is to extract tokenized content of a field during
search, you can do that in several ways, such as faceting on that
field, or - I believe - using terms end-point.

Regards,
  Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 22 October 2015 at 06:20, Roxana Danger
 wrote:
> Hello,
>
> I would like to create an updateRequestProcessorChain that should be
> executed after a DB DIH. I am extending UpdateRequestProcessorFactory and
> the UpdateRequestProcessor classes. The method processAdd of my
> UpdateRequestProcessor should be able to update the documents with  the
> indexed terms associated to a field. Notice that these terms should have
> been extracted with an analyzer before my updateRequestProcessorChain
> processor begins to execute.
>
> The problem I am getting is that at the point where processAdd is executed
> the field containing the terms has not been filled. To retrieve the terms I
> am using the SolrIndexSearcher provided during the request
> (req.getSearcher()). However, it seems that this searcher uses only the
> data physically stored and does not consider any of the imported data.
>
> Any idea on how can I access to searcher with all indexed/cached data when
> the processAdd method is executed?
>
> Thank you very much in advance.


Re: Index Multiple entity in one collection core

2015-10-22 Thread Alexandre Rafalovitch
When you run a full-import, Solr will try to delete old documents
before importing the new ones. If there are several top-level entities,
they step on each other's feet.

Use preImportDeleteQuery to avoid that (as per
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler
).
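A sketch of what that can look like (the discriminator field entity_type and
its values are assumptions, not from your config):

    <entity name="table1"
            preImportDeleteQuery="entity_type:table1"
            transformer="TemplateTransformer"
            query="select table1_id, ... from table1">
      <field column="entity_type" template="table1"/>
      <!-- remaining field mappings as before -->
    </entity>

Each top-level entity then only deletes its own documents on full-import.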

You can test that by running an indexer on just table1 entity and
seeing if things get indexed.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 22 October 2015 at 00:25, anurupborah2001  wrote:
> Hi,
> I am having difficulty indexing multiple entities in one
> collection. When I try to index, only the entity defined last gets
> indexed. Please help, as I am having a hard time solving it.
> Below are the configs:
> --
> data-config.xml
> --
> 
> 
>  type="SimplePropertiesWriter" filename="demo.properties" />
>
>  zeroDateTimeBehavior="convertToNull"
>   name="ds-1"
>   driver="com.mysql.jdbc.Driver"
>   url="jdbc:mysql://127.0.0.1/demo"
>   batchSize="-1"
>   user="root"
>   autoCommit="true"
>   password="password" />
> 
>  pk="table1_id"
> dataSource="ds-1"
>
> transformer="HTMLStripTransformer,RegexTransformer,TemplateTransformer,DateFormatTransformer,script:GenerateId,LogTransformer"
> logTemplate="The demo is ${table1.table1_id}" 
> logLevel="info"
> query="select
> table1_id,table1_desc,table1_flag,DATE_FORMAT(table1_date_updated,'%Y-%m-%dT%TZ')
> from table1 Where table1_flag=1 AND '${dih.request.clean}' != 'false' OR
> table1_date_updated > '${dih.table1.last_index_time}'"
> >
> 
> 
>  stripHTML="true"/>
>  name="solr_table1_date_updated_dt"
> dateTimeFormat="-MM-dd'T'HH:mm:ss" locale="en" />
> 
>  pk="table2_id"
> dataSource="ds-1"
>
> transformer="HTMLStripTransformer,RegexTransformer,TemplateTransformer,DateFormatTransformer,script:GenerateId,LogTransformer"
> logTemplate="The table2 is ${table2.table2_id}" 
> logLevel="info"
> query="select
> table2_id,table2_name,table2_flag,DATE_FORMAT(table2_date_updated,'%Y-%m-%dT%TZ')
> from table2 Where table2_flag=1 AND '${dih.request.clean}' != 'false' OR
> table2_date_updated > '${dih.table2.last_index_time}'"
> >
> 
> 
> 
>  pk="table3_id,table3_frid"
>
> transformer="HTMLStripTransformer,RegexTransformer,DateFormatTransformer,script:GenerateId,LogTransformer"
> logTemplate="The table3 is 
> ${table3.table3_id}" logLevel="info"
> query="select
> table3_id,table3_frid,table3_name,table3_desc,table3_subdesc,table3_keyword,table3_flag,DATE_FORMAT(table3_date_updated,'%Y-%m-%dT%TZ')
> from table3 Where table3_frid=${table1.table1_id} AND table3_flag=1"
> >
>  name="solr_table3_name"/>
>  name="solr_table3_desc" stripHTML="true"/>
>  name="solr_table3_subdesc"
> stripHTML="true"/>
>  name="solr_table3_keyword"/>
>  name="solr_table3_date_updated_dt"
> dateTimeFormat="-MM-dd'T'HH:mm:ss" locale="en"/>
> 
> 
>
>
> 
> 
>
>
> schema.xml
> --
>   />
> 
>
> singlekey
>
>
>
>  multiValued="false"  />
>  stored="true"  multiValued="false"  />
>
>
> 
>  multiValued="true" />
>  stored="true" multiValued="true" />
>  multiValued="true" />
>  multiValued="true" />
>  multiValued="false" stored="true" />
>
>
>
> Please kindly help me with this... I am not able to index table1; instead,
> table2 and table3 (which are one-to-many relationship tables) are getting
> indexed, but table1 is not.
>
> Thanks for help in advance
> Regards
> Anurup
>
>
>
>
>
>


Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Hi Erik,

Thanks for the links, but the analyzers are called correctly. The problem
is that I need to get access to the whole set of terms through a searcher,
but the request searcher cannot retrieve any terms because the commit
method has not been called yet.

My idea behind this is to avoid two calls: first the importer and then the
updater. As there is an update processor chain that can be used after the
DIH, I thought it was possible to get a real-time updater. My specific
problem is: given a set of texts, I need to make a Solr index and add to
this index a graph containing certain dependences between the indexed
terms. So, my idea was to use an import (associating the appropriate
analyzer to the textual field) and an updateProcessorChain that would:
first, construct the graph, and then, add a field to link a document with a
graph node.

However, this does not seem to be a good approach (see Alexandre's reply),
and I am trying to call the importer and updater sequentially. Any other
proposals for avoiding the double call are welcome!

I am also having trouble calling the updateProcessor to make all the changes
on the imported documents.

Thank you very much,
Roxana




On 22 October 2015 at 15:10, Erik Hatcher  wrote:

> Roxana -
>
> What is the purpose of doing this?  (that’ll help guide the best approach)
>
> It can be quite handy to get the terms from analysis into a field as
> stored values and to separate terms into separate fields and such.  Here’s
> a presentation where I detailed an update script trick that accomplishes
> this:
> http://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks <
> http://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks>
>
> Within Solr, the example/files area has this very trick implemented to
> pull out URLs and e-mail addresses from full text into separate specific
> fields.  See
> http://svn.apache.org/repos/asf/lucene/dev/branches/branch_5x/solr/example/files/conf/update-script.js
> <
> http://svn.apache.org/repos/asf/lucene/dev/branches/branch_5x/solr/example/files/conf/update-script.js>
> (“var analyzer = “… and below)
>
> Does that trick accomplish what you need?   If not, please detail what
> you’re after and we’ll try to help.
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com 
>
>
>
>
> > On Oct 22, 2015, at 6:20 AM, Roxana Danger <
> roxana.dan...@reedonline.co.uk> wrote:
> >
> > Hello,
> >
> > I would like to create an updateRequestProcessorChain that should be
> > executed after a DB DIH. I am extending UpdateRequestProcessorFactory and
> > the UpdateRequestProcessor classes. The method processAdd of my
> > UpdateRequestProcessor should be able to update the documents with  the
> > indexed terms associated to a field. Notice that these terms should have
> > been extracted with an analyzer before my updateRequestProcessorChain
> > processor begins to execute.
> >
> > The problem I am getting is that at the point where processAdd is
> executed
> > the field containing the terms has not been filled. To retrieve the
> terms I
> > am using the SolrIndexSearcher provided during the request
> > (req.getSearcher()). However, it seems that this searcher uses only the
> > data physically stored and does not consider any of the imported data.
> >
> > Any idea on how can I access to searcher with all indexed/cached data
> when
> > the processAdd method is executed?
> >
> > Thank you very much in advance.
>
>


-- 
Roxana Danger | Data Scientist, reed.co.uk



Re: Zookeeper Quorum leader election

2015-10-22 Thread Erick Erickson
Thanks for adding that to our collective knowledge store!
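For anyone hitting the same thing: the ZooKeeper start scripts source
conf/java.env if it exists and honor JVMFLAGS, so one way to apply the flag
(a sketch, assuming a stock ZooKeeper install) is:

    # conf/java.env
    export JVMFLAGS="-Djava.net.preferIPv4Stack=true $JVMFLAGS"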

On Thu, Oct 22, 2015 at 2:44 AM, Arcadius Ahouansou
 wrote:
> The leader election issue we were having was solved by passing
>
> -Djava.net.preferIPv4Stack=true
>
> to zookeeper startup script
>
> It seems our Linux servers have IPv6 enabled but we have no IPv6 network.
>
> Hope this helps others.
>
> Arcadius.
>
>
> On 4 September 2015 at 04:57, Arcadius Ahouansou 
> wrote:
>
>>
>> We have a quorum of 3 ZK nodes zk1, zk2 and zk3.
>> All nodes are identicals.
>>
>> After multiple restart of the ZK nodes, always keeping the majority of 2,
>> we have noticed that the node zk1 has never become the leader.
>> Only zk2 and zk3 become leader.
>>
>> 1) Is there any know reason or possible misconfiguration that may cause
>> zk1 to never become a leader? (note that we do not have any zk group
>> set-up)
>>
>> 2) Is there any command or way to force zk1 to become leader of the quorum?
>>
>> Note that all connected SolrJ clients have zkhost set to
>> zk1:port,zk2:port,zk3:port
>>
>> Thanks.
>>
>>
>> --
>> Arcadius Ahouansou
>> Menelic Ltd | Information is Power
>> M: 07908761999
>> W: www.menelic.com
>> ---
>>
>
>
>
> --
> Arcadius Ahouansou
> Menelic Ltd | Information is Power
> M: 07908761999
> W: www.menelic.com
> ---


Re: How to get the join data by multiple cores?

2015-10-22 Thread Erick Erickson
You will NOT get the stored fields from the child record
with the join operation; it's called "pseudo join" for a
good reason.

It's usually a mistake to try to force Solr to perform just
like a database. I would seriously consider flattening
(denormalizing) the data if at all possible.

Best,
Erick
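In concrete terms, the pseudo-join suggested in the reply quoted below would be
issued as (core and field names as in the example):

    http://localhost:8983/solr/child/select?q={!join fromIndex=parent from=id to=parent_id}tag:hoge

This runs tag:hoge against the parent core, collects the matching parents'
id values, and returns child documents whose parent_id matches one of them;
only the queried core's fields come back, per the caveat above.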

On Wed, Oct 21, 2015 at 10:36 PM, cai xingliang  wrote:
> {!join fromIndex=parent from=id to=parent_id}tag:hoge
>
> That should work.
> On Oct 22, 2015 12:35 PM, "Shuhei Suzuki"  wrote:
>
>> hello,
>> What can I do to throw a query such as the following in Solr?
>>
>>  SELECT
>>   child. *, parent. *
>>  FROM child
>>  JOIN parent
>>  WHERE child.parent_id = parent.id AND parent.tag = 'hoge'`
>>
>> child and parent are in a many-to-one relationship (many children per
>> parent).
>> I try this but can not.
>>
>>  /select/?q={!join from=parent_id to=id fromIndex=parent}id:1+tag:hoge
>>
>>
>>
>>
>>
>>


Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Jack Krupansky
It is still not clear what problem you are really trying to solve. This is
what we call an XY problem - you are focusing on your intended solution but
not describing the original, underlying problem, the application itself.
IOW, there may be a much more appropriate solution for us to suggest if
only we knew what your problem really is.

-- Jack Krupansky

On Thu, Oct 22, 2015 at 11:18 AM, Roxana Danger <
roxana.dan...@reedonline.co.uk> wrote:

> Hi Erik,
>
> Thanks for the links, but the analyzers are called correctly. The problem
> is that I need to get access to the whole set of terms through a searcher,
> but the request searcher cannot retrieve any terms because the commit
> method has not been called yet.
>
> My idea behind this is to avoid two calls: first the importer and then the
> updater. As there is an update processor chain that can be used after the
> DIH, I thought it was possible to get a real-time updater. My specific
> problem is: given a set of texts, I need to make a solr index and add to
> this index a graph containing certain dependences between the indexed
> terms. So, my idea was to use an import (associating the appropriate
> analyzer to the textual field) and use the updateProcessorchain that:
> first, construct the graph, and then, add a field to link a document with a
> graph node.
>
> However, this does not seem to be a good approach (see Alexandre's reply),
> and I am trying to call the importer and updater sequentially. Any other
> proposals for avoiding the double call are welcome!
>
> I am also having trouble calling the updateProcessor to make all the changes
> on the imported documents.
>
> Thank you very much,
> Roxana
>
>
>
>
> On 22 October 2015 at 15:10, Erik Hatcher  wrote:
>
> > Roxana -
> >
> > What is the purpose of doing this?  (that’ll help guide the best
> approach)
> >
> > It can be quite handy to get the terms from analysis into a field as
> > stored values and to separate terms into separate fields and such.
> Here’s
> > a presentation where I detailed an update script trick that accomplishes
> > this:
> > http://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks
> <
> > http://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks>
> >
> > Within Solr, the example/files area has this very trick implemented to
> > pull out URLs and e-mail addresses from full text into separate specific
> > fields.  See
> >
> http://svn.apache.org/repos/asf/lucene/dev/branches/branch_5x/solr/example/files/conf/update-script.js
> > <
> >
> http://svn.apache.org/repos/asf/lucene/dev/branches/branch_5x/solr/example/files/conf/update-script.js
> >
> > (“var analyzer = “… and below)
> >
> > Does that trick accomplish what you need?   If not, please detail what
> > you’re after and we’ll try to help.
> >
> > —
> > Erik Hatcher, Senior Solutions Architect
> > http://www.lucidworks.com 
> >
> >
> >
> >
> > > On Oct 22, 2015, at 6:20 AM, Roxana Danger <
> > roxana.dan...@reedonline.co.uk> wrote:
> > >
> > > Hello,
> > >
> > > I would like to create an updateRequestProcessorChain that should be
> > > executed after a DB DIH. I am extending UpdateRequestProcessorFactory
> and
> > > the UpdateRequestProcessor classes. The method processAdd of my
> > > UpdateRequestProcessor should be able to update the documents with  the
> > > indexed terms associated to a field. Notice that these terms should
> have
> > > been extracted with an analyzer before my updateRequestProcessorChain
> > > processor begins to execute.
> > >
> > > The problem I am getting is that at the point where processAdd is
> > executed
> > > the field containing the terms has not been filled. To retrieve the
> > terms I
> > > am using the SolrIndexSearcher provided during the request
> > > (req.getSearcher()). However, it seems that this searcher uses only the
> > > data physically stored and does not consider any of the imported data.
> > >
> > > Any idea on how can I access to searcher with all indexed/cached data
> > when
> > > the processAdd method is executed?
> > >
> > > Thank you very much in advance.
> >
> >
>
>
> --
> Roxana Danger | Data Scientist Dragon Court, 27-29 Macklin Street, London,
> WC2B 5LX Tel: 020 7067 4568 [image: reed.co.uk] 
> The
> UK's #1 job site.  [image: Follow us on Twitter]
> 
>  [image:
> Like us on Facebook] 
>  It's time to Love Mondays »
> 
>


OOM on solr cloud 5.2.1, does not trigger oom_solr.sh

2015-10-22 Thread Raja Pothuganti
Hi,

Sometimes I see an OOM happening on replicas, but it does not trigger the
script oom_solr.sh that was passed in as
-XX:OnOutOfMemoryError=/actualLocation/solr/bin/oom_solr.sh 8091.

These OOMs happened while DIH was importing data from a database. Is this a
known issue? Is there any quick fix?

Here are stack traces when OOM happened


1)
org.apache.solr.common.SolrException; null:java.lang.RuntimeException:
java.lang.OutOfMemoryError: Java heap space
at 
org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:593)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:465)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java
:227)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java
:196)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandle
r.java:1652)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:14
3)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.jav
a:223)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.jav
a:1127)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java
:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java
:1061)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:14
1)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHan
dlerCollection.java:215)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection
.java:110)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:
97)
at org.eclipse.jetty.server.Server.handle(Server.java:497)
at 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java
:635)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:
555)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space



2)
org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: Exception writing document id
R277453962 to the index; possible analysis error.
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.jav
a:167)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdatePro
cessorFactory.java:69)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRe
questProcessor.java:51)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(Dist
ributedUpdateProcessor.java:955)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(Dist
ributedUpdateProcessor.java:1110)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(Dist
ributedUpdateProcessor.java:706)
at 
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdatePro
cessorFactory.java:104)
at 
org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:10
1)
at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterM
ostDocIterator(JavaBinUpdateRequestCodec.java:179)
at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterat
or(JavaBinUpdateRequestCodec.java:135)
at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:241)
at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedL
ist(JavaBinUpdateRequestCodec.java:121)
at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:206)
at 
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:126)
at 
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(Ja
vaBinUpdateRequestCodec.java:186)
at 
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader
.java:111)
at 
org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
at 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.ja
va:98)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentS
treamHandlerBase.java:74)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase
.java:143)
at 

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Alexandre Rafalovitch
Well, I guess I imagined three steps:
1) Run DIH
2) Get the tokenized representation for each document using facets or
other approaches
3) Submit document partial-update request with additional custom
processing through URP

Your example seems to be skipping step 2, so the URP chain does not
know which documents to actually work on and is basically an empty
call.
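For step 3, the partial update could be a plain atomic-update request (a
sketch; the collection name reuses reed_jobs from earlier in the thread, and
the document id and field names are assumptions):

    curl 'http://localhost:8983/solr/reed_jobs/update?commit=true' \
      -H 'Content-Type: application/json' \
      -d '[{"id":"doc1", "graph_node":{"set":"node42"}}]'

Atomic updates do require the other fields to be stored, as noted earlier.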

Again, I suspect knowing the business objectives may bring other
solutions to the front.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 22 October 2015 at 10:49, Roxana Danger
 wrote:
> Hi Alex,
>
> My idea behind this is to avoid two calls: first the importer and then the
> updater. As there is an update processor chain that can be used after the
> DIH, I thought it was possible to get a real-time updater.
>
> So, I am taking your advice and dividing the process into different steps. I
> have the following configuration:
>
> <updateRequestProcessorChain name="retrieveDetails">
>   <processor class="..."/>
>   <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>
>
> <requestHandler name="/update/details" class="solr.UpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.chain">retrieveDetails</str>
>   </lst>
> </requestHandler>
>
> <requestHandler name="/dataimport"
>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">db-data-config.xml</str>
>   </lst>
> </requestHandler>
>
> So, after import (notice it does not contain the update.chain), I have
> tried to run the update with the following request:
> http://localhost:8983/solr/reed_jobs/update/details?commit=true
> but it returns immediately with status 0 and does not execute the update...
> How should the update be called to reindex/update all the imported docs
> with my chain?
>
>
> Best regards,
> Roxana
>
>
> On 22 October 2015 at 14:14, Alexandre Rafalovitch 
> wrote:
>
>> You are doing things out of order. It's DIH, URP, then indexer. Any
>> attempt to subvert that order for the record being indexed will end in
>> problems.
>>
>> Have you considered doing a dual path? Index, then update. Of course,
>> your fields all need to be stored for that.
>>
>> Also, perhaps you need to rethink the problem on a higher level. If
>> all you need to do is to extract tokenized content of a field during
>> search, you can do that in several ways, such as faceting on that
>> field, or - I believe - using terms end-point.
>>
>> Regards,
>>   Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 22 October 2015 at 06:20, Roxana Danger
>>  wrote:
>> > Hello,
>> >
>> > I would like to create an updateRequestProcessorChain that should be
>> > executed after a DB DIH. I am extending UpdateRequestProcessorFactory and
>> > the UpdateRequestProcessor classes. The method processAdd of my
>> > UpdateRequestProcessor should be able to update the documents with  the
>> > indexed terms associated to a field. Notice that these terms should have
>> > been extracted with an analyzer before my updateRequestProcessorChain
>> > processor begins to execute.
>> >
>> > The problem I am getting is that at the point where processAdd is
>> executed
>> > the field containing the terms has not been filled. To retrieve the
>> terms I
>> > am using the SolrIndexSearcher provided during the request
>> > (req.getSearcher()). However, it seems that this searcher uses only the
>> > data physically stored and does not consider any of the imported data.
>> >
>> > Any idea on how can I access to searcher with all indexed/cached data
>> when
>> > the processAdd method is executed?
>> >
>> > Thank you very much in advance.
>>
>
>
>
> --
> Roxana Danger | Data Scientist Dragon Court, 27-29 Macklin Street, London,
> WC2B 5LX Tel: 020 7067 4568 [image: reed.co.uk]  The
> UK's #1 job site.  [image: Follow us on Twitter]
> 
>  [image:
> Like us on Facebook] 
>  It's time to Love Mondays »
> 


Highlighting queries in parentheses

2015-10-22 Thread Michał Słomkowski

Hello,

recently I've deployed Solr 5.2.1 and I've observed the following issue:

My documents have two fields: id and text. Solr is configured to use 
FastVectorHighlighter (I've tried StandardHighlighter too, no 
difference). I've created the schema.xml, solrconfig.xml hasn't been 
changed in any way.


I have the following highlighting query: text:((foo AND bar) OR eggs).
Let's say the document contains only bar and eggs. Currently both of
them are highlighted. However, the desired behaviour is to highlight eggs
only, since (foo AND bar) is not satisfied.


The query I send has following parameters:

'fl': 'id',
'hl': 'true',
'hl.requireFieldMatch': 'true',
'hl.fragListBuilder': 'single',
'hl.fragsize': '0',
'hl.fl': 'text',
'hl.mergeContiguous': 'true',
'hl.useFastVectorHighlighter': 'true',
'hl.q': 'text:((foo AND bar) OR eggs)'

I'd like to know what I should do to make it work as expected.





Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Erik Hatcher
Setting “update.chain” in the DataImportHandler handler defined in 
solrconfig.xml should allow you to specify which update chain is used.  Can you 
confirm that works, Shawn?

This is from DataImportHandler.java:

  UpdateRequestProcessorChain processorChain =
req.getCore().getUpdateProcessorChain(params); 
  UpdateRequestProcessor processor = processorChain.createProcessor(req, rsp);
  SolrResourceLoader loader = req.getCore().getResourceLoader();
  DIHWriter sw = getSolrWriter(processor, loader, requestParams, req);






> On Oct 22, 2015, at 12:19 PM, Shawn Heisey  wrote:
> 
> On 10/22/2015 10:09 AM, Roxana Danger wrote:
>> The DIH is executed correctly and the tokenized representation is obtained
>> correctly, but the URP chain is not executed with the call:
>> http://localhost:8983/solr/reed_jobs/update/details?commit=true
>> Isn't it the correct URL? is there any parameter missing?
> 
> The only way I've found to get an update chain to be used by DIH is to
> make it default.  From what I've been able to determine, and I have not
> verified this, DIH does *not* use the update request handler(s) ... it
> updates the index in a more direct manner.
> 
> If there is a way to use a custom chain (not the default) with DIH, I'd
> love to know about it.
> 
> Thanks,
> Shawn
> 



Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Shawn Heisey
On 10/22/2015 10:32 AM, Erik Hatcher wrote:
> Setting “update.chain” in the DataImportHandler handler defined in 
> solrconfig.xml should allow you to specify which update chain is used.  Can 
> you confirm that works, Shawn?

I tried this a couple of years ago without luck.  Does it work now?

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201308.mbox/%3c6c93c1a4-63ac-4cad-9f5b-c74f497c6...@gmail.com%3E

In the first email of the thread, I indicated I had tried 4.4 and
4.5-SNAPSHOT.

Thanks,
Shawn



Solrcloud (4.10) reports the end of soft commit before all shard replicas finished committing

2015-10-22 Thread vsolakhian
We have a strange behavior in our Solrcloud-related code after upgrading
from Solr 4.4 to Solr 4.10 (as part of upgrading from Cloudera CDH 4.6 to
Cloudera CDH 5.4.5).

We have a Solrcloud collection with three replicas of one shard.

Our code does batch indexing, then submits a soft commit request,using
SolrJ's

org.apache.solr.client.solrj.impl.CloudSolrServer.commit(waitFlush,
waitSearcher, softCommit)

method with:

waitFlush = true; waitSearcher = true; softCommit = true

After this synchronous commit invocation returns, we submit a query for the
newly indexed data.
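In SolrJ terms the sequence is roughly the following (a sketch; the zkHost
string is an assumption, and the query value is taken from the logs below):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CommitThenQuery {
      public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("CloudCollection");
        // ... batch indexing via server.add(...) happens here ...
        server.commit(true, true, true); // waitFlush, waitSearcher, softCommit
        // expected: the batch is visible; observed: sometimes zero hits
        QueryResponse rsp = server.query(new SolrQuery("file_metadata_id:\"123413\""));
        System.out.println("hits: " + rsp.getResults().getNumFound());
        server.shutdown();
      }
    }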

Starting with Solr 4.10 we noticed that the query sometimes returns zero
results (in about 70% of tests).

In one of these cases logs (modified to make shorter) from our application
and from Solr servers show:

*1. Our application:*

  - 13:33:49:602 INFO Sending index commit request to CloudCollection
  - 13:35:49:752 INFO Finished index commit in 120110 millisecs
  - 13:35:50:349 INFO Sent query

*2. Solr server 1 (replica1):*

  - 13:33:49,612 INFO org.apache.solr.update.UpdateHandler: start commit ...
  - 13:34:23,486 INFO org.apache.solr.core.SolrCore: ===SolrEventListener
started warmup: event = newSearcher
  - 13:35:35,701 INFO org.apache.solr.core.SolrCore: ===SolrEventListener
done with newSearcher: totalTime = 72215  totalWarmupTime = 63349
  - 13:35:35,703 INFO org.apache.solr.core.SolrCore:
[CloudCollection_shard1_replica1] Registered new searcher
Searcher@4daaebbf[CloudCollection_shard1_replica1

*3. Solr server 2 (replica2):*

  - 13:33:49,604 INFO org.apache.solr.update.UpdateHandler: start commit ...
  - 13:35:49,163 INFO org.apache.solr.core.SolrCore: ===SolrEventListener
started warmup: event = newSearcher
  - 13:37:15,627 INFO org.apache.solr.core.SolrCore: ===SolrEventListener
done with newSearcher: totalTime = 86463  totalWarmupTime = 76713
  - 13:37:15,632 INFO org.apache.solr.core.SolrCore:
[CloudCollection_shard1_replica2] Registered new searcher
Searcher@57ddbc57[CloudCollection_shard1_replica2]

*4. Solr server 3 (replica3):*

  - 13:33:49,601 INFO org.apache.solr.update.UpdateHandler: start commit ...
  - 13:35:24,525 INFO org.apache.solr.core.SolrCore: ===SolrEventListener
started warmup: event = newSearcher

/--> QUERY IS RECEIVED HERE/

2015-10-21 13:35:50,416 INFO org.apache.solr.core.SolrCore.Request:
[CloudCollection_shard1_replica3] webapp=/solr path=/select
params={facet=true==1==0=file_metadata_id:"123413"=xml=standard=customer_company_id:"1010112"=customer_account_id:"1185005"=2.2=0}
hits=0 status=0 QTime=70

  - 13:36:43,535 INFO org.apache.solr.core.SolrCore: ===SolrEventListener
done with newSearcher: totalTime = 79009  totalWarmupTime = 69564
  - 13:36:43,537 INFO org.apache.solr.core.SolrCore:
[CloudCollection_shard1_replica3] Registered new searcher
Searcher@79d1be80[CloudCollection_shard1_replica3]

*SUMMARY:*

The application code returns from commit (which took about two minutes)
before replica2 and replica3 have finished committing and opened a new
searcher. At the time the query is sent, only Solr server 1
(CloudCollection_shard1_replica1) has finished its commit and is ready to
return the correct result. The query is received by Solr server 3
(CloudCollection_shard1_replica3) while its commit is still in progress and
its new searcher is not yet open, so it returns zero results.

I checked time on all hosts and they were all in sync.
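
A minimal SolrJ sketch of the pattern (the ZooKeeper address and the query
here are illustrative placeholders, not our exact code):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CommitThenQuery {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zkhost:2181"); // placeholder ZK address
            server.setDefaultCollection("CloudCollection");

            // waitFlush = true, waitSearcher = true, softCommit = true.
            // This call should block until new searchers are open, but as
            // the logs above show, it can return before every replica has
            // registered its searcher.
            server.commit(true, true, true);

            // The query is issued immediately after commit() returns, so a
            // replica that is still warming may answer with stale results.
            QueryResponse rsp = server.query(new SolrQuery("*:*"));
            System.out.println("hits: " + rsp.getResults().getNumFound());

            server.shutdown();
        }
    }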

Any opinions/explanations are appreciated.

Thanks,

Victor





locks and high CPU

2015-10-22 Thread Rallavagu

Solr 4.6.1 cloud

Looking into a thread dump, 4-5 threads are driving CPU very high and
causing issues. These are Tomcat's HTTP threads, and they are locking. Can
anybody help me understand what is going on here? I see incoming
connections for updates being passed on to StreamingSolrServers and
subsequently ConcurrentUpdateSolrServer, and they both hold locks. Thanks.



"http-bio-8080-exec-4394" id=8774 idx=0x988 tid=14548 prio=5 alive, 
native_blocked, daemon

at __lll_lock_wait+34(:0)@0x38caa0e262
at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7fc29b9c9138
at trapiNormalHandler+484(traps_posix.c:220)@0x7fc29b9fd745
at _L_unlock_16+44(:0)@0x38caa0f710
at 
java/util/concurrent/locks/ReentrantLock.lock(ReentrantLock.java:262)[optimized]
at 
org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:391)[inlined]
at 
org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
at 
org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
at 
org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
at 
org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
^-- Holding lock: 
org/apache/solr/update/StreamingSolrServers$1@0x496cf6e50[biased lock]
^-- Holding lock: 
org/apache/solr/update/StreamingSolrServers@0x49d32adc8[biased lock]
at 
org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
at 
org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]

at org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
at 
org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
at 
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
at 
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
at 
org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
at 
org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)[optimized]
at 
org/apache/catalina/core/StandardWrapperValve.invoke(StandardWrapperValve.java:222)[optimized]
at 
org/apache/catalina/core/StandardContextValve.invoke(StandardContextValve.java:123)[optimized]
at 
org/apache/catalina/core/StandardHostValve.invoke(StandardHostValve.java:171)[optimized]
at 
org/apache/catalina/valves/ErrorReportValve.invoke(ErrorReportValve.java:99)[optimized]
at 
org/apache/catalina/valves/AccessLogValve.invoke(AccessLogValve.java:953)[optimized]
at 
org/apache/catalina/core/StandardEngineValve.invoke(StandardEngineValve.java:118)[optimized]
at 
org/apache/catalina/connector/CoyoteAdapter.service(CoyoteAdapter.java:408)[optimized]
at 
org/apache/coyote/http11/AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)[optimized]
at 
org/apache/coyote/AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)[optimized]
at 
org/apache/tomcat/util/net/JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)[optimized]
^-- Holding lock: 
org/apache/tomcat/util/net/SocketWrapper@0x496e58810[thin lock]
at 
java/util/concurrent/ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)[inlined]
at 
java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)[optimized]

at java/lang/Thread.run(Thread.java:682)[optimized]
at jrockit/vm/RNI.c2java(J)V(Native Method)


Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Hi Alexandre,
The DIH is executed correctly and the tokenized representation is obtained
correctly, but the URP chain is not executed with the call:
http://localhost:8983/solr/reed_jobs/update/details?commit=true
Isn't that the correct URL? Is there a parameter missing?
Best,
Roxana



On 22 October 2015 at 16:17, Alexandre Rafalovitch 
wrote:

> Well, I guess I imagined three steps:
> 1) Run DIH
> 2) Get the tokenized representation for each document using facets or
> other approaches
> 3) Submit document partial-update request with additional custom
> processing through URP
>
> Your example seems to be skipping step 2, so the URP chain does not
> know which documents to actually work on and is basically an empty
> call.
>
> Again, I suspect knowing the business objectives may bring other
> solutions to the front.
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 22 October 2015 at 10:49, Roxana Danger
>  wrote:
> > Hi Alex,
> >
> > My idea behind this is to avoid two calls: first the importer and then the
> > updater. As there is an update processor chain that can be used after the
> > DIH, I thought it was possible to get a real-time updater.
> >
> > So, I am taking your advice and dividing the process into different steps.
> > I have the following configuration:
> >
> > <updateRequestProcessorChain name="retrieveDetails">
> >   <processor class="..."/>
> >   <processor class="..."/>
> >   <processor class="..."/>
> > </updateRequestProcessorChain>
> >
> > <requestHandler name="/update/details" class="...">
> >   <lst name="defaults">
> >     <str name="update.chain">retrieveDetails</str>
> >   </lst>
> > </requestHandler>
> >
> > <requestHandler name="/dataimport"
> >     class="org.apache.solr.handler.dataimport.DataImportHandler">
> >   <lst name="defaults">
> >     <str name="config">db-data-config.xml</str>
> >   </lst>
> > </requestHandler>
> >
> > So, after the import (notice the DIH handler does not contain
> > update.chain), I have tried to run the update with the following request:
> > http://localhost:8983/solr/reed_jobs/update/details?commit=true
> > but it returns immediately with status 0 and does not execute the
> > update... How should the update be called to reindex/update all the
> > imported docs with my chain?
> >
> >
> > Best regards,
> > Roxana
> >
> >
> > On 22 October 2015 at 14:14, Alexandre Rafalovitch 
> > wrote:
> >
> >> You are doing things out of order. It's DIH, URP, then indexer. Any
> >> attempt to subvert that order for the record being indexed will end in
> >> problems.
> >>
> >> Have you considered doing a dual path? Index, then update. Of course,
> >> your fields all need to be stored for that.
> >>
> >> Also, perhaps you need to rethink the problem on a higher level. If
> >> all you need to do is to extract tokenized content of a field during
> >> search, you can do that in several ways, such as faceting on that
> >> field, or - I believe - using terms end-point.
> >>
> >> Regards,
> >>   Alex.
> >> 
> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 22 October 2015 at 06:20, Roxana Danger
> >>  wrote:
> >> > Hello,
> >> >
> >> > I would like to create an updateRequestProcessorChain that should be
> >> > executed after a DB DIH. I am extending the UpdateRequestProcessorFactory
> >> > and UpdateRequestProcessor classes. The processAdd method of my
> >> > UpdateRequestProcessor should be able to update the documents with the
> >> > indexed terms associated with a field. Notice that these terms should
> >> > have been extracted with an analyzer before my
> >> > updateRequestProcessorChain processor begins to execute.
> >> >
> >> > The problem I am getting is that at the point where processAdd is
> >> > executed, the field containing the terms has not been filled. To
> >> > retrieve the terms I am using the SolrIndexSearcher provided with the
> >> > request (req.getSearcher()). However, it seems that this searcher uses
> >> > only the data physically stored and does not consider any of the
> >> > imported data.
> >> >
> >> > Any idea how I can access a searcher with all indexed/cached data when
> >> > the processAdd method is executed?
> >> >
> >> > Thank you very much in advance.
> >>
> >
> >
> >
> > --
> > Roxana Danger | Data Scientist, Dragon Court, 27-29 Macklin Street,
> > London, WC2B 5LX. Tel: 020 7067 4568. reed.co.uk, the UK's #1 job site.
>



--
Roxana Danger | Data Scientist, Dragon Court, 27-29 Macklin Street, London,
WC2B 5LX. Tel: 020 7067 4568. reed.co.uk, the UK's #1 job site.
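
A rough SolrJ sketch of the two-pass approach Alexandre describes: run the
DIH first, then send partial (atomic) updates through the custom chain by
naming it with the update.chain request parameter. The chain name
retrieveDetails comes from the config above; the document id and field name
are placeholders, and atomic updates assume the other fields are stored:

    import java.util.Collections;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class SecondPassUpdate {
        public static void main(String[] args) throws Exception {
            SolrServer server = new HttpSolrServer("http://localhost:8983/solr/reed_jobs");

            // Atomic (partial) update: "set" replaces the field's value on
            // the already-indexed document.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "some-doc-id"); // placeholder unique key
            doc.addField("terms_field",        // placeholder field name
                Collections.singletonMap("set", "extracted terms"));

            // update.chain names the URP chain explicitly on this request,
            // so no handler default is required.
            UpdateRequest req = new UpdateRequest();
            req.setParam("update.chain", "retrieveDetails");
            req.add(doc);
            req.process(server);

            server.commit();
        }
    }

Whether a chain configured on the DIH handler itself fires during an import
is exactly the open question in this thread.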

Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Shawn Heisey
On 10/22/2015 10:09 AM, Roxana Danger wrote:
> The DIH is executed correctly and the tokenized representation is obtained
> correctly, but the URP chain is not executed with the call:
> http://localhost:8983/solr/reed_jobs/update/details?commit=true
> Isn't that the correct URL? Is there a parameter missing?

The only way I've found to get an update chain to be used by DIH is to
make it default.  From what I've been able to determine, and I have not
verified this, DIH does *not* use the update request handler(s) ... it
updates the index in a more direct manner.

If there is a way to use a custom chain (not the default) with DIH, I'd
love to know about it.

Thanks,
Shawn



Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Yes, it arrives there...


On 22 October 2015 at 17:32, Erik Hatcher  wrote:

> Setting “update.chain” in the DataImportHandler handler defined in
> solrconfig.xml should allow you to specify which update chain is used.  Can
> you confirm that works, Shawn?
>
> This is from DataImportHandler.java:
>
>   UpdateRequestProcessorChain processorChain =
> req.getCore().getUpdateProcessorChain(params);
>   UpdateRequestProcessor processor = processorChain.createProcessor(req,
> rsp);
>   SolrResourceLoader loader = req.getCore().getResourceLoader();
>   DIHWriter sw = getSolrWriter(processor, loader, requestParams, req);
>
>
>
>
>
>
> > On Oct 22, 2015, at 12:19 PM, Shawn Heisey  wrote:
> >
> > On 10/22/2015 10:09 AM, Roxana Danger wrote:
> >> The DIH is executed correctly and the tokenized representation is obtained
> >> correctly, but the URP chain is not executed with the call:
> >> http://localhost:8983/solr/reed_jobs/update/details?commit=true
> >> Isn't that the correct URL? Is there a parameter missing?
> >
> > The only way I've found to get an update chain to be used by DIH is to
> > make it default.  From what I've been able to determine, and I have not
> > verified this, DIH does *not* use the update request handler(s) ... it
> > updates the index in a more direct manner.
> >
> > If there is a way to use a custom chain (not the default) with DIH, I'd
> > love to know about it.
> >
> > Thanks,
> > Shawn
> >
>
>


--
Roxana Danger | Data Scientist, Dragon Court, 27-29 Macklin Street, London,
WC2B 5LX. Tel: 020 7067 4568. reed.co.uk, the UK's #1 job site.



Re: locks and high CPU

2015-10-22 Thread Erick Erickson
Prior to Solr 5.2, there were several inefficiencies when distributing
updates to replicas, see:
https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/.

The symptom was that there was significantly higher CPU utilization on the
followers compared to the leader.

The only real fix is to upgrade to 5.2+, assuming that's your issue.

How are you indexing? Using SolrJ with CloudSolrServer would help if you're
not already using it.

Best,
Erick
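
As a rough illustration of that suggestion (the ZooKeeper ensemble,
collection, and field names are placeholders), batched indexing through
CloudSolrServer might look like:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // CloudSolrServer is ZooKeeper-aware, so it can send updates
            // toward shard leaders instead of relying on extra hops.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("title", "document " + i);
                batch.add(doc);
                if (batch.size() == 100) { // send batches, not single docs
                    server.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.commit();
            server.shutdown();
        }
    }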

On Thu, Oct 22, 2015 at 9:43 AM, Rallavagu  wrote:
> Solr 4.6.1 cloud
>
> Looking into a thread dump, 4-5 threads are driving CPU very high and causing
> issues. These are Tomcat's HTTP threads, and they are locking. Can anybody help me
> understand what is going on here? I see incoming connections
> for updates being passed on to StreamingSolrServers and
> subsequently ConcurrentUpdateSolrServer, and they both hold locks. Thanks.
>
>
> "http-bio-8080-exec-4394" id=8774 idx=0x988 tid=14548 prio=5 alive,
> native_blocked, daemon
> at __lll_lock_wait+34(:0)@0x38caa0e262
> at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7fc29b9c9138
> at trapiNormalHandler+484(traps_posix.c:220)@0x7fc29b9fd745
> at _L_unlock_16+44(:0)@0x38caa0f710
> at
> java/util/concurrent/locks/ReentrantLock.lock(ReentrantLock.java:262)[optimized]
> at
> org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:391)[inlined]
> at
> org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
> at
> org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
> at
> org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
> at
> org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
> ^-- Holding lock:
> org/apache/solr/update/StreamingSolrServers$1@0x496cf6e50[biased lock]
> ^-- Holding lock:
> org/apache/solr/update/StreamingSolrServers@0x49d32adc8[biased lock]
> at
> org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
> at
> org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]
> at org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
> at
> org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
> at
> org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
> at
> org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
> at
> org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
> at
> org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)[optimized]
> at
> org/apache/catalina/core/StandardWrapperValve.invoke(StandardWrapperValve.java:222)[optimized]
> at
> org/apache/catalina/core/StandardContextValve.invoke(StandardContextValve.java:123)[optimized]
> at
> org/apache/catalina/core/StandardHostValve.invoke(StandardHostValve.java:171)[optimized]
> at
> org/apache/catalina/valves/ErrorReportValve.invoke(ErrorReportValve.java:99)[optimized]
> at
> org/apache/catalina/valves/AccessLogValve.invoke(AccessLogValve.java:953)[optimized]
> at
> org/apache/catalina/core/StandardEngineValve.invoke(StandardEngineValve.java:118)[optimized]
> at
> org/apache/catalina/connector/CoyoteAdapter.service(CoyoteAdapter.java:408)[optimized]
> at
> org/apache/coyote/http11/AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)[optimized]
> at
> org/apache/coyote/AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)[optimized]
> at
> org/apache/tomcat/util/net/JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)[optimized]
> ^-- Holding lock:
> org/apache/tomcat/util/net/SocketWrapper@0x496e58810[thin lock]
> at
> java/util/concurrent/ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)[inlined]
> at
> java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)[optimized]
> at java/lang/Thread.run(Thread.java:682)[optimized]
> at jrockit/vm/RNI.c2java(J)V(Native Method)


Re: Solr Full text search

2015-10-22 Thread Shawn Heisey
On 10/22/2015 10:37 AM, vitaly bulgakov wrote:
> But it returns no results when the query has a term which is not in a
> document. 
> Say searching for "building constructor" I get a result, but
> searching for "good building constructor" returns no results because there
> are no documents containing all three terms. 
> 
> What should I do to be able to search ignoring non-existing terms? 

There are two reasons for this to fail.

One reason is that your query is expecting all three terms to be in the
index -- using AND for the default operator.  This might be done with
the q.op parameter if you're using the standard query parser, or the mm
parameter (set to 100%) if you're using dismax/edismax.  Older versions
of Solr will let you configure the default operator in schema.xml, but
this is discouraged, and I don't think it works in Solr 5.x.  The
default operator defaults to OR, but many people will set that to AND.

The other possible reason is that you are passing the query text with
the quotes, making it a phrase query, which means that the terms must
all be present in the index, must be next to each other, and must be in
that precise order.  If this is the problem, try using q=(good building
constructor) instead of q="good building constructor".
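
For instance, a rough SolrJ sketch of these options (parser and parameter
choices here are illustrative, not taken from your setup):

    import org.apache.solr.client.solrj.SolrQuery;

    public class QueryForms {
        public static void main(String[] args) {
            // Unquoted terms with OR semantics: a document matching any one
            // of the terms qualifies.
            SolrQuery anyTerm = new SolrQuery("good building constructor");
            anyTerm.set("q.op", "OR"); // standard (lucene) query parser

            // edismax alternative: require only some of the terms to match.
            SolrQuery edismax = new SolrQuery("good building constructor");
            edismax.set("defType", "edismax");
            edismax.set("mm", "1"); // at least one term must match

            // Quoted input is a phrase query: all terms, adjacent, in order.
            SolrQuery phrase = new SolrQuery("\"good building constructor\"");

            System.out.println(anyTerm + "\n" + edismax + "\n" + phrase);
        }
    }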

Thanks,
Shawn



Re: getting cached terms inside UpdateRequestProcessor...

2015-10-22 Thread Roxana Danger
Yes, it's working now... but I cannot use the update processor chain. I
need to run the DIH first and then use the URP, but I am not having luck
updating my docs with the URL:
http://localhost:8983/solr/reed_jobs/update/jtdetails?commit=true

Did you manage to use an updateProcessor chain after using the DIH without
using the update.chain parameter?

Cheers,
Roxana


On 22 October 2015 at 17:42, Shawn Heisey  wrote:

> On 10/22/2015 10:32 AM, Erik Hatcher wrote:
> > Setting “update.chain” in the DataImportHandler handler defined in
> > solrconfig.xml should allow you to specify which update chain is used.  Can
> > you confirm that works, Shawn?
>
> I tried this a couple of years ago without luck.  Does it work now?
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201308.mbox/%3c6c93c1a4-63ac-4cad-9f5b-c74f497c6...@gmail.com%3E
>
> In the first email of the thread, I indicated I had tried 4.4 and
> 4.5-SNAPSHOT.
>
> Thanks,
> Shawn
>
>


--
Roxana Danger | Data Scientist, Dragon Court, 27-29 Macklin Street, London,
WC2B 5LX. Tel: 020 7067 4568. reed.co.uk, the UK's #1 job site.