Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2019-01-11 Thread Zheng Lin Edwin Yeo
Thanks for your reply.

What I have found is that in the EML file there are two Content-Type parts:
one is text/html, and the other is text/plain.

The text/html part contains words like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*"
in the content, but the text/plain part has no such words, and the content is
clean (just what is in the email).

As such, I believe that the indexing is done on the text/html part. Is
there any way that we can change the settings so that the indexing is done
on the text/plain part?
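If changing which MIME part gets indexed turns out to be awkward on the Solr side, one workaround is to extract the text/plain part yourself before sending documents to Solr. A minimal sketch using only Python's standard email module (the helper name is illustrative, not any Solr API):

```python
from email import message_from_string
from email.policy import default

def plain_text_of(raw_eml: str) -> str:
    """Return only the text/plain body of an EML message, skipping text/html."""
    msg = message_from_string(raw_eml, policy=default)
    parts = []
    for part in msg.walk():
        # Keep only leaf parts whose declared type is text/plain
        if part.get_content_type() == "text/plain":
            parts.append(part.get_content())
    return "\n".join(parts)
```

The returned string can then be posted to Solr as the field value, so the CSS-laden HTML alternative never reaches the index.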

Regards,
Edwin

On Wed, 2 Jan 2019 at 03:27, Gus Heck  wrote:

> Although Vincenzo and Alexandre's suggestions may be helpful in the right
> circumstances, there is a continuum of answers to the original question
> here. This continuum is mostly relevant if indexing and querying are likely
> to happen simultaneously or the data volume is large enough relative to the
> server to make you wish indexing would finish faster. Otherwise
> maintainability, local talent and time investment concerns probably
> dominate, with the caveat that in many cases, initial success may lead to a
> future with large data volumes or where querying and indexing do become
> simultaneous.
>
> 1) Vincenzo's answer would be suitable for a single or a few small fields
> with a very narrow set of possible html like tags. If the number of
> patterns that need to be matched is high or the length of the text for
> matching is long I would expect this solution to begin to negatively impact
> performance.
>
> 2) Alexandre's suggestion is much better in the case where there is a
> moderate amount of text and the input could be generalized html, but as the
> amount of text that needs to have html stripped grows the performance of
> the server will also degrade faster than necessary with increased indexing
> load.
>
> 3) If the SolrCloud you are indexing into will simultaneously need to
> provide good response times for queries, and you are not able to supply
> it with an overabundance of hardware relative to the query/indexing load,
> then you should consider pre-processing the documents in an external
> ingestion system such as JesterJ, Fusion, or a variety of other solutions
> out there. As the indexing and query load goes up, the best practice is to
> move as much pre-processing work out of solr as possible so that solr can
> continue to do what it does well and return queries quickly.
>
> In the end, like most engineering decisions, it's a cost trade-off: what
> costs more, investing in setting up external processing or investing in
> server hardware? If it's a small amount of data loaded
> batch style prior to querying, you are in a good place and any of these
> will work. Just do whatever is fastest/easiest to implement. If you need to
> support a high volume of data being loaded into solr in a timely manner or
> you require minimal impact to query latency due to indexing, you want some
> variation of 3.
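The pre-processing in option 3 boils down to stripping markup before documents ever reach Solr. A rough, framework-neutral sketch using only Python's standard html.parser (this stands in for whatever the ingestion system would do; it is not JesterJ or Fusion code):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only character data, dropping tags plus <style>/<script> bodies."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <style>/<script> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def strip_html(html: str) -> str:
    """Return the visible text of an HTML fragment."""
    p = TextExtractor()
    p.feed(html)
    return "".join(p.chunks).strip()
```

Running this in the ingestion pipeline means Solr only ever sees clean text, so no analysis-chain CPU is spent on markup at query-sensitive times.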
>
> -Gus
>
> On Sun, Dec 30, 2018 at 10:29 PM Alexandre Rafalovitch wrote:
>
> > Specifically, a custom Update Request Processor chain can be used before
> > indexing, probably with HTMLStripFieldUpdateProcessorFactory.
> > Regards,
> >  Alex
> >
> > On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore  wrote:
> >
> > > Hi,
> > >
> > > I think this kind of text manipulation should be done before indexing,
> if
> > > you have font-size font-family in your text, very likely you’re
> indexing
> > an
> > > html with css.
> > > If I’m right, you’re just entering a hell of words that should be
> > > removed from your text.
> > >
> > > On the other hand, if you have to do this at index time, a quick and
> > dirty
> > > solution is using the pattern-replace filter.
> > >
> > >
> > >
> >
> https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
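For a feel of what a pattern-replace step does, the same idea can be tried with a plain regex outside Solr. The pattern below is only an illustration for inline CSS fragments like the ones quoted earlier (it is deliberately naive and would also eat things like URLs containing colons); a real filter configuration needs a pattern tuned to the actual data:

```python
import re

# Illustrative pattern: drop inline CSS declarations such as
# "FONT-SIZE: 9pt; FONT-FAMILY: arial" from a piece of text.
CSS_DECL = re.compile(r"\b[A-Z-]+:\s*[^;\"']+;?", re.IGNORECASE)

def strip_css(text: str) -> str:
    # Remove declarations, then collapse the leftover double spaces
    return re.sub(r"\s{2,}", " ", CSS_DECL.sub("", text)).strip()
```

The Solr filter applies the equivalent regex per token at analysis time, which is why this approach degrades as pattern count and text length grow.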
> > >
> > > Ciao,
> > > Vincenzo
> > >
> > > --
> > > mobile: 3498513251
> > > skype: free.dev
> > >
> > > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo 
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > I noticed that during the indexing of EML files, there are words like
> > > > "*FONT-SIZE:
> > > > 9pt; FONT-FAMILY: arial*" that are being indexed into the content as
> > > well.
> > > >
> > > > Would like to check, how are we able to remove those words during the
> > > > indexing?
> > > >
> > > > I am using Solr 7.5.0
> > > >
> > > > Regards,
> > > > Edwin
> > >
> >
>
>
> --
> http://www.the111shift.com
>


Re: SOLR v7 Security Issues Caused Denial of Use - Sonatype Application Composition Report

2019-01-11 Thread Bob Hathaway
Hi Shawn,

Thanks for the great answers. Thanks also to Jörn Franke and Gus Heck
for responding. The images were sent for convenience of the issues listed
below them. We are working to get infosec approval.

It would be helpful to put the security links prominently on the Solr
splash and download pages.

I also found these links to be useful:

This is the Solr Security wiki page, with a list of the CVEs that
Sonatype reports:


https://wiki.apache.org/solr/SolrSecurity#Solr_and_Vulnerability_Scanning_Tools

Apache » Solr: Security Vulnerabilities

https://www.cvedetails.com/vulnerability-list/vendor_id-45/product_id-18263/Apache-Solr.html


-- Forwarded message -
From: Shawn Heisey 
Date: Fri, Jan 4, 2019 at 1:49 PM
Subject: Re: SOLR v7 Security Issues Caused Denial of Use - Sonatype
Application Composition Report
To: 


On 1/3/2019 11:15 AM, Bob Hathaway wrote:
> We want to use SOLR v7 but Sonatype scans past v6.5 show dozens of
> critical and severe security issues and dozens of licensing issues.

None of the images that you attached to your message are visible to us.
Attachments are regularly stripped by Apache mailing lists and cannot be
relied on.

Some of the security issues you've mentioned could be problems.  But if
you follow recommendations and make sure that Solr is not directly
accessible to unauthorized parties, it will not be possible for those
parties to exploit security issues without first finding and exploiting
a vulnerability on an authorized system.

Vulnerabilities in SolrJ, if any exist, are slightly different, but
unless unauthorized parties have the ability to *directly* send input to
SolrJ code without intermediate code sanitizing the input, they will not
be able to exploit those vulnerabilities. JSON support in SolrJ is
provided by noggit, not jackson, and JSON/XML are not used by recent
versions of SolrJ unless they are very specifically requested by the
programmer.  Are there any vulnerabilities you've found that affect
SolrJ itself, separately from the rest of Solr?

As we become aware of issues with either project code or third-party
software, we get them fixed.  Sometimes it is not completely
straightforward to upgrade to newer versions of third-party software,
but staying current is a priority.

Licensing issues are of major concern to the entire Apache Foundation.
As a project, we are unaware of any licensing problems at this time.
All of the third-party software that is included with Solr should be
available under a license that is compatible with the Apache license.  I
didn't examine the list you sent super closely, but what I did look at
didn't look like a problem.

https://www.apache.org/legal/resolved.html#category-b

The mere presence of GPL in the available licenses for third party
software is not an indication of a problem.  If that were the ONLY
license available, then it would be a problem.

Thanks,
Shawn


Re: what are the best client interface ?

2019-01-11 Thread markus kalkbrenner
The latest module versions for Drupal and TYPO3 now both use the Solarium
library.
I think Solarium is the most used PHP library for Solr, and it is the most
active project.

But as one of the maintainers of the Drupal integration and the solarium 
library itself, my opinion might not be totally objective ;-)

Markus Kalkbrenner
Dipl.-Ing. (FH) techn. Informatik
CTO

bio.logis Genetic Information Management GmbH



> On 11.01.2019 at 19:19, Davis, Daniel (NIH/NLM) [C] wrote:
> 
> WordPress and Drupal both have ways to interface with Solr through 
> plugins/modules.   Not sure that describes your PHP website.
> 
> I like Ruby on Rails "projectblacklight" for an easy and usable discovery 
> layer.
> 
> We are a Python/Django shop - we've had good luck with Django-haystack and 
> pysolr.
> 
>> -Original Message-
>> From: said 
>> Sent: Friday, January 11, 2019 9:45 AM
>> To: solr-user@lucene.apache.org
>> Subject: what are the best client interface ?
>> 
>> I want to integrate my *Solr* search engine with my *PHP* website and I
>> hesitate over doing interface with *Velocity UI *or with *Solarium* ? what
>> do you think about ?
>> Thank you for help.
>> 
>> 
>> 
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: 6.3 -> 6.4 Sorting responseWriter renamed

2019-01-11 Thread Raveendra Yerraguntla

Hi Joel,
Thanks for the quick response.
Our current usage is below. Could you guide me in using the new class and its
write method?

public class customSearchHandler extends SearchHandler {

    @Override
    public void inform(SolrCore core) {
        super.inform(core);
        // ...
        core.registerResponseWriter("xsort", new SortingResponseWriter() {
            @Override
            public void write(Writer out, SolrQueryRequest req, SolrQueryResponse response)
                    throws IOException {
                try {
                    if (handleResponseWriter) {
                        CustomController.singleton.prepareThreadForWork();
                    }
                    super.write(out, req, response);
                } finally {
                    CustomController.singleton.releaseFromThread();
                }
            }
        });
    }
}

The signature of the new class is:

    ExportWriter(SolrQueryRequest req, SolrQueryResponse res, String wt)

and the write method is:

    public void write(OutputStream os) throws IOException


On Friday, January 11, 2019, 1:55:15 PM EST, Joel Bernstein 
 wrote:  
 
 The functionality should be exactly the same. The config files though need
to be changed. I would recommend adding any custom configs that you have to
the new configs following the ExportWriter changes.


Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Jan 10, 2019 at 11:21 AM Raveendra Yerraguntla
 wrote:

> Hello All,
>
> In 6.4 (SOLR-9717), SortingResponseWriter is renamed to ExportWriter and
> moved to a different package.
>
> For migrating to higher Solr (post-6.4) versions, I need help with
> compatible functionality.
>
>
> Application is using SortingResponseWriter in the search handler's
> inform method to register responseWriters for the xSort.
>
> Since the class and the write method's signature have changed, what are the
> alternative ways to use the functionality?
>
>
>  Thanks
> Ravi
>
>
  

Log4j Configuration

2019-01-11 Thread deathbycaramel
Hi,

I'm running solr v6.6.5 using a pretty generic log4j properties file:

# Default Solr log4j config
# rootLogger log level may be programmatically overridden by -Dsolr.log.level
solr.log=${solr.log.dir}
log4j.rootLogger=INFO, file, CONSOLE

# Console appender will be programmatically disabled when Solr is started with 
option -Dsolr.log.muteconsole
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.EnhancedPatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p (%t) [%X{collection} %X{shard} %X{replica} %X{core}] %c{1.} %m%n

#- size rotation with log cleanup.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.MaxFileSize=4MB
log4j.appender.file.MaxBackupIndex=9

#- File to log to and log format
log4j.appender.file.File=${solr.log}/solr.log
log4j.appender.file.layout=org.apache.log4j.EnhancedPatternLayout
#log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss.SSS/zzz} %-5p (%t) [%X{collection} %X{shard} %X{replica} %X{core}] %c{1.} %m%n
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss}{America/New_York} %-5p (%t) [%X{collection} %X{shard} %X{replica} %X{core}] %c{1.} %m%n

# Adjust logging levels that should differ from root logger
log4j.logger.org.apache.zookeeper=WARN
log4j.logger.org.apache.hadoop=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.server.Server=INFO
log4j.logger.org.eclipse.jetty.server.ServerConnector=INFO

# set to INFO to enable infostream log messages
log4j.logger.org.apache.solr.update.LoggingInfoStream=OFF

This is working.

It would be nice if there were a way to log the client's HTTP headers. We
eventually want to use them as an element in our log analysis software. Other
elements of the HTTP request would be helpful too, if that's possible. From
reading, I see there is a way to write to more than one log; I've tried it,
but it isn't working.

From reading, it seems like I should be able to log a lot of information with
log4j. I get the impression the Solr implementation might be a 'lighter'
version tailored specifically for debugging and monitoring; not sure if that's
true.

Here is my latest log4j.properties that isn't working for reference:

# Default Solr log4j config
# rootLogger log level may be programmatically overridden by -Dsolr.log.level
solr.log=${solr.log.dir}

log4j.rootLogger=INFO, file, CONSOLE

# Console appender will be programmatically disabled when Solr is started with 
option -Dsolr.log.muteconsole
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.EnhancedPatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p (%t) [%X{collection} %X{shard} %X{replica} %X{core}] %c{1.} %m%n

#- size rotation with log cleanup.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.MaxFileSize=4MB
log4j.appender.file.MaxBackupIndex=9

#- File to log to and log format
log4j.appender.file.File=${solr.log}/solr.log
log4j.appender.file.layout=org.apache.log4j.EnhancedPatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p (%t) [%X{collection} %X{shard} %X{replica} %X{core}] %c{1.} %m%n

#-
#- Logger for HTTP requests
log4j.logger.httplog=INFO, httplog

#- size rotation with log cleanup.
log4j.appender.httplog=org.apache.log4j.RollingFileAppender
log4j.appender.httplog.MaxFileSize=4MB
log4j.appender.httplog.MaxBackupIndex=9

#- File to log to and log format
log4j.appender.httplog.File=${solr.log}/solr_http.log
log4j.appender.httplog.layout=org.apache.log4j.PatternLayout
#log4j.appender.httplog.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m\n
log4j.appender.httplog.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p (%t) [%X{collection} %X{shard} %X{replica} %X{core}] %c{1.} %m%n
#-

# Adjust logging levels that should differ from root logger
#log4j.logger.org.apache.zookeeper=WARN
#log4j.logger.org.apache.hadoop=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.server.Server=INFO
log4j.logger.org.eclipse.jetty.server.ServerConnector=INFO
log4j.logger.org.apache.solr.core.SolrCore.Request=DEBUG
log4j.logger.org.apache.solr.search=DEBUG

# set to INFO to enable infostream log messages
log4j.logger.org.apache.solr.update.LoggingInfoStream=OFF

Thank you for any assistance.
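One avenue worth checking before bending log4j to this task: HTTP-level details such as client headers are normally captured by the servlet container rather than by Solr's own loggers. Solr's bundled Jetty ships a commented-out request-log section in server/etc/jetty.xml roughly like the following (file names, retention, and time zone here are examples, and the exact element names depend on the Jetty version bundled with your Solr, so compare against your own jetty.xml):

```xml
<Ref id="Handlers">
  <Call name="addHandler">
    <Arg>
      <New id="RequestLog" class="org.eclipse.jetty.server.handler.RequestLogHandler">
        <Set name="requestLog">
          <New id="RequestLogImpl" class="org.eclipse.jetty.server.NCSARequestLog">
            <Set name="filename">logs/request.yyyy_mm_dd.log</Set>
            <Set name="retainDays">30</Set>
            <Set name="append">true</Set>
            <!-- extended NCSA format adds Referer and User-Agent headers -->
            <Set name="extended">true</Set>
            <Set name="logTimeZone">America/New_York</Set>
          </New>
        </Set>
      </New>
    </Arg>
  </Call>
</Ref>
```

That writes one NCSA-format line per request to its own file, which most log-analysis tools parse out of the box.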

Re: 6.3 -> 6.4 Sorting responseWriter renamed

2019-01-11 Thread Joel Bernstein
The functionality should be exactly the same. The config files though need
to be changed. I would recommend adding any custom configs that you have to
the new configs following the ExportWriter changes.


Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Jan 10, 2019 at 11:21 AM Raveendra Yerraguntla
 wrote:

> Hello All,
>
> In 6.4 (SOLR-9717), SortingResponseWriter is renamed to ExportWriter and
> moved to a different package.
>
> For migrating to higher Solr (post-6.4) versions, I need help with
> compatible functionality.
>
>
> Application is using SortingResponseWriter in the search handler's
> inform method to register responseWriters for the xSort.
>
> Since the class and the write method's signature have changed, what are the
> alternative ways to use the functionality?
>
>
>  Thanks
> Ravi
>
>


RE: what are the best client interface ?

2019-01-11 Thread Davis, Daniel (NIH/NLM) [C]
WordPress and Drupal both have ways to interface with Solr through 
plugins/modules.   Not sure that describes your PHP website.

I like Ruby on Rails "projectblacklight" for an easy and usable discovery layer.

We are a Python/Django shop - we've had good luck with Django-haystack and 
pysolr.

> -Original Message-
> From: said 
> Sent: Friday, January 11, 2019 9:45 AM
> To: solr-user@lucene.apache.org
> Subject: what are the best client interface ?
> 
> I want to integrate my *Solr* search engine with my *PHP* website and I
> hesitate over doing interface with *Velocity UI *or with *Solarium* ? what
> do you think about ?
> Thank you for help.
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


what are the best client interface ?

2019-01-11 Thread said
I want to integrate my *Solr* search engine with my *PHP* website, and I
am hesitating between building the interface with *Velocity UI* or with
*Solarium*. What do you think?
Thank you for your help.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Schema.xml, copyField, Slash, ignoreCase ?

2019-01-11 Thread Steve Rowe
Hi Bruno,

ignoreCase: Looks like you already have achieved this?

auto truncation: This is caused by inclusion of PorterStemFilterFactory in your 
"text_en" field type.  If you don't want its effects (i.e. treating different 
forms of the same word interchangeably), remove the filter.

process slash char: I think you want the slash to be included in symbol terms 
rather than interpreted as a term separator.  One way to achieve this is to 
first, pre-tokenization, convert the slash to a string that does not include a 
term separator, and then post-tokenization, convert the substituted string back 
to a slash.

Here's a version of your text_en that uses PatternReplaceCharFilterFactory[1] 
to convert slashes inside of symbol-ish terms (the pattern is a guess based on 
the symbol text you've provided; you'll likely need to adjust it) to "_": a 
string unlikely to otherwise occur, and which will not be interpreted by 
StandardTokenizer as a term separator; and then PatternReplaceFilterFactory[1] 
to convert "_" back to slashes.  Note that the patterns for the two are 
slightly different, since the *char filter* is given as input the entire field 
text, while the *filter* is given the text of single terms.

-----

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([A-Za-z]\d+[A-Za-z]\d+)/(\d+)" replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([a-z]\d+[a-z]\d+)_(\d+)" replacement="$1/$2"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([A-Za-z]\d+[A-Za-z]\d+)/(\d+)" replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([a-z]\d+[a-z]\d+)_(\d+)" replacement="$1/$2"/>
  </analyzer>
</fieldType>

-----

[1] 
http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.4.pdf

--
Steve
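The protect-then-restore round trip Steve describes can be sanity-checked outside Solr before committing to schema patterns. A small simulation (the symbol regex below is the same sort of guess Steve mentions and will need adjusting to the real symbol set):

```python
import re

# Guessed shape of a classification symbol: letter, digits, letter, digits, "/", digits
SYMBOL = re.compile(r"([A-Za-z]\d+[A-Za-z]\d+)/(\d+)")

def protect_slashes(field_text: str) -> str:
    """Char-filter stage: B65D81/28 -> B65D81_28, applied before tokenization."""
    return SYMBOL.sub(r"\1_\2", field_text)

def restore_slashes(token: str) -> str:
    """Token-filter stage: put the slash back inside a single term."""
    return re.sub(r"([A-Za-z]\d+[A-Za-z]\d+)_(\d+)", r"\1/\2", token)

text = "containers B65D81/28 takes precedence"
tokens = [restore_slashes(t) for t in protect_slashes(text).split()]
```

If the symbol survives the round trip as one token, the corresponding char filter / filter pair should behave the same way in the analysis chain.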


> On Jan 11, 2019, at 4:18 AM, Bruno Mannina  
> wrote:
> 
> I need to have default “text” field with:
> 
> - ignoreCase,
> 
> - no auto truncation,
> 
> - process slash char
> 
> 
> 
> I would like to perform only query on the field “text”
> 
> Queries can contain:  code or keywords or both.
> 
> 
> 
> I have 2 fields named symbol and title, and 1 alias ti (old field that I
> can’t delete or modify)
> 
> 
> 
> * Symbol contains code with slash (i.e A62C21/02)
> 
>  required="true" stored="true"/>
> 
> 
> 
> * Title contains English text and also symbol
> 
> stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
> 
> 
> 
> { "symbol": "B65D81/20",
> 
> "title": [
> 
> "under vacuum or superatmospheric pressure, or in a special atmosphere,
> e.g. of inert gas  {(B65D81/28  takes precedence; containers with
> pressurising means for maintaining ball pressure A63B39/025)} "
> 
> ]}
> 
> 
> 
> * Ti is an alias of title
> 
> stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
> 
> 
> 
> * Text is
> 
>  multiValued="true"/>
> 
> 
> 
> - Alias are:
> 
> 
> 
>
> 
>
> 
>
> 
>
> 
> 
> 
> 
> 
> If I do these queries :
> 
> 
> 
> * ti:airbag  → it’s ok
> 
> * title:airbag  → not good for me because it found airbags
> 
> * ti:b65D81/28  → not good, debug shows ti:b65d81 OR ti:28
> 
> * ti:”b65D81/28”  → it’s ok
> 
> * symbol:b65D81/28  → it’s ok (even without “ ”)
> 
> 
> 
> NOW with “text” field
> 
> * b65D81/28  → not good, debug shows text:b65d81 OR text:28
> 
> * airbag  → it’s ok
> 
> * “b65D81/28”  → it’s ok
> 
> 
> 
> It will be great if I can enter symbol without “ “
> 
> 
> 
> Could you help me to have a text field which solve this problem ? (please
> find below all def of my fields)
> 
> 
> 
> Many thanks for your help.
> 
> 
> 
> String_ci is my own definition
> 
> 
> 
> sortMissingLast="true" omitNorms="true">
> 
>
> 
>  
> 
>  
> 
>
> 
>
> 
> 
> 
> positionIncrementGap="100" multiValued="true">
> 
>  
> 
>
> 
> words="stopwords.txt" />
> 
>
> 
>  
> 
>  
> 
>
> 
> words="stopwords.txt" />
> 
> ignoreCase="true" expand="true"/>
> 
>
> 
>  
> 
>
> 
> 
> 
> positionIncrementGap="100">
> 
>  
> 
>
> 
> words="lang/stopwords_en.txt"/>
> 
>
> 
>
> 
> protected="protwords.txt"/>
> 
>
> 
>  
> 
>  
> 
>
> 
> ignoreCase="true" expand="true"/>
> 
> words="lang/stopwords_en.txt"/>
> 
>
> 
>
> 
>protected="protwords.txt"/>
> 
>
> 
>  
> 
>
> 
> 
> 
> 
> 
> Best Regards
> 
> Bruno
> 
> 
> 
> 
> 
> ---
> This email has been checked for viruses by Avast antivirus software.
> https://www.avast.com/antivirus



Re: REBALANCELEADERS is not reliable

2019-01-11 Thread Erick Erickson
bq: You have to check if the cores, participating in leadership
election, are _really_
in sync. And this must be done before starting any rebalance.
Sounds ugly... :-(

This _should_ not be necessary. I'll add parenthetically that leader
election has
been extensively re-worked in Solr 7.3+ though because "interesting" things
could happen.

Manipulating the leader election queue is really no different than
having to deal with, say, someone killing the leader un-gracefully. It  should
"just work". That said if you're seeing evidence to the contrary that's reality.

What do you mean by "stats" though? It's perfectly ordinary for there to
be different numbers of _deleted_ documents on various replicas, and
consequently things like term frequencies and doc frequencies being
different. What's emphatically _not_ expected is for there to be different
numbers of "live" docs.

"making sure nodes are in sync" is certainly an option. That should all
be automatic if you pause indexing and issue a commit, _then_
do a rebalance.

I certainly agree that the code is broken and needs to be fixed, but I
also have to ask how many shards are we talking here? The code was
originally written for the case where 100s of leaders could be on the
same node, until you get in to a significant number of leaders on
a single node (10s at least) there haven't been reliable stats showing
that it's a performance issue. If you have threshold numbers where
you've seen it make a material difference it'd be great to share them.

And I won't be getting back to this until the weekend, other urgent
stuff has come up...

Best,
Erick
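"Pause indexing and issue a commit, then do a rebalance" maps to two plain HTTP calls. The sketch below only builds the URLs (the Collections API action name is real; the base URL and collection name are placeholders):

```python
from urllib.parse import urlencode

def commit_url(base: str, collection: str) -> str:
    # Hard commit so the replicas in the leader-election queue are in sync
    return f"{base}/solr/{collection}/update?" + urlencode({"commit": "true"})

def rebalance_url(base: str, collection: str) -> str:
    # Collections API call that moves leadership to preferredLeader replicas
    return f"{base}/solr/admin/collections?" + urlencode(
        {"action": "REBALANCELEADERS", "collection": collection})

base = "http://localhost:8983"  # illustrative
```

Issuing the commit first, with indexing paused, removes the "replicas out of sync" failure mode Bernd describes before the election queue is re-queued.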

On Fri, Jan 11, 2019 at 12:58 AM Bernd Fehling
 wrote:
>
> Hi Erik,
> yes, I would be happy to test any patches.
>
> Good news, I got rebalance working.
> After running the rebalance about 50 times with debugger and watching
> the behavior of my problem shard and its core_nodes within my test cloud
> I came to the point of failure. I solved it and now it works.
>
> Bad news, rebalance is still not reliable and there are many more
> problems and points of failure initiated by rebalanceLeaders, or rather
> by re-queueing the watch list.
>
> How I located _my_ problem:
> Test cloud is 5 server (VM), 5 shards, 3 replica per shard, 1 java
> instance per server. 3 separate zookeepers.
> My problem, shard2 wasn't willing to rebalance to a specific core_node.
> core_nodes related (core_node1, core_node2, core_node10).
> core_node10 was the preferredLeader.
> It was just changing leader ship between core_node1 and core_node2,
> back and forth, whenever I called rebalanceLeader.
> First step, I stopped the server holding core_node2.
> Result, the leadership was staying at core_node1 whenever I called 
> rebalanceLeaders.
> Second step, from debugger I _forced_ during rebalanceLeaders the
> system to give the leadership to core_node10.
> Result, there was no leader anymore for that shard. Yes it can happen,
> you can end up with a shard having no leader but active core_nodes!!!
> To fix this I was giving preferredLeader to core_node1 and called 
> rebalanceLeaders.
> After that, preferredLeader was set back to core_node10 and I was back
> at the point I started, all calls to rebalanceLeaders kept the leader at 
> core_node1.
>
>  From the debug logs I got the hint about PeerSync of cores and 
> IndexFingerprint.
> The stats from my problem core_node10 showed that they differ from leader 
> core_node1.
> And the system notices the difference, starts a PeerSync and ends with 
> success.
> But actually the PeerSync seems to fail, because the stats of core_node1 and
> core_node10 still differ afterwards.
> Solution, I also stopped my server holding my problem core_node10, wiped all 
> data
> directories and started that server again. The core_nodes where rebuilt from 
> leader
> and now they are really in sync.
> Calling now rebalanceLeaders ended now with success to preferredLeader.
>
> My guess:
> You have to check if the cores, participating in leadership election, are 
> _really_
> in sync. And this must be done before starting any rebalance.
> Sounds ugly... :-(
>
> Next question, why is PeerSync not reporting an error?
> There is an info about "PeerSync START", "PeerSync Received 0 versions from
> ... fingerprint:null"
> and "PeerSync DONE. sync succeeded" but the cores are not really in sync.
>
> Another test I did (with my new knowledge about synced cores):
> - Removing all preferredLeader properties
> - stopping, wiping data directory, starting all server one by one to get
>all cores of all shards in sync
> - setting one preferredLeader for each shard but different from the actual 
> leader
> - calling rebalanceLeaders succeeded only at 2 shards with the first run,
>not for all 5 shards (even with really all cores in sync).
> - after calling rebalanceLeaders again the other shards succeeded also.
> Result, rebalanceLeaders is still not reliable.
>
> I have to mention that I have about 520.000 docs per core in my test 

Re: Delayed/waiting requests

2019-01-11 Thread Erick Erickson
Jimi's comment is one of the very common culprits.

Autowarming is another. Are you indexing at the same
time? If so it could well be  you aren't autowarming and
the spikes are caused by using a new IndexSearcher
that has to read much of the index off disk when commits
happen. The "smoking gun" here would be if the spikes
correlate to your commits (soft or hard-with-opensearcher-true).

Best,
Erick
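One way to test the "smoking gun" theory is to pull the long pauses out of the GC log and compare their timestamps with the latency spikes and commit times. A toy scanner for Java 8-style lines ending in "real=X secs" (the exact log format depends on your GC flags, so treat the pattern as a starting point):

```python
import re

# Matches the wall-clock time in lines like "[Times: user=0.10 sys=0.00, real=3.41 secs]"
PAUSE = re.compile(r"real=(\d+\.\d+) secs")

def long_pauses(gc_log_lines, threshold=1.0):
    """Yield (line_no, seconds) for GC events whose wall-clock time exceeds threshold."""
    for i, line in enumerate(gc_log_lines, 1):
        m = PAUSE.search(line)
        if m and float(m.group(1)) >= threshold:
            yield i, float(m.group(1))
```

If the multi-second pauses it reports line up with the request pile-ups, GC tuning is the next step; if they line up with commits instead, autowarming is the more likely culprit.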

On Fri, Jan 11, 2019 at 1:23 AM Gael Jourdan-Weil
 wrote:
>
> Interesting indeed. We did not see anything with VisualVM, but having a look
> at the GC logs could give us more info, especially on the pauses.
>
> I will collect data over the week-end and look at it.
>
>
> Thanks
>
> 
> From: Hullegård, Jimi
> Sent: Friday, 11 January 2019 03:46:02
> To: solr-user@lucene.apache.org
> Subject: Re: Delayed/waiting requests
>
> Could be caused by garbage collection in the jvm.
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> Go down to the segment called “GC pause problems”
>
> /Jimi
>
> Sent from my iPhone
>
> On 11 Jan 2019, at 05:05, Gael Jourdan-Weil
> <gael.jourdan-w...@kelkoogroup.com> wrote:
>
> Hello,
>
> We are experiencing some performance issues on a simple SolrCloud cluster of 
> 3 replicas (1 core) but what we found during our analysis seems a bit odd, so 
> we thought the community could have relevant ideas on this.
>
> Load: between 30 and 40 queries per second, constant over time of analysis
>
> Symptoms: high response times over short periods of time, but quite frequently.
> We are talking about request response times going from 50ms to 5000ms or even
> worse during less than 5 seconds, and then going back to normal.
>
> What we found out: just before the response time increase, requests seem to
> be delayed.
> That is, during 2-3 seconds, requests pile up, no response is sent, and then
> all requests are resolved and responses are all returned to the clients at
> the same time.
> Very much like if there was a lock happening somewhere. But we found no
> "lock" time at either the JVM or system level.
>
> Can anyone think of something in the way Solr works that could explain this?
> Or ideas to track down the root cause?
>
> Solr version is 7.2.1.
>
> Thanks for reading,
>
> Gaël Jourdan-Weil
>
> Svenskt Näringsliv processes your personal data in accordance with the GDPR.
> Here you can read more about our processing and your rights:
> Integritetspolicy


Re: Schema.xml, copyField, Slash, ignoreCase ?

2019-01-11 Thread Erick Erickson
The admin UI>>(select a core)>>analysis page is your friend here. It'll
show you exactly what each filter in your analysis chain does and from
there you'll need to mix and match filters, your tokenizer and the like
to support the use-cases you need.

My guess is that the field type you're using contains
WordDelimiterFilterFactory, which is splitting on the slash.
Similarly for your airbag/airbags problem: probably you have
one of the stemmers in your analysis chain.

See "Filter Descriptions" in your version of the ref guide.

And one caution: The admin>>core>>analysis chain
shows you what happens _after_ query parsing. So if
you enter (without quotes) "bing bong" those tokens
will be shown. What fools people is that the query _parser_
gets in there first, so they'll then wonder why
field:bing bong
doesn't work. It's because the parser made it into
field:bing default_field:bong. So you'll still (potentially)
have to quote or escape some terms on input, it depends
on the query parser you're using.

Best,
Erick

On Fri, Jan 11, 2019 at 1:40 AM Bruno Mannina
 wrote:
>
> Hello,
>
>
>
> I’m facing a problem concerning the default field “text” (SOLR 5.4) and
> queries which contains / (slash)
>
>
>
> I need to have default “text” field with:
>
> - ignoreCase,
>
> - no auto truncation,
>
> - process slash char
>
>
>
> I would like to perform only query on the field “text”
>
> Queries can contain:  code or keywords or both.
>
>
>
> I have 2 fields named symbol and title, and 1 alias ti (old field that I
> can’t delete or modify)
>
>
>
> * Symbol contains code with slash (i.e A62C21/02)
>
>  required="true" stored="true"/>
>
>
>
> * Title contains English text and also symbol
>
>  stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
>
>
>
> { "symbol": "B65D81/20",
>
> "title": [
>
>  "under vacuum or superatmospheric pressure, or in a special atmosphere,
> e.g. of inert gas  {(B65D81/28  takes precedence; containers with
> pressurising means for maintaining ball pressure A63B39/025)} "
>
> ]}
>
>
>
> * Ti is an alias of title
>
>  stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
>
>
>
> * Text is
>
>  multiValued="true"/>
>
>
>
> - Alias are:
>
>
>
> 
>
> 
>
> 
>
> 
>
>
>
>
>
> If I do these queries :
>
>
>
> * ti:airbag  → it’s ok
>
> * title:airbag  → not good for me because it found airbags
>
> * ti:b65D81/28  → not good, debug shows ti:b65d81 OR ti:28
>
> * ti:”b65D81/28”  → it’s ok
>
> * symbol:b65D81/28  → it’s ok (even without “ ”)
>
>
>
> NOW with “text” field
>
> * b65D81/28  → not good, debug shows text:b65d81 OR text:28
>
> * airbag  → it’s ok
>
> * “b65D81/28”  → it’s ok
>
>
>
> It will be great if I can enter symbol without “ “
>
>
>
> Could you help me to have a text field which solve this problem ? (please
> find below all def of my fields)
>
>
>
> Many thanks for your help.
>
>
>
> String_ci is my own definition
>
>
>
>  sortMissingLast="true" omitNorms="true">
>
> 
>
>   
>
>   
>
> 
>
> 
>
>
>
>  positionIncrementGap="100" multiValued="true">
>
>   
>
> 
>
>  words="stopwords.txt" />
>
> 
>
>   
>
>   
>
> 
>
>  words="stopwords.txt" />
>
>  ignoreCase="true" expand="true"/>
>
> 
>
>   
>
> 
>
>
>
>  positionIncrementGap="100">
>
>   
>
> 
>
>  words="lang/stopwords_en.txt"/>
>
> 
>
> 
>
>  protected="protwords.txt"/>
>
> 
>
>   
>
>   
>
> 
>
>  ignoreCase="true" expand="true"/>
>
>  words="lang/stopwords_en.txt"/>
>
> 
>
> 
>
> protected="protwords.txt"/>
>
> 
>
>   
>
> 
>
>
>
>
>
> Best Regards
>
> Bruno
>
>
>
>
>
> ---
> This email has been checked for viruses by Avast antivirus software.
> https://www.avast.com/antivirus


Re: Single query to get the count for all individual collections

2019-01-11 Thread Zheng Lin Edwin Yeo
Thanks for the reply.

I have tried adding a new field that contains the collection id, and
used a JSON facet query to get the count. This is working.
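For anyone following along, a minimal sketch of the JSON Facet request body this describes; the field name `collection_id` and the endpoint URL are assumptions, not taken from Edwin's setup:

```python
import json

def per_collection_counts_request(field="collection_id"):
    """Build a JSON Facet API body that buckets documents by the
    collection-id field, yielding one count per collection."""
    return json.dumps({
        "query": "*:*",
        "limit": 0,  # only the facet counts are wanted, not documents
        "facet": {
            "collections": {
                "type": "terms",
                "field": field,
            }
        }
    })

# The body would be POSTed to e.g.
# http://localhost:8983/solr/collection1/select  (URL assumed)
print(per_collection_counts_request())
```

Each bucket in the `collections` facet of the response then carries the per-collection count.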

Regards,
Edwin

On Thu, 10 Jan 2019 at 23:33, Hullegård, Jimi <
jimi.hulleg...@svensktnaringsliv.se> wrote:

> Unless someone else has a cleverer solution, maybe one option could be to
> add a new field that simply contains the collection id. Then you could do a
> facet query on that field to get the count per collection.
>
> /Jimi
>
> -----Original Message-----
> From: Zheng Lin Edwin Yeo 
> Sent: 10 January 2019 10:41
> To: solr-user@lucene.apache.org
> Subject: Single query to get the count for all individual collections
>
> Hi,
>
> I would like to find out, is there any way that I can send a single query
> to retrieve the numFound for all the individual collections?
>
> I have tried with this query
>
> http://localhost:8983/solr/collection1/select?q=*:*&collection=collection1,collection2
> However, this query is doing the sum across all the collections, instead of
> showing the count for each collection.
>
> I am using Solr 7.5.0.
>
> Regards,
> Edwin
> Svenskt Näringsliv processes your personal data in accordance with the GDPR.
> You can read more about our processing and your rights in our privacy policy:
> https://www.svensktnaringsliv.se/dataskydd/integritet-och-behandling-av-personuppgifter_697219.html?utm_source=sn-email_medium=email
> >
>


Re: Solr relevancy score different on replicated nodes

2019-01-11 Thread Erick Erickson
What Elizabeth said.

Really, this is an intractable problem. Even in the TLOG
and PULL replica case, an index getting updates will
still fire replication requests at different wall-clock
times. Even if that were coordinated, the vagaries of
networks etc. would _still_ mean the various replicas
would see slightly different "snapshots" of the index.
True, the window would be smaller.

The only situations I've seen where the scores on different
replicas are always identical are when the index is optimized
(which isn't recommended unless you can do it all the time),
or when TLOG and PULL replicas are used and
the index is not undergoing continuous updates.

As for locking subsequent requests to a set of nodes, the
idea has been bandied about but usually falls down when
it's realized that this has the potential to unevenly distribute
the load.

Best,
Erick

On Fri, Jan 11, 2019 at 3:13 AM Elizabeth Haubert
 wrote:
>
> Hello,
>
> To a certain extent, I agree with Erick that this isn't a problem, but
> looks like one.  The nature of TF*IDF is such that you will see different
> scores for the same query over time on the same replica, or different
> replicas for the same query with most replication schemes. This is mildly
> annoying when the score is displayed to the user, although I have found
> most end users do not pay that much attention to the floating point score.
> Testers do.  On a small index with high write/delete traffic and homogeneous
> docs, I've seen it cause document re-orderings when the same query is
> repeated and sent to different replicas such as for paging, and that is
> noticeable to end users.
>
> How big is your index, and how different are the percentages you are
> seeing?  This is a much more pronounced problem on smaller indices; it is
> possible this is a problem with your test setup, but not production.
>
> Your solution of directing users to a consistent replica will solve the
> change in values over a session-sized window of time.   With a single
> shard, you could use a Master/Slave setup, direct queries at a given
> slave.  This has a number of operational consequences though, as it means
> you will lose the benefits of SolrCloud.
>
> Mikhail's suggestion to use ExactStats would be cleaner:
> https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_
>
>
> Elizabeth
>
> On Fri, Jan 11, 2019 at 3:56 AM Ashish Bisht 
> wrote:
>
> > Hi Erick,
> >
> > Your statement "*At best, I've seen UIs where they display, say, 1 to 5
> > stars that are just showing the percentile that the particular doc had
> > _relative to the max score*" is something we are trying to achieve, but we
> > are dealing in percentages rather than stars (ratings).
> >
> > The change in maxScore per node is messing this up.
> >
> > I was wondering if it is possible to make one complete request (for a term)
> > go through one replica, i.e., tell the client which replica served the
> > first request, so that subsequent paginated requests go through that
> > replica until the keyword changes. Do you think this is possible, or a good
> > idea? If yes, is there a way in Solr to know which replica served a request?
> >
> > Regards
> > Ashish
> >
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> >


Need help on Solr authorization

2019-01-11 Thread sathish kumar
Hi,

We have a two-node Solr setup (version 7.2.1) with an embedded ZooKeeper
running on Solr server 1.

We have recently enabled SSL, basic authentication, and the
RuleBasedAuthorizationPlugin.

As part of testing, I created a new user with the admin role and assigned the
permissions "collection-admin-read" & “read” to this role.

When I try to query data from any collection, the system is unable to
talk to the shards on the other server.

I am getting the following error both on the command line and in the Solr
Admin UI.

Can someone help me to identify what configurations I am missing? Let me
know if you need any more info.



Followed this url for SSL setup:
https://lucene.apache.org/solr/guide/7_2/enabling-ssl.html

Command used: curl --cacert solr-ssl.cacert.pem --user solr:SolrRocks
https://solr-node-1:8080/solr//select?q=*:*

Error:

{

  "error":{

"metadata":[

  "error-class","org.apache.solr.common.SolrException",


   
"root-error-class","sun.security.provider.certpath.SunCertPathBuilderException"],

"msg":"Error trying to proxy request for url:
https://solr-node-2:8080/solr/ba_test/select;,

"trace":"org.apache.solr.common.SolrException: Error trying to proxy
request for url: https://solr-node-2:8080/solr/ba_test/select\n\tat
org.apache.solr.servlet.HttpSolrCall.remoteQuery(HttpSolrCall.java:646)\n\tat
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:534)\n\tat
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)\n\tat
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\n\tat
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)\n\tat
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)\n\tat
org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:251)\n\tat
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)\n\tat
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)\n\tat
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\n\tat
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\n\tat
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\n\tat
java.lang.Thread.run(Thread.java:748)\nCaused by:
javax.net.ssl.SSLHandshakeException:
sun.security.validator.ValidatorException: PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target\n\tat
sun.security.ssl.Alerts.getSSLException(Alerts.java:192)\n\tat
sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1959)\n\tat
sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)\n\tat
sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)\n\tat
sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514)\n\tat
sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)\n\tat
sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)\n\tat
sun.security.ssl.Handshaker.process_record(Handshaker.java:961)\n\tat
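The PKIX error above typically means the node doing the proxying does not trust the other node's certificate. A sketch of the relevant solr.in.sh settings (paths and password are placeholders, not taken from this setup); the key point is that each node's truststore must contain the certificate, or issuing CA, of every other node:

```shell
# solr.in.sh -- SSL settings as described in the enabling-ssl guide
SOLR_SSL_KEY_STORE=/opt/solr/etc/solr-ssl.keystore.jks
SOLR_SSL_KEY_STORE_PASSWORD=secret
SOLR_SSL_TRUST_STORE=/opt/solr/etc/solr-ssl.keystore.jks
SOLR_SSL_TRUST_STORE_PASSWORD=secret
# For inter-node (proxied) requests, the trust store must contain the
# certificate (or issuing CA) of every other node, e.g.:
#   keytool -importcert -alias solr-node-2 -file solr-node-2.pem \
#           -keystore solr-ssl.keystore.jks
```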

Re: Solr relevancy score different on replicated nodes

2019-01-11 Thread Elizabeth Haubert
Hello,

To a certain extent, I agree with Erick that this isn't a problem, but
looks like one.  The nature of TF*IDF is such that you will see different
scores for the same query over time on the same replica, or different
replicas for the same query with most replication schemes. This is mildly
annoying when the score is displayed to the user, although I have found
most end users do not pay that much attention to the floating point score.
Testers do.  On a small index with high write/delete traffic and homogeneous
docs, I've seen it cause document re-orderings when the same query is
repeated and sent to different replicas such as for paging, and that is
noticeable to end users.

How big is your index, and how different are the percentages you are
seeing?  This is a much more pronounced problem on smaller indices; it is
possible this is a problem with your test setup, but not production.

Your solution of directing users to a consistent replica will solve the
change in values over a session-sized window of time.   With a single
shard, you could use a Master/Slave setup, direct queries at a given
slave.  This has a number of operational consequences though, as it means
you will lose the benefits of SolrCloud.

Mikhail's suggestion to use ExactStats would be cleaner:
https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_
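For anyone following along, enabling that is a one-line solrconfig.xml change, per the linked guide:

```xml
<!-- solrconfig.xml: use exact global term statistics for distributed IDF,
     at the cost of an extra round-trip per distributed request -->
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
```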


Elizabeth

On Fri, Jan 11, 2019 at 3:56 AM Ashish Bisht 
wrote:

> Hi Erick,
>
> Your statement "*At best, I've seen UIs where they display, say, 1 to 5
> stars that are just showing the percentile that the particular doc had
> _relative to the max score*" is something we are trying to achieve, but we
> are dealing in percentages rather than stars (ratings).
>
> The change in maxScore per node is messing this up.
>
> I was wondering if it is possible to make one complete request (for a term)
> go through one replica, i.e., tell the client which replica served the
> first request, so that subsequent paginated requests go through that
> replica until the keyword changes. Do you think this is possible, or a good
> idea? If yes, is there a way in Solr to know which replica served a request?
>
> Regards
> Ashish
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Bugs with Re-ranking/LtR and ExplainAugmenterFactory

2019-01-11 Thread Sambhav Kothari (BLOOMBERG/ LONDON)
Hello,

Currently, if we use the ExplainAugmenterFactory with LtR, it uses the default
query explain (the tf-idf explanation) instead of the model/re-ranker's explain
method. This happens because the BasicResultContext doesn't wrap the query
(https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/java/org/apache/solr/response/BasicResultContext.java#L67)
with the RankQuery when it is set as the context's query, which is then used by
the ExplainAugmenterFactory
(https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/java/org/apache/solr/response/transform/ExplainAugmenterFactory.java#L111).

As a result, there are discrepancies between queries like:


http://localhost:8983/solr/collection1/select?q=*:*&collection=collectionName&wt=json&fl=[explain style=nl],score&rq={!ltr model=linear-model}

http://localhost:8983/solr/collection1/select?q=*:*&collection=collectionName&wt=json&fl=score&rq={!ltr model=linear-model}&debugQuery=true

The former outputs the SimilarityScorer's explain, while the latter uses the
correct LtR ModelScorer's explain.

There are a few other problems with the explain augmenter; for example, it
doesn't work with grouping (although other doc transformers, like LtR's
LTRFeatureLoggerTransformerFactory, do work with grouping).

Just wanted to discuss these issues before creating tickets on Jira.

Thanks,
Sam

Schema.xml, copyField, Slash, ignoreCase ?

2019-01-11 Thread Bruno Mannina
Hello,



I’m facing a problem concerning the default field “text” (Solr 5.4) and
queries which contain a / (slash).



I need to have default “text” field with:

- ignoreCase,

- no auto truncation,

- process slash char



I would like to perform only query on the field “text”

Queries can contain:  code or keywords or both.



I have 2 fields named symbol and title, and 1 alias ti (old field that I
can’t delete or modify)



* Symbol contains code with slash (i.e A62C21/02)





* Title contains English text and also symbol





{ "symbol": "B65D81/20",

"title": [

 "under vacuum or superatmospheric pressure, or in a special atmosphere,
e.g. of inert gas  {(B65D81/28  takes precedence; containers with
pressurising means for maintaining ball pressure A63B39/025)} "

]}



* Ti is an alias of title





* Text is





- Alias are:















If I do these queries :



* ti:airbag  → it’s ok

* title:airbag  → not good for me because it found
airbags

* ti:b65D81/28  → not good, debug shows ti:b65d81 OR ti:28

* ti:”b65D81/28”  → it’s ok

* symbol:b65D81/28  → it’s ok (even without “ “)



NOW with “text” field

* b65D81/28  → not good, debug shows text:b65d81 OR
text:28

* airbag  → it’s ok

* “b65D81/28”  → it’s ok
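Since the quoted form works, the workaround can be automated on the client side until the schema is fixed; a sketch, where the code-detection pattern is an assumption based on the symbols shown here:

```python
import re

# Symbols like B65D81/28: letters/digits, a slash, then letters/digits.
CODE_PATTERN = re.compile(r'^[A-Za-z0-9]+/[A-Za-z0-9]+$')

def quote_codes(query):
    """Wrap slash-containing code terms in double quotes so the default
    'text' field sees a phrase query instead of b65d81 OR 28."""
    terms = []
    for term in query.split():
        if CODE_PATTERN.match(term) and not term.startswith('"'):
            term = '"%s"' % term
        terms.append(term)
    return " ".join(terms)

print(quote_codes("airbag b65D81/28"))  # -> airbag "b65D81/28"
```

Keyword terms pass through untouched, so mixed code-and-keyword queries keep working.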



It would be great if I could enter a symbol without quotes (“ ”).



Could you help me to have a text field which solves this problem? (Please
find below the definitions of all my fields.)



Many thanks for your help.



String_ci is my own definition







  

  









  







  

  









  







  













  

  











   



  







Best Regards

Bruno







RE: Delayed/waiting requests

2019-01-11 Thread Gael Jourdan-Weil
Interesting indeed; we did not see anything with VisualVM, but having a look at
the GC logs could give us more info, especially on the pauses.

I will collect data over the week-end and look at it.


Thanks
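As a starting point for that analysis, stop-the-world pauses can be pulled out of the GC log with a few lines of script; a sketch assuming the classic Java 8 "Total time for which application threads were stopped" log line format (the sample lines are illustrative, not from this cluster):

```python
import re

STOP_LINE = re.compile(
    r"Total time for which application threads were stopped: "
    r"([0-9.]+) seconds")

def long_pauses(gc_log_lines, threshold=1.0):
    """Return stop-the-world pause durations (seconds) above threshold."""
    pauses = []
    for line in gc_log_lines:
        m = STOP_LINE.search(line)
        if m:
            secs = float(m.group(1))
            if secs >= threshold:
                pauses.append(secs)
    return pauses

sample = [
    "2019-01-11T10:00:01: Total time for which application threads were stopped: 0.0312 seconds",
    "2019-01-11T10:00:04: Total time for which application threads were stopped: 4.8120 seconds",
]
print(long_pauses(sample))  # -> [4.812]
```

Pauses of a few seconds lining up with the request pile-ups would point squarely at GC.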


De : Hullegård, Jimi 
Envoyé : vendredi 11 janvier 2019 03:46:02
À : solr-user@lucene.apache.org
Objet : Re: Delayed/waiting requests

Could be caused by garbage collection in the jvm.

https://wiki.apache.org/solr/SolrPerformanceProblems

Go down to the segment called “GC pause problems”

/Jimi

Sent from my iPhone

On 11 Jan 2019, at 05:05, Gael Jourdan-Weil 
mailto:gael.jourdan-w...@kelkoogroup.com>> 
wrote:

Hello,

We are experiencing some performance issues on a simple SolrCloud cluster of 3
replicas (1 core), but what we found during our analysis seems a bit odd, so we
thought the community might have relevant ideas on this.

Load: between 30 and 40 queries per second, constant over the time of analysis.

Symptoms: high response times over short periods of time, but quite frequently.
We are talking about request response times going from 50ms to 5000ms or even
worse for less than 5 seconds, and then going back to normal.

What we found out: just before the response times increase, requests seem to be
delayed. That is, for 2-3 seconds requests pile up and no responses are sent;
then all requests are resolved and responses are all returned to the clients at
the same time. Very much as if there were a lock somewhere. But we found no
"lock" time, neither at the JVM nor at the system level.

Can someone think of something in the way Solr works that could explain this?
Or ideas to track down the root cause?

Solr version is 7.2.1.

Thanks for reading,

Gaël Jourdan-Weil



Re: REBALANCELEADERS is not reliable

2019-01-11 Thread Bernd Fehling

Hi Erick,
yes, I would be happy to test any patches.

Good news, I got rebalance working.
After running the rebalance about 50 times with the debugger and watching
the behavior of my problem shard and its core_nodes within my test cloud,
I came to the point of failure. I solved it and now it works.

Bad news, rebalance is still not reliable and there are many more
problems and points of failure initiated by rebalanceLeaders, or rather
by re-queueing the watchlist.

How I located _my_ problem:
Test cloud is 5 server (VM), 5 shards, 3 replica per shard, 1 java
instance per server. 3 separate zookeepers.
My problem, shard2 wasn't willing to rebalance to a specific core_node.
core_nodes related (core_node1, core_node2, core_node10).
core_node10 was the preferredLeader.
It was just changing leadership between core_node1 and core_node2,
back and forth, whenever I called rebalanceLeaders.
First step, I stopped the server holding core_node2.
Result, the leadership was staying at core_node1 whenever I called 
rebalanceLeaders.
Second step, from the debugger I _forced_ the system during rebalanceLeaders
to give the leadership to core_node10.
Result, there was no leader anymore for that shard. Yes, it can happen:
you can end up with a shard having no leader but active core_nodes!!!
To fix this I gave preferredLeader to core_node1 and called
rebalanceLeaders.
After that, preferredLeader was set back to core_node10 and I was back
at the point I started; all calls to rebalanceLeaders kept the leader at
core_node1.

From the debug logs I got the hint about PeerSync of cores and IndexFingerprint.
The stats of my problem core_node10 showed that they differ from the leader,
core_node1. And the system notices the difference, starts a PeerSync, and ends
with success. But the PeerSync actually seems to fail, because the stats of
core_node1 and core_node10 still differ afterwards.
Solution, I also stopped the server holding my problem core_node10, wiped all
data directories and started that server again. The core_nodes were rebuilt
from the leader and now they are really in sync.
Calling rebalanceLeaders now succeeded with the preferredLeader.

My guess:
You have to check that the cores participating in leadership election are
_really_ in sync. And this must be done before starting any rebalance.
Sounds ugly... :-(
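A client-side pre-check along those lines could fetch each replica's index version (e.g. via /replication?command=indexversion on each core, an assumption about how to obtain the numbers) and only call REBALANCELEADERS when they agree; the comparison itself is trivial:

```python
def replicas_in_sync(index_versions):
    """index_versions: dict mapping core name -> (indexversion, generation)
    as reported by each replica. Rebalancing is only safe when all
    replicas of the shard report identical values."""
    return len(set(index_versions.values())) <= 1

# Illustrative numbers only, not from the test cloud described above.
shard2 = {
    "core_node1":  (1547200000123, 42),
    "core_node2":  (1547200000123, 42),
    "core_node10": (1547199990001, 41),  # lagging replica
}
print(replicas_in_sync(shard2))  # -> False
```

Gating the rebalance call on this check would at least avoid handing leadership to a replica that is known to be behind.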

Next question, why does PeerSync not report an error?
There is an info about "PeerSync START", "PeerSync Received 0 versions from ...
fingeprint:null" and "PeerSync DONE. sync succeeded", but the cores are not
really in sync.

Another test I did (with my new knowledge about synced cores):
- Removing all preferredLeader properties
- stopping, wiping data directory, starting all server one by one to get
  all cores of all shards in sync
- setting one preferredLeader for each shard but different from the actual 
leader
- calling rebalanceLeaders succeeded only at 2 shards with the first run,
  not for all 5 shards (even with really all cores in sync).
- after calling rebalanceLeaders again the other shards succeeded also.
Result, rebalanceLeaders is still not reliable.

I have to mention that I have about 520,000 docs per core in my test cloud,
and that there might also be a timing issue between calling rebalanceLeaders,
detecting that the core to become leader is not in sync with the actual leader,
and resyncing while waiting for the new leader election.

So far,
Bernd


Am 10.01.19 um 17:02 schrieb Erick Erickson:

Bernd:

Don't feel bad about missing it, I wrote the silly stuff and it took me
some time to remember.

Those are  the rules.

It's always humbling to look back at my own code and say "that
idiot should have put some comments in here..." ;)

yeah, I agree there are a lot of moving parts here. I have a note to
myself to provide better feedback in the response. You're absolutely
right that we fire all these commands and hope they all work.  Just
returning "success" status doesn't guarantee leadership change.

I'll be on another task the rest of this week, but I should be able
to dress things up over the weekend. That'll give you a patch to test
if you're willing.

The actual code changes are pretty minimal, the bulk of the patch
will be the reworked test.

Best,
Erick



Re: Solr relevancy score different on replicated nodes

2019-01-11 Thread Ashish Bisht
Hi Erick,

Your statement "*At best, I've seen UIs where they display, say, 1 to 5
stars that are just showing the percentile that the particular doc had
_relative to the max score*" is something we are trying to achieve, but we
are dealing in percentages rather than stars (ratings).

The change in maxScore per node is messing this up.

I was wondering if it is possible to make one complete request (for a term)
go through one replica, i.e., tell the client which replica served the
first request, so that subsequent paginated requests go through that
replica until the keyword changes. Do you think this is possible, or a good
idea? If yes, is there a way in Solr to know which replica served a request?

Regards
Ashish




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html