Re: Case sensitive synonym value

2017-08-23 Thread Aman Deep Singh
Hi all,
Any update on this issue?
-Aman

On 10-Aug-2017 8:27 AM, "Aman Deep Singh"  wrote:

> Yes,
> ignoreCase is set to true and it is working fine
>
>
> On 10-Aug-2017 12:43 AM, "Erick Erickson"  wrote:
>
> You set ignoreCase="true" though, right?
>
> On Wed, Aug 9, 2017 at 8:39 AM, Aman Deep Singh
>  wrote:
> > Hi Erick,
> > I checked; even before reaching the lowercase factory, the value is
> > already in lowercase.
> >
> > This is the analysis tab result for the query iwatch,
> > where {"iwatch":["iWatch","appleWatch"]} is configured in the managed
> > synonym resource:
> > ST:  iwatch
> > SF:  *applewatch*, *iwatch*
> > PRF: applewatch, iwatch
> > PRF: applewatch, iwatch
> > WDF: applewatch, iwatch
> > LCF: applewatch, iwatch
> >
> > Thanks,
> > Aman Deep Singh
> >
> > On 09-Aug-2017 8:46 PM, "Erick Erickson" 
> wrote:
> >
> >> Admin/analysis is a good place to start figuring this out. For
> >> instance, do you have lowerCaseFilter configured in your analysis
> >> chain somewhere that's doing the conversion?
> >>
> >> Best,
> >> Erick
> >>
> >> On Wed, Aug 9, 2017 at 5:34 AM, Aman Deep Singh
> >>  wrote:
> >> > Hi,
> >> > I'm trying to use ManagedSynonyms with *ignoreCase=true*.
> >> > It works fine for the matching part, but the problem is in the
> >> > synonym values.
> >> > Suppose I have a synonym *iwatch ==> appleWatch,iWatch*.
> >> > If the user query is *iwatch* (in any case), it identifies the synonym
> >> > and replaces the token with *applewatch and iwatch* (in lowercase),
> >> > which I didn't want.
> >> > I need the synonyms to come out with the same case I configured,
> >> > i.e. *appleWatch and iWatch*.
> >> > Any idea on how to do that?
> >> >
> >> > Thanks,
> >> > Aman Deep Singh
> >>
>
>
>
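A possible workaround for the case problem above, assuming the lowercasing
is caused by ignoreCase=true on the managed filter: put LowerCaseFilterFactory
before the synonym filter in the chain, set ignoreCase=false, and register
lowercase keys with case-preserved values. A minimal sketch that registers
such a mapping through the managed-synonyms REST endpoint using plain JDK
HTTP (the core name "mycore" and resource name "english" are hypothetical):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RegisterSynonyms {
    public static void main(String[] args) throws Exception {
        // PUT a lowercase key with case-preserved values to the managed
        // synonym resource; with ignoreCase=false the values are kept as-is.
        URL url = new URL(
            "http://localhost:8983/solr/mycore/schema/analysis/synonyms/english");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("PUT");
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);
        byte[] body = "{\"iwatch\":[\"appleWatch\",\"iWatch\"]}"
            .getBytes(StandardCharsets.UTF_8);
        try (OutputStream os = con.getOutputStream()) {
            os.write(body);
        }
        System.out.println("HTTP " + con.getResponseCode());
    }
}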


Re: Solr uses lots of shared memory!

2017-08-23 Thread Shalin Shekhar Mangar
Very interesting. Do you have many DocValues fields? Have you always
had them, i.e. did you see this problem before you turned on DocValues?
The DocValues fields are in separate files and they will be memory
mapped on demand. One thing you can experiment with is the
preload=true option on the MMapDirectoryFactory, which will mmap all
index files on startup [1]. If, once you do this, you still notice
shared memory growth, then it may be a genuine memory leak that we
should investigate.

[1] - 
http://lucene.apache.org/solr/guide/6_6/datadir-and-directoryfactory-in-solrconfig.html#DataDirandDirectoryFactoryinSolrConfig-SpecifyingtheDirectoryFactoryForYourIndex
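
The setting from [1] goes into solrconfig.xml; a minimal sketch (merge the
preload option into your existing directoryFactory definition):

<directoryFactory name="DirectoryFactory"
                  class="solr.MMapDirectoryFactory">
  <bool name="preload">true</bool>
</directoryFactory>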

On Wed, Aug 23, 2017 at 7:02 PM, Markus Jelsma
 wrote:
> I do not think it is a reporting problem; watching top after a restart
> of some Solr instances, shared memory dropped back to `normal`, around
> 350 MB, which I still think is high, but anyway.
>
> Two hours later, the restarted nodes have slowly increased shared memory
> consumption to about 1500 MB. I don't understand why shared memory usage
> should/would increase slowly over time; it makes little sense to me and I
> cannot remember Solr doing this in the past ten years.
>
> But it seems to correlate to index size on disk: these main text search nodes
> have an index of around 16 GB and up to 3 GB of shared memory after a few
> days. The log nodes have up to 800 MB of index and 320 MB of shared memory;
> the low latency nodes have four different cores that make up just over 100 MB
> of index, and their shared memory consumption is just 22 MB, which seems more
> reasonable for the case of shared memory.
>
> I can also force Solr to 'leak' shared memory just by sending queries to it.
> My freshly restarted local node used 68 MB of shared memory at startup. Two
> minutes and 25,000 queries later it was already at 2748 MB! At first there is
> a very sharp increase to 2000, then it takes almost two minutes more to
> increase to 2748. I can decrease the maximum shared memory usage to 1200 if I
> query (via edismax) only on fields of one language instead of 25 or so. I can
> decrease it even further if I disable highlighting (HUH?) but still query
> on all fields.
>
> * We have tried patching Java's ByteBuffer [1] because it seemed to fit the
> symptoms; it does not fix it.
> * We have also removed all our custom plugins, so it has become a vanilla
> Solr 6.6 with just our stripped-down schema and solrconfig; that does not fix
> it either.
>
> Why does it slowly increase over time?
> Why does it appear to correlate to index size?
> Is anyone else seeing this on their 6.6 cloud production or local machines?
>
> Thanks,
> Markus
>
> [1]: http://www.evanjones.ca/java-bytebuffer-leak.html
>
> -Original message-
>> From:Shawn Heisey 
>> Sent: Tuesday 22nd August 2017 17:32
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr uses lots of shared memory!
>>
>> On 8/22/2017 7:24 AM, Markus Jelsma wrote:
>> > I have never seen this before, one of our collections, all nodes eating 
>> > tons of shared memory!
>> >
>> > Here's one of the nodes:
>> > 10497 solr  20   0 19.439g 4.505g 3.139g S   1.0 57.8   2511:46 java
>> >
>> > RSS is roughly equal to heap size + usual off-heap space + shared memory.
>> > Virtual is roughly equal to RSS plus index size on disk. For two other
>> > collections, the nodes use shared memory as expected, in the MB range.
>> >
>> > How can Solr, this collection, use so much shared memory? Why?
>>
>> I've seen this on my own servers at work, and when I add up a subset of
>> the memory numbers I can see from the system, it ends up being more
>> memory than I even have in the server.
>>
>> I suspect there is something odd going on in how Java reports memory
>> usage to the OS, or maybe a glitch in how Linux interprets Java's memory
>> usage.  At some point in the past, numbers were reported correctly.  I
>> do not know if the change came about because of a Solr upgrade, because
>> of a Java upgrade, or because of an OS kernel upgrade.  All three were
>> upgraded between when I know the numbers looked right and when I noticed
>> they were wrong.
>>
>> https://www.dropbox.com/s/91uqlrnfghr2heo/solr-memory-sorted-top.png?dl=0
>>
>> This screenshot shows that Solr is using 17GB of memory, 41.45GB of
>> memory is being used by the OS disk cache, and 10.23GB of memory is
>> free.  Add those up, and it comes to 68.68GB ... but the machine only
>> has 64GB of memory, and that total doesn't include the memory usage of
>> the other processes seen in the screenshot.  This impossible situation
>> means that something is being misreported somewhere.  If I deduct that
>> 11GB of SHR from the RES value, then all the numbers work.
>>
>> The screenshot was almost 3 years ago, so I do not know what machine it
>> came from, and therefore I can't be sure what the actual heap size was.
>> I think it was about 6GB -- the difference between RES and SHR.  I have

Re: Solr uses lots of shared memory!

2017-08-23 Thread Erick Erickson
I suspect you've already seen this, but top and similar can be
confusing when trying to interpret MMapDirectory. Uwe has an excellent
explication:

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Best,
Erick

On Wed, Aug 23, 2017 at 9:10 AM, Markus Jelsma
 wrote:
> I have the problem in production and locally, with default Solr 6.6 JVM
> arguments; the environments are:
>
> openjdk version "1.8.0_141"
> OpenJDK Runtime Environment (build 1.8.0_141-8u141-b15-1~deb9u1-b15)
> OpenJDK 64-Bit Server VM (build 25.141-b15, mixed mode)
> Linux idx1 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u1 (2017-06-18) x86_64 
> GNU/Linux
>
> and
>
> openjdk version "1.8.0_131"
> OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11)
> OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
> Linux midas 4.10.0-32-generic #36-Ubuntu SMP Tue Aug 8 12:10:06 UTC 2017 
> x86_64 x86_64 x86_64 GNU/Linux
>
> Regarding the node that shows the problem, can you reproduce it locally? Fire
> it up, put some data in, confirm low shared memory usage, and execute a few
> thousand queries against it? We immediately see a sharp rise in shared
> memory, MBs per second, until it reaches some sort of plateau.
>
> -Original message-
>> From:Shawn Heisey 
>> Sent: Wednesday 23rd August 2017 16:37
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr uses lots of shared memory!
>>
>> On 8/23/2017 7:32 AM, Markus Jelsma wrote:
>> > Why does it slowly increase over time?
>> > Why does it appear to correlate to index size?
>> > Is anyone else seeing this on their 6.6 cloud production or local machines?
>>
>> More detailed information included here.  My 6.6 dev install is NOT
>> having the problem, but a much older version IS.
>>
>> I grabbed this screenshot only moments ago from a production server
>> which is exhibiting a large SHR value for the Solr process:
>>
>> https://www.dropbox.com/s/q79lo2gft9es06u/idxa1-top-big-shr.png?dl=0
>>
>> This is Solr 4.7.2, with a 10 month uptime for the Solr process, running
>> with these arguments:
>>
>> -DSTOP.KEY=REDACTED
>> -DSTOP.PORT=8078
>> -Djetty.port=8981
>> -Dsolr.solr.home=/index/solr4
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.port=8686
>> -Dcom.sun.management.jmxremote
>> -XX:+PrintReferenceGC
>> -XX:+PrintAdaptiveSizePolicy
>> -XX:+PrintGCDetails
>> -XX:+PrintGCDateStamps
>> -Xloggc:logs/gc.log-verbose:gc
>> -XX:+AggressiveOpts
>> -XX:+UseLargePages
>> -XX:InitiatingHeapOccupancyPercent=75
>> -XX:MaxGCPauseMillis=250
>> -XX:G1HeapRegionSize=8m
>> -XX:+ParallelRefProcEnabled
>> -XX:+PerfDisableSharedMem
>> -XX:+UseG1GC
>> -Dlog4j.configuration=file:etc/log4j.properties
>> -Xmx8192M
>> -Xms4096M
>>
>> The OS is CentOS 6, with the following Java and kernel:
>>
>> java version "1.7.0_72"
>> Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
>> Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
>>
>> Linux idxa1 2.6.32-431.11.2.el6.centos.plus.x86_64 #1 SMP Tue Mar 25
>> 21:36:54 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>>
>> I also just grabbed a screenshot from my dev server, running Ubuntu 14,
>> Solr 6.6.0, a LOT more index data, and a more recent Java version.  Solr
>> has an uptime of about one month.  This server was installed with the
>> service installer script, so it uses bin/solr.  It doesn't seem to have
>> the same problem:
>>
>> https://www.dropbox.com/s/85h1weuopa643za/bigindy5-top-small-shr.png?dl=0
>>
>> java version "1.8.0_144"
>> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>>
>> Linux bigindy5 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC
>> 2016 x86_64 x86_64 x86_64 GNU/Linux
>>
>> The arguments for this one are very similar to the production server:
>>
>> -DSTOP.KEY=solrrocks
>> -DSTOP.PORT=7982
>> -Dcom.sun.management.jmxremote
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Dcom.sun.management.jmxremote.local.only=false
>> -Dcom.sun.management.jmxremote.port=18982
>> -Dcom.sun.management.jmxremote.rmi.port=18982
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Djetty.home=/opt/solr6/server
>> -Djetty.port=8982
>> -Dlog4j.configuration=file:/index/solr6/log4j.properties
>> -Dsolr.install.dir=/opt/solr6
>> -Dsolr.log.dir=/index/solr6/logs
>> -Dsolr.log.muteconsole
>> -Dsolr.solr.home=/index/solr6/data
>> -Duser.timezone=UTC
>> -XX:+AggressiveOpts
>> -XX:+ParallelRefProcEnabled
>> -XX:+PrintGCApplicationStoppedTime
>> -XX:+PrintGCDateStamps
>> -XX:+PrintGCDetails
>> -XX:+PrintGCTimeStamps
>> -XX:+PrintHeapAtGC
>> -XX:+PrintTenuringDistribution
>> -XX:+UseG1GC
>> -XX:+UseGCLogFileRotation
>> -XX:+UseLargePages
>> -XX:G1HeapRegionSize=8m
>> -XX:GCLogFileSize=20M
>> -XX:InitiatingHeapOccupancyPercent=75
>> -XX:MaxGCPauseMillis=250
>> -XX:NumberOfGCLogFiles=9
>> 

Re: Facet on a Payload field type?

2017-08-23 Thread Chris Hostetter

: The payload idea was from my boss, it's similar to how they did this in
: Endeca.
...
: My alternate idea is to have sets of facet fields for different languages,
: then let our service layer determine the correct one for the user's
: language, but I'm curious as to how others have solved this.

Let's back up for a minute -- can you please explain your ultimate goal, 
from a "solr client application" perspective? (assuming we have no 
knowledge of how/why you might have used Endeca in the past)

What is it you want your application to be able to do when indexing docs 
to solr and making queries to solr?  Give us some real-world examples.



(If i had to guess: i gather maybe you're just dealing with a "keywords" 
type field that you want to facet on -- and maybe you could use a diff 
field for each language, or encode the language as a prefix on each term 
and use facet.prefix to restrict the facet constraints returned)
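
A sketch of that prefix idea (the field name and values are hypothetical):
index each keyword once per language into a single facet field, with the
language code as a term prefix, e.g. en|red and fr|rouge, and then request
only one language's constraints:

facet=true&facet.field=keywords_facet&facet.prefix=en|

The display layer then strips the "en|" prefix before showing the label, and
adds it back when building the filter query.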



https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341



: 
: On Wed, Aug 23, 2017 at 2:10 PM, Markus Jelsma 
: wrote:
: 
: > Technically they could; faceting is possible on TextField, but it would
: > be useless for faceting. Payloads are only used for scoring via a custom
: > Similarity. Payloads also can only contain one byte of information (or was
: > it 64 bits?)
: >
: > Payloads are not something you want to use when dealing with translations.
: > We handle facet constraint (and facet field) translations just by mapping
: > the internal value to a translated value when displaying facets, and vice
: > versa when filtering.
: >
: > -Original message-
: > > From:Webster Homer 
: > > Sent: Wednesday 23rd August 2017 20:22
: > > To: solr-user@lucene.apache.org
: > > Subject: Facet on a Payload field type?
: > >
: > > Is it possible to facet on a payload field type?
: > >
: > > We are moving from Endeca to Solr. We have a number of Endeca facets
: > > where we have hacked in multilanguage support. The multiple languages
: > > are really just for displaying the value of a term; internally, the
: > > value used to search is in English. The problem is that we don't have
: > > translations for most of our facet data, and this was a way to support
: > > multiple languages with the data we have.
: > >
: > > Looking at the SolrJ FacetField class, I cannot tell whether the value
: > > can contain a payload or not.
: > >
: >
: 

Re: Machine Learning for search

2017-08-23 Thread Joe Obernberger
Thank you Joel.  I'm really having a good time with the machine learning 
component in Solr.  In this case, the weather model was built by 
classifying tweets as positive or negative.  I started by searching for 
tweets with terms like tornado, storm, forecast, typhoon, hurricane, 
blizzard, snow, lightning, flood warning, etc., and making those 
positive.  Then I grabbed some random tweets about Trump, ISIS, 
Kardashian, etc. to make negative tweets.  At that point I started to 
classify data and refine the model (adding more positives/negatives) as 
more data came into the system.


I hope that helps.  The model works very well at this point with just 
650 tweets manually classified (about evenly split between pos/neg) and 
using 150 terms.


I like your idea about using the model to re-rank the top n search 
results.  That said, the results can be significantly 'better' if I 
classify more data and reorder based on high probability scores, but as 
you pointed out, at the cost of much slower searches.  In some cases, I 
would suspect a user may want to search just with a model and without 
any search terms, but in those cases it may be best to classify data as 
it comes in.  I guess it's a toss-up between what is more important: 
high probability from the classifier vs. high rank from the search engine.

Thanks Joel.

-Joe
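
A sketch of that re-ranking shape applied to the expression quoted below: cap
the candidate set at the search level and keep only the docs the classifier
scores highest (the rows/n values are illustrative):

top(n=100,
    classify(model(models, id="WeatherModel", cacheMillis=5000),
             search(COL1, df="FULL_DOCUMENT",
                    q="Hawaii AND DocTimestamp:[2017-07-23T04:00:00Z TO 2017-08-23T03:59:00Z]",
                    fl="ClusterText,id", sort="id asc", rows="1000"),
             field="ClusterText"),
    sort="probability_d desc")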


On 8/23/2017 3:08 PM, Joel Bernstein wrote:

Can you describe the weather model?

In general the idea is to rerank the top N docs, because it will be too
slow to classify the whole result set.

In this scenario the search engine ranking will already be returning
relevant candidate documents and the model is only used to get a better
ordering of the top docs.



Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Aug 22, 2017 at 12:32 PM, Joe Obernberger <
joseph.obernber...@gmail.com> wrote:


> Hi All - One of the really neat features of Solr 6 is the ability to
create machine learning models (information gain) and then use those models
as a query.  If I want a user to be able to execute a query for the text
Hawaii and use a machine learning model related to weather data, how can I
correctly rank the results?  It looks like I would need to classify all the
documents in some date range (assuming the query is date restricted), look
at the probability_d and pick the top n documents.  Is there a better way
to do this?

I'm using a stream like this:
classify(model(models,id="WeatherModel",cacheMillis=5000),
search(COL1,df="FULL_DOCUMENT",q="Hawaii AND
DocTimestamp:[2017-07-23T04:00:00Z TO 
2017-08-23T03:59:00Z]",fl="ClusterText,id",sort="id
asc",rows="1"),field="ClusterText")

> This sends the query to all the shards, each of which can return at most 10,000 docs.

Thanks!

-Joe









Solr deltaImportQuery ID configuration

2017-08-23 Thread Liu, Daphne
Hello,
   I am using Solr 6.3.0. Does anyone know, when referencing the id in 
deltaImportQuery, whether I should use '${dih.delta.id}' or 
'${dataimporter.delta.id}'?
   Both are mentioned in the Delta-Import wiki. I am confused. Thank you.
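
For what it's worth, 'dih' appears to be simply a short alias for the
'dataimporter' namespace, so both forms should resolve to the same value. A
typical delta pair looks like this (table and column names hypothetical):

<entity name="item" pk="ID"
        query="SELECT * FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE last_modified > '${dih.last_index_time}'"
        deltaImportQuery="SELECT * FROM item WHERE id='${dih.delta.id}'"/>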

Kind regards,

Daphne Liu
BI Architect - Matrix SCM

CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 
USA / www.cevalogistics.com T 904.564.1192 / F 
904.928.1448 / daphne@cevalogistics.com





RE: Facet on a Payload field type?

2017-08-23 Thread Markus Jelsma
Yes, let the 'service layer' solve this; in our case our JS interface does this 
job. Internal (in Lucene) values for the MIME-type field are e.g. application/pdf 
and text/html. In the interface layer, which is JS, those values are mapped to 
'Web page'/'Webpagina' or 'PDF document'/'PDF-document', and so on for other 
languages. The same is true for our language selection facet: internally the 
values are ISO codes (en, de, fr, nl, etc.) but they are mapped to any format 
desired by the customer, usually the name of that very language in the language 
itself.

Re: 64-bit payloads, thanks, that gives me 57 additional bits to encode 
language features into our text!
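
A minimal sketch of that display-time mapping (the locales and labels are
illustrative):

import java.util.HashMap;
import java.util.Map;

public class FacetLabels {

    // locale -> internal facet value -> display label
    private static final Map<String, Map<String, String>> LABELS = new HashMap<>();

    static {
        Map<String, String> en = new HashMap<>();
        en.put("application/pdf", "PDF document");
        en.put("text/html", "Web page");
        Map<String, String> nl = new HashMap<>();
        nl.put("application/pdf", "PDF-document");
        nl.put("text/html", "Webpagina");
        LABELS.put("en", en);
        LABELS.put("nl", nl);
    }

    // Map the internal value Solr returns to the user's language for display.
    // Filtering still uses the internal value, so fq stays language-independent.
    public static String display(String locale, String internalValue) {
        Map<String, String> m = LABELS.get(locale);
        return m == null ? internalValue : m.getOrDefault(internalValue, internalValue);
    }
}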

 
-Original message-
> From:Webster Homer 
> Sent: Wednesday 23rd August 2017 22:21
> To: solr-user@lucene.apache.org
> Subject: Re: Facet on a Payload field type?
> 
> The payload idea was from my boss, it's similar to how they did this in
> Endeca.
> I'm not sure I follow your idea about "mapping internal value to translated
> value". Would you care to elaborate?
> My alternate idea is to have sets of facet fields for different languages,
> then let our service layer determine the correct one for the user's
> language, but I'm curious as to how others have solved this.
> 
> On Wed, Aug 23, 2017 at 2:10 PM, Markus Jelsma 
> wrote:
> 
> > Technically they could; faceting is possible on TextField, but it would
> > be useless for faceting. Payloads are only used for scoring via a custom
> > Similarity. Payloads also can only contain one byte of information (or was
> > it 64 bits?)
> >
> > Payloads are not something you want to use when dealing with translations.
> > We handle facet constraint (and facet field) translations just by mapping
> > the internal value to a translated value when displaying facets, and vice
> > versa when filtering.
> >
> > -Original message-
> > > From:Webster Homer 
> > > Sent: Wednesday 23rd August 2017 20:22
> > > To: solr-user@lucene.apache.org
> > > Subject: Facet on a Payload field type?
> > >
> > > Is it possible to facet on a payload field type?
> > >
> > > We are moving from Endeca to Solr. We have a number of Endeca facets
> > > where we have hacked in multilanguage support. The multiple languages
> > > are really just for displaying the value of a term; internally, the
> > > value used to search is in English. The problem is that we don't have
> > > translations for most of our facet data, and this was a way to support
> > > multiple languages with the data we have.
> > >
> > > Looking at the SolrJ FacetField class, I cannot tell whether the value
> > > can contain a payload or not.
> > >
> >
> 


Re: Facet on a Payload field type?

2017-08-23 Thread Webster Homer
The payload idea was from my boss, it's similar to how they did this in
Endeca.
I'm not sure I follow your idea about "mapping internal value to translated
value". Would you care to elaborate?
My alternate idea is to have sets of facet fields for different languages,
then let our service layer determine the correct one for the user's
language, but I'm curious as to how others have solved this.

On Wed, Aug 23, 2017 at 2:10 PM, Markus Jelsma 
wrote:

> Technically they could; faceting is possible on TextField, but it would
> be useless for faceting. Payloads are only used for scoring via a custom
> Similarity. Payloads also can only contain one byte of information (or was
> it 64 bits?)
>
> Payloads are not something you want to use when dealing with translations.
> We handle facet constraint (and facet field) translations just by mapping
> the internal value to a translated value when displaying facets, and vice
> versa when filtering.
>
> -Original message-
> > From:Webster Homer 
> > Sent: Wednesday 23rd August 2017 20:22
> > To: solr-user@lucene.apache.org
> > Subject: Facet on a Payload field type?
> >
> > Is it possible to facet on a payload field type?
> >
> > We are moving from Endeca to Solr. We have a number of Endeca facets
> > where we have hacked in multilanguage support. The multiple languages
> > are really just for displaying the value of a term; internally, the
> > value used to search is in English. The problem is that we don't have
> > translations for most of our facet data, and this was a way to support
> > multiple languages with the data we have.
> >
> > Looking at the SolrJ FacetField class, I cannot tell whether the value
> > can contain a payload or not.
> >
>



Re: Excessive resources consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozen times more resources)

2017-08-23 Thread Daniel Ortega
Hi Scott,

- *Can you describe the process that queries the DB and sends records to
Solr?*

We are enqueueing ids during every ORACLE transaction (on inserts/updates).

An application dequeues every id and performs queries against dozens of
tables in the relational model to retrieve the fields to build the
document.  As we know that we are modifying the same ORACLE row in
different (but consecutive) transactions, we store only the last version of
the modified documents in a map data structure.

The application has a configurable interval at which it sends the documents
stored in the map to the update handler (we have tested different intervals,
from a few milliseconds to several seconds) using the SolrJ client. Currently
we are sending all the documents every 15 seconds (see the sketch after this
message).

This application is developed using Java, Spring and Maven, and we have
several instances.

- *Is it a SolrJ-based application?*

Yes, it is. We aren't using the latest version of the SolrJ client (we are
currently using SolrJ v6.3.0).

- *If it is, which client package are you using?*

I don't know exactly what you mean by 'client package' :)

- *How many documents do you send at once?*

It depends on the interval described above and on the number of
transactions executed in our relational database. From dozens to a few
hundred (and even thousands).

- *Are you sending your indexing or query traffic through a load balancer?*

We aren't using a load balancer for indexing, but all our REST query
services go through an HAProxy (using the 'leastconn' algorithm). The REST
query services perform queries using the CloudSolrClient.

Thanks for your reply;
if you need any further information, don't hesitate to ask.

Daniel
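
A minimal SolrJ sketch of the dequeue/periodic-flush pattern described above
(the ZooKeeper address, collection name and interval are hypothetical; the map
gives last-write-wins de-duplication within each window):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PeriodicIndexer {

    // Last-write-wins: consecutive updates to the same id keep only the newest doc.
    private final Map<String, SolrInputDocument> pending = new ConcurrentHashMap<>();
    private final CloudSolrClient solr;

    public PeriodicIndexer(String zkHost, String collection) {
        this.solr = new CloudSolrClient.Builder().withZkHost(zkHost).build();
        this.solr.setDefaultCollection(collection);
    }

    public void start(long intervalSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(this::flush, intervalSeconds, intervalSeconds,
                TimeUnit.SECONDS);
    }

    public void enqueue(SolrInputDocument doc) {
        pending.put((String) doc.getFieldValue("id"), doc);
    }

    private void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<SolrInputDocument> batch = new ArrayList<>(pending.values());
        try {
            // CloudSolrClient routes each doc to its shard leader using the
            // cluster state from ZooKeeper; commits are left to autoCommit.
            solr.add(batch);
            for (SolrInputDocument doc : batch) {
                // Remove only if the doc was not replaced by a newer version meanwhile.
                pending.remove((String) doc.getFieldValue("id"), doc);
            }
        } catch (Exception e) {
            // Keep the docs in 'pending'; they will be retried on the next tick.
            e.printStackTrace();
        }
    }
}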

2017-08-23 14:57 GMT+02:00 Scott Stults :

> Hi Daniel,
>
> Great background information about your setup! I've got just a few more
> questions:
>
> - Can you describe the process that queries the DB and sends records to
> Solr?
> - Is it a SolrJ-based application?
> - If it is, which client package are you using?
> - How many documents do you send at once?
> - Are you sending your indexing or query traffic through a load balancer?
>
> If you're sending documents to each replica as fast as they can take them,
> you might be seeing a bottleneck at the shard leaders. The SolrJ
> CloudSolrClient finds out from Zookeeper which nodes are the shard leaders
> and sends docs directly to them.
>
>
> -Scott
>
> On Tue, Aug 22, 2017 at 2:16 PM, Daniel Ortega <
> danielortegauf...@gmail.com>
> wrote:
>
> > *Main Problems*
> >
> >
> > We are involved in a migration from Solr Master/Slave infrastructure to
> > SolrCloud infrastructure.
> >
> >
> >
> > The main problems that we have now are:
> >
> >
> >
> >- Excessive resource consumption: Currently we have 5 instances with 80
> >processors/768 GB RAM each (using SSD drives) that don't support the
> >load that we have in the other architecture. In our Master/Slave
> >architecture we have only 7 virtual machines with lower specs (4
> >processors and 16 GB each, using SSD drives too). So, at the moment our
> >SolrCloud infrastructure is wasting several dozen times more resources
> >than our Solr Master/Slave infrastructure.
> >- Despite spending more resources, we have worse query times (compared
> >to Solr in Master/Slave architecture)
> >
> >
> > *Search infrastructure (SolrCloud infrastructure)*
> >
> >
> >
> > As we cannot use the DIH handler (which is what we use in Solr
> > Master/Slave), we have developed an application which reads every
> > transaction from Oracle, builds a document collection by searching in the
> > database, and sends the result to the */update* handler every 200
> > milliseconds using the SolrJ client. This application tries to delete the
> > possible duplicates in each update window, but we are using Solr's
> > de-duplication techniques
> > <https://cwiki.apache.org/confluence/display/solr/De-Duplication> too.
> >
> >
> >
> > We are indexing ~100 documents per second (with peaks of ~1000 documents
> > per second).
> >
> >
> >
> > Every search query is centralized in another application which exposes a
> > DSL behind a REST API and also uses the SolrJ client to perform queries.
> > We have peaks of 2000 QPS.
> >
> > *Cluster structure **(SolrCloud infrastructure)*
> >
> >
> >
> > At the moment, the cluster has 30 SolrCloud instances with the same specs
> > (same physical hosts, same JVM settings, etc.).
> >
> >
> >
> > *Main collection*
> >
> >
> >
> > In our use case we are basically using this collection as a NoSQL
> > database.
> > Our document is composed of about 300 fields that 

Re: Facet on a Payload field type?

2017-08-23 Thread Webster Homer
Certainly more than a byte of information. The most common example is to
have payloads encode floats. So if there is a limit, it's more likely to be
64 bits.

On Wed, Aug 23, 2017 at 2:10 PM, Markus Jelsma 
wrote:

> Technically they could; faceting is possible on TextField, but it would
> be useless for faceting. Payloads are only used for scoring via a custom
> Similarity. Payloads also can only contain one byte of information (or was
> it 64 bits?)
>
> Payloads are not something you want to use when dealing with translations.
> We handle facet constraint (and facet field) translations just by mapping
> the internal value to a translated value when displaying facets, and vice
> versa when filtering.
>
> -Original message-
> > From:Webster Homer 
> > Sent: Wednesday 23rd August 2017 20:22
> > To: solr-user@lucene.apache.org
> > Subject: Facet on a Payload field type?
> >
> > Is it possible to facet on a payload field type?
> >
> > We are moving from Endeca to Solr. We have a number of Endeca facets
> > where we have hacked in multilanguage support. The multiple languages
> > are really just for displaying the value of a term; internally, the
> > value used to search is in English. The problem is that we don't have
> > translations for most of our facet data, and this was a way to support
> > multiple languages with the data we have.
> >
> > Looking at the SolrJ FacetField class, I cannot tell whether the value
> > can contain a payload or not.
> >
>



RE: Facet on a Payload field type?

2017-08-23 Thread Markus Jelsma
Technically they could; faceting is possible on TextField, but it would be 
useless for faceting. Payloads are only used for scoring via a custom 
Similarity. Payloads also can only contain one byte of information (or was it 
64 bits?)

Payloads are not something you want to use when dealing with translations. We 
handle facet constraint (and facet field) translations just by mapping the 
internal value to a translated value when displaying facets, and vice versa 
when filtering.

-Original message-
> From:Webster Homer 
> Sent: Wednesday 23rd August 2017 20:22
> To: solr-user@lucene.apache.org
> Subject: Facet on a Payload field type?
> 
> Is it possible to facet on a payload field type?
> 
> We are moving from Endeca to Solr. We have a number of Endeca facets where
> we have hacked in multilanguage support. The multiple languages are really
> just for displaying the value of a term; internally, the value used to
> search is in English. The problem is that we don't have translations for
> most of our facet data, and this was a way to support multiple languages
> with the data we have.
> 
> Looking at the SolrJ FacetField class, I cannot tell whether the value can
> contain a payload or not.
> 
> 


Re: Machine Learning for search

2017-08-23 Thread Joel Bernstein
Can you describe the weather model?

In general the idea is to rerank the top N docs, because it will be too
slow to classify the whole result set.

In this scenario the search engine ranking will already be returning
relevant candidate documents and the model is only used to get a better
ordering of the top docs.



Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Aug 22, 2017 at 12:32 PM, Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi All - One of the really neat features of Solr 6 is the ability to
> create machine learning models (information gain) and then use those models
> as a query.  If I want a user to be able to execute a query for the text
> Hawaii and use a machine learning model related to weather data, how can I
> correctly rank the results?  It looks like I would need to classify all the
> documents in some date range (assuming the query is date restricted), look
> at the probability_d and pick the top n documents.  Is there a better way
> to do this?
>
> I'm using a stream like this:
> classify(model(models,id="WeatherModel",cacheMillis=5000),
> search(COL1,df="FULL_DOCUMENT",q="Hawaii AND
> DocTimestamp:[2017-07-23T04:00:00Z TO 
> 2017-08-23T03:59:00Z]",fl="ClusterText,id",sort="id
> asc",rows="1"),field="ClusterText")
>
> This sends the query to all the shards, each of which can return at most 10,000 docs.
>
> Thanks!
>
> -Joe
>
>


Download Sunplot for SQL/Streaming expr

2017-08-23 Thread Susheel Kumar
Hello,

From where can we download and set up Sunplot, to use SQL and streaming
expressions?

Thanks,
Susheel


Facet on a Payload field type?

2017-08-23 Thread Webster Homer
Is it possible to facet on a payload field type?

We are moving from Endeca to Solr. We have a number of Endeca facets where
we have hacked in multilanguage support. The multiple languages are really
just for displaying the value of a term; internally, the value used to
search is in English. The problem is that we don't have translations for
most of our facet data, and this was a way to support multiple languages
with the data we have.

Looking at the SolrJ FacetField class, I cannot tell whether the value can
contain a payload or not.



Re: Request to be added to the ContributorsGroup

2017-08-23 Thread Tomas Fernandez Lobbe
I just added you to the wiki. 
Note that the official documentation is now in the "solr-ref-guide" directory 
of the code base, and you can create patches/PRs to it.

Tomás

> On Aug 23, 2017, at 10:58 AM, Kevin Grimes  wrote:
> 
> Hi there,
> 
> I would like to contribute to the Solr wiki. My username is KevinGrimes, and 
> my e-mail is kevingrim...@me.com .
> 
> Thanks,
> Kevin
> 



Request to be added to the ContributorsGroup

2017-08-23 Thread Kevin Grimes
Hi there,

I would like to contribute to the Solr wiki. My username is KevinGrimes, and my 
e-mail is kevingrim...@me.com .

Thanks,
Kevin



Custom StoredFieldVisitor in Solr

2017-08-23 Thread Jamie Johnson
I thought I had asked this previously, but I can't find a reference to it
now.  I am interested in using a custom StoredFieldVisitor in Solr, and
after spelunking through the code for a little while, it seems that there is
no easy extension point that supports doing so.  I am currently on Solr 4.x
(moving forward is a long-term option, but can't be done in the short
term).  The only option I see at this point is forking Solr and
changing the way SolrIndexSearcher currently works to provide another
option to enable my custom StoredFieldVisitor.  While I'd prefer not to do
so, if it is my only option I am OK with it.

Are there any suggestions for how to go about supporting this besides the
above?

Jamie
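
For reference, the Lucene 4.x visitor itself is short to write; the hard part
is, as noted above, getting SolrIndexSearcher to invoke it during stored-field
loading. A minimal sketch of such a visitor (the field whitelist is
hypothetical):

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.StoredFieldVisitor;

// Collects only a whitelisted set of stored fields; everything else is
// skipped at read time instead of being decoded and thrown away.
public class WhitelistFieldVisitor extends StoredFieldVisitor {

    private final Set<String> allowed;
    private final Document doc = new Document();

    public WhitelistFieldVisitor(Set<String> allowed) {
        this.allowed = allowed;
    }

    @Override
    public Status needsField(FieldInfo fieldInfo) throws IOException {
        return allowed.contains(fieldInfo.name) ? Status.YES : Status.NO;
    }

    @Override
    public void stringField(FieldInfo fieldInfo, String value) throws IOException {
        doc.add(new StoredField(fieldInfo.name, value));
    }

    public Document getDocument() {
        return doc;
    }
}

It would be driven with reader.document(docID, visitor); the fork would only
be needed to route it through SolrIndexSearcher's document loading.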


Re: solr jetty based auth and distributed solr requests

2017-08-23 Thread Lars Karlsson
Newest docs:
https://lucene.apache.org/solr/guide/6_6/securing-solr.html


On Wed, 23 Aug 2017 at 19:20, Lars Karlsson 
wrote:

> Such a setup is not supported; it interferes with inter-node communication.
> Start here on how to secure Solr:
>
> https://cwiki.apache.org/confluence/display/solr/Securing+Solr
>
> On Wed, 23 Aug 2017 at 14:41, Scott Stults <
> sstu...@opensourceconnections.com> wrote:
>
>> Radhakrishnan,
>>
>> I'm not sure offhand whether or not that's possible. It sounds like you've
>> done enough analysis to write a good Jira ticket, so if nobody speaks up
>> on
>> the mailing list, go ahead and create one.
>>
>>
>> Cheers,
>> Scott
>>
>> On Tue, Aug 22, 2017 at 7:15 PM, radha krishnan <
>> dradhakrishna...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I enabled Jetty basic auth for Solr by making changes to jetty.xml and
>> > adding a 'realm.properties'.
>> >
>> > While basic queries are working, queries involving more than one shard
>> > are not working. I went through the code and figured out that in
>> > HttpShardHandler, there is no provision to specify a username:password.
>> >
>> > I went through a lot of JIRAs/posts and was not able to figure out
>> > whether it is really possible to do.
>> >
>> > Can we do a distributed operation with Jetty-based basic auth? Can you
>> > please share the relevant links so that I can try it out?
>> >
>> >
>> > Thanks,
>> > Radhakrishnan
>> >
>>
>>
>>
>> --
>> Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
>> | 434.409.2780
>> http://www.opensourceconnections.com
>>
>


Re: solr jetty based auth and distributed solr requests

2017-08-23 Thread Lars Karlsson
Such a setup is not supported; it interferes with inter-node communication.
Start here on how to secure Solr:

https://cwiki.apache.org/confluence/display/solr/Securing+Solr

On Wed, 23 Aug 2017 at 14:41, Scott Stults <
sstu...@opensourceconnections.com> wrote:

> Radhakrishnan,
>
> I'm not sure offhand whether or not that's possible. It sounds like you've
> done enough analysis to write a good Jira ticket, so if nobody speaks up on
> the mailing list, go ahead and create one.
>
>
> Cheers,
> Scott
>
> On Tue, Aug 22, 2017 at 7:15 PM, radha krishnan <
> dradhakrishna...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I enabled Jetty basic auth for Solr by making changes to jetty.xml and
> > adding a 'realm.properties'.
> >
> > While basic queries are working, queries involving more than one shard
> > are not working. I went through the code and figured out that in
> > HttpShardHandler, there is no provision to specify a username:password.
> >
> > I went through a lot of JIRAs/posts and was not able to figure out
> > whether it is really possible to do.
> >
> > Can we do a distributed operation with Jetty-based basic auth? Can you
> > please share the relevant links so that I can try it out?
> >
> >
> > Thanks,
> > Radhakrishnan
> >
>
>
>
> --
> Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
> | 434.409.2780
> http://www.opensourceconnections.com
>
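
For reference, the supported approach sketched in those links is a
security.json with the BasicAuthPlugin; with it, inter-node (shard-to-shard)
requests are authenticated via the PKI mechanism, which is exactly what a
Jetty-only setup lacks. A stripped-down sketch (the credential value is a
placeholder for Solr's salted-SHA256 hash, not a literal password):

{
  "authentication": {
    "blockUnknown": true,
    "class": "solr.BasicAuthPlugin",
    "credentials": { "solr": "<base64 sha256 hash> <base64 salt>" }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "user-role": { "solr": "admin" },
    "permissions": [ { "name": "security-edit", "role": "admin" } ]
  }
}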


Spatial search with arbitrary rectangle?

2017-08-23 Thread Paweł Kordek
Hi All


I've been skimming through the spatial search docs and came across this section:


https://lucene.apache.org/solr/guide/6_6/spatial-search.html#SpatialSearch-Filteringbyanarbitraryrectangle


"Sometimes the spatial search requirement calls for finding everything in a 
rectangular area, such as the area covered by a map the user is looking at. For 
this case, geofilt and bbox won’t cut it. "


I can't understand what is meant here by the "rectangular area". What is the 
coordinate system of this rectangle? If we are talking about a map, don't we 
have to consider the projection? Any help will be much appreciated.


Best regards

Paweł
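
For what it's worth, the paragraph that follows that sentence in the ref guide
answers part of the question: the rectangle is expressed as a range query
between the lower-left and upper-right corners, in the field's own coordinate
space -- for geodetic fields that is plain latitude,longitude degrees, with no
map projection involved, e.g. (field name hypothetical):

fq=store:[45,-94 TO 46,-93]

So a map viewport would have to be converted by the client from projected map
coordinates back to lat/lon bounds before building the filter.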



RE: Solr uses lots of shared memory!

2017-08-23 Thread Markus Jelsma
I have the problem in production and locally, with default Solr 6.6 JVM 
arguments; the environments are:

openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-8u141-b15-1~deb9u1-b15)
OpenJDK 64-Bit Server VM (build 25.141-b15, mixed mode)
Linux idx1 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u1 (2017-06-18) x86_64 
GNU/Linux

and

openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
Linux midas 4.10.0-32-generic #36-Ubuntu SMP Tue Aug 8 12:10:06 UTC 2017 x86_64 
x86_64 x86_64 GNU/Linux

Regarding the node that shows the problem, can you reproduce it locally? Fire 
it up, put some data in, confirm low shared memory usage, and execute a few 
thousand queries against it? We immediately see a sharp rise in shared memory, 
MBs per second, until it reaches some sort of plateau.

-Original message-
> From:Shawn Heisey 
> Sent: Wednesday 23rd August 2017 16:37
> To: solr-user@lucene.apache.org
> Subject: Re: Solr uses lots of shared memory!
> 
> On 8/23/2017 7:32 AM, Markus Jelsma wrote:
> > Why does it slowly increase over time?
> > Why does it appear to correlate to index size?
> > Is anyone else seeing this on their 6.6 cloud production or local machines?
> 
> More detailed information included here.  My 6.6 dev install is NOT
> having the problem, but a much older version IS.
> 
> I grabbed this screenshot only moments ago from a production server
> which is exhibiting a large SHR value for the Solr process:
> 
> https://www.dropbox.com/s/q79lo2gft9es06u/idxa1-top-big-shr.png?dl=0
> 
> This is Solr 4.7.2, with a 10 month uptime for the Solr process, running
> with these arguments:
> 
> -DSTOP.KEY=REDACTED
> -DSTOP.PORT=8078
> -Djetty.port=8981
> -Dsolr.solr.home=/index/solr4
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.port=8686
> -Dcom.sun.management.jmxremote
> -XX:+PrintReferenceGC
> -XX:+PrintAdaptiveSizePolicy
> -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps
> -Xloggc:logs/gc.log-verbose:gc
> -XX:+AggressiveOpts
> -XX:+UseLargePages
> -XX:InitiatingHeapOccupancyPercent=75
> -XX:MaxGCPauseMillis=250
> -XX:G1HeapRegionSize=8m
> -XX:+ParallelRefProcEnabled
> -XX:+PerfDisableSharedMem
> -XX:+UseG1GC
> -Dlog4j.configuration=file:etc/log4j.properties
> -Xmx8192M
> -Xms4096M
> 
> The OS is CentOS 6, with the following Java and kernel:
> 
> java version "1.7.0_72"
> Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
> 
> Linux idxa1 2.6.32-431.11.2.el6.centos.plus.x86_64 #1 SMP Tue Mar 25
> 21:36:54 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> 
> I also just grabbed a screenshot from my dev server, running Ubuntu 14,
> Solr 6.6.0, a LOT more index data, and a more recent Java version.  Solr
> has an uptime of about one month.  This server was installed with the
> service installer script, so it uses bin/solr.  It doesn't seem to have
> the same problem:
> 
> https://www.dropbox.com/s/85h1weuopa643za/bigindy5-top-small-shr.png?dl=0
> 
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> 
> Linux bigindy5 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC
> 2016 x86_64 x86_64 x86_64 GNU/Linux
> 
> The arguments for this one are very similar to the production server:
> 
> -DSTOP.KEY=solrrocks
> -DSTOP.PORT=7982
> -Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.local.only=false
> -Dcom.sun.management.jmxremote.port=18982
> -Dcom.sun.management.jmxremote.rmi.port=18982
> -Dcom.sun.management.jmxremote.ssl=false
> -Djetty.home=/opt/solr6/server
> -Djetty.port=8982
> -Dlog4j.configuration=file:/index/solr6/log4j.properties
> -Dsolr.install.dir=/opt/solr6
> -Dsolr.log.dir=/index/solr6/logs
> -Dsolr.log.muteconsole
> -Dsolr.solr.home=/index/solr6/data
> -Duser.timezone=UTC
> -XX:+AggressiveOpts
> -XX:+ParallelRefProcEnabled
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution
> -XX:+UseG1GC
> -XX:+UseGCLogFileRotation
> -XX:+UseLargePages
> -XX:G1HeapRegionSize=8m
> -XX:GCLogFileSize=20M
> -XX:InitiatingHeapOccupancyPercent=75
> -XX:MaxGCPauseMillis=250
> -XX:NumberOfGCLogFiles=9
> -XX:OnOutOfMemoryError=/opt/solr6/bin/oom_solr.sh 8982 /index/solr6/logs
> -Xloggc:/index/solr6/logs/solr_gc.log
> -Xms28g
> -Xmx28g
> -Xss256k
> -verbose:gc
> 
> Neither system has any huge pages allocated in the OS, so I doubt that
> the UseLargePages option is actually doing anything.  I've left it there
> in case I *do* enable huge pages, so they will automatically get used.
> 
> Thanks,
> Shawn
> 
> 


Re: Solr uses lots of shared memory!

2017-08-23 Thread Shawn Heisey
On 8/23/2017 7:32 AM, Markus Jelsma wrote:
> Why does it slowly increase over time?
> Why does it appear to correlate to index size?
> Is anyone else seeing this on their 6.6 cloud production or local machines?

More detailed information included here.  My 6.6 dev install is NOT
having the problem, but a much older version IS.

I grabbed this screenshot only moments ago from a production server
which is exhibiting a large SHR value for the Solr process:

https://www.dropbox.com/s/q79lo2gft9es06u/idxa1-top-big-shr.png?dl=0

This is Solr 4.7.2, with a 10 month uptime for the Solr process, running
with these arguments:

-DSTOP.KEY=REDACTED
-DSTOP.PORT=8078
-Djetty.port=8981
-Dsolr.solr.home=/index/solr4
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=8686
-Dcom.sun.management.jmxremote
-XX:+PrintReferenceGC
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:logs/gc.log-verbose:gc
-XX:+AggressiveOpts
-XX:+UseLargePages
-XX:InitiatingHeapOccupancyPercent=75
-XX:MaxGCPauseMillis=250
-XX:G1HeapRegionSize=8m
-XX:+ParallelRefProcEnabled
-XX:+PerfDisableSharedMem
-XX:+UseG1GC
-Dlog4j.configuration=file:etc/log4j.properties
-Xmx8192M
-Xms4096M

The OS is CentOS 6, with the following Java and kernel:

java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)

Linux idxa1 2.6.32-431.11.2.el6.centos.plus.x86_64 #1 SMP Tue Mar 25
21:36:54 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

I also just grabbed a screenshot from my dev server, running Ubuntu 14,
Solr 6.6.0, a LOT more index data, and a more recent Java version.  Solr
has an uptime of about one month.  This server was installed with the
service installer script, so it uses bin/solr.  It doesn't seem to have
the same problem:

https://www.dropbox.com/s/85h1weuopa643za/bigindy5-top-small-shr.png?dl=0

java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

Linux bigindy5 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC
2016 x86_64 x86_64 x86_64 GNU/Linux

The arguments for this one are very similar to the production server:

-DSTOP.KEY=solrrocks
-DSTOP.PORT=7982
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=18982
-Dcom.sun.management.jmxremote.rmi.port=18982
-Dcom.sun.management.jmxremote.ssl=false
-Djetty.home=/opt/solr6/server
-Djetty.port=8982
-Dlog4j.configuration=file:/index/solr6/log4j.properties
-Dsolr.install.dir=/opt/solr6
-Dsolr.log.dir=/index/solr6/logs
-Dsolr.log.muteconsole
-Dsolr.solr.home=/index/solr6/data
-Duser.timezone=UTC
-XX:+AggressiveOpts
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseG1GC
-XX:+UseGCLogFileRotation
-XX:+UseLargePages
-XX:G1HeapRegionSize=8m
-XX:GCLogFileSize=20M
-XX:InitiatingHeapOccupancyPercent=75
-XX:MaxGCPauseMillis=250
-XX:NumberOfGCLogFiles=9
-XX:OnOutOfMemoryError=/opt/solr6/bin/oom_solr.sh 8982 /index/solr6/logs
-Xloggc:/index/solr6/logs/solr_gc.log
-Xms28g
-Xmx28g
-Xss256k
-verbose:gc

Neither system has any huge pages allocated in the OS, so I doubt that
the UseLargePages option is actually doing anything.  I've left it there
in case I *do* enable huge pages, so they will automatically get used.

Thanks,
Shawn



RE: Solr uses lots of shared memory!

2017-08-23 Thread Markus Jelsma
I do not think it is a reporting problem: watching top after a restart of some
Solr instances, shared memory dropped back to `normal`, around 350 MB, which I
think is high too, but anyway.

Two hours later, the restarted nodes have slowly increased shared memory
consumption to about 1500 MB. I don't understand why shared memory usage
should increase slowly over time; it makes little sense to me, and I cannot
remember Solr doing this in the past ten years.

But it seems to correlate with index size on disk: these main text search
nodes have an index of around 16 GB and up to 3 GB of shared memory after a
few days. Logs nodes have up to 800 MB of index and 320 MB of shared memory,
and the low latency nodes have four different cores that make up just over
100 MB of index; their shared memory consumption is just 22 MB, which seems
more reasonable for the case of shared memory.

I can also force Solr to 'leak' shared memory just by sending queries to it.
My freshly restarted local node used 68 MB of shared memory at startup. Two
minutes and 25,000 queries later it was already at 2748 MB! At first there is
a very sharp increase to 2000, then it takes almost two minutes more to climb
to 2748. I can reduce the maximum shared memory usage to 1200 if I query (via
edismax) only on fields of one language instead of 25 or so. I can reduce it
even further if I disable highlighting (HUH?) but still query on all fields.
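
A minimal SolrJ sketch of that kind of query load, for anyone who wants to
reproduce the observation (the host, core name, and field names here are
placeholders, not the real setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class QueryLoad {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder()
        .withBaseSolrUrl("http://localhost:8983/solr/search").build()) {
      SolrQuery q = new SolrQuery();
      q.set("defType", "edismax");
      // querying across many per-language fields touches far more of the
      // index; narrowing this list makes the shared memory growth smaller
      q.set("qf", "title_en content_en title_de content_de");
      for (int i = 0; i < 25000; i++) {
        q.setQuery("term" + (i % 1000)); // vary terms so more index pages are read
        client.query(q);
      }
    }
  }
}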

* We have tried patching Java's ByteBuffer [1] because it seemed to fit the
problem, but it does not fix it.
* We have also removed all our custom plugins, so it has become a vanilla Solr
6.6 with just our stripped-down schema and solrconfig; that does not fix it
either.

Why does it slowly increase over time?
Why does it appear to correlate to index size?
Is anyone else seeing this on their 6.6 cloud production or local machines?

Thanks,
Markus
 
[1]: http://www.evanjones.ca/java-bytebuffer-leak.html

-Original message-
> From:Shawn Heisey 
> Sent: Tuesday 22nd August 2017 17:32
> To: solr-user@lucene.apache.org
> Subject: Re: Solr uses lots of shared memory!
> 
> On 8/22/2017 7:24 AM, Markus Jelsma wrote:
> > I have never seen this before, one of our collections, all nodes eating 
> > tons of shared memory!
> >
> > Here's one of the nodes:
> > 10497 solr  20   0 19.439g 4.505g 3.139g S   1.0 57.8   2511:46 java 
> >
> > RSS is roughly equal to heap size + usual off-heap space + shared memory. 
> > Virtual is equal to RSS plus index size on disk. For two other collections, 
> > the nodes use shared memory as expected, in the MB range.
> >
> > How can Solr, this collection, use so much shared memory? Why?
> 
> I've seen this on my own servers at work, and when I add up a subset of
> the memory numbers I can see from the system, it ends up being more
> memory than I even have in the server.
> 
> I suspect there is something odd going on in how Java reports memory
> usage to the OS, or maybe a glitch in how Linux interprets Java's memory
> usage.  At some point in the past, numbers were reported correctly.  I
> do not know if the change came about because of a Solr upgrade, because
> of a Java upgrade, or because of an OS kernel upgrade.  All three were
> upgraded between when I know the numbers looked right and when I noticed
> they were wrong.
> 
> https://www.dropbox.com/s/91uqlrnfghr2heo/solr-memory-sorted-top.png?dl=0
> 
> This screenshot shows that Solr is using 17GB of memory, 41.45GB of
> memory is being used by the OS disk cache, and 10.23GB of memory is
> free.  Add those up, and it comes to 68.68GB ... but the machine only
> has 64GB of memory, and that total doesn't include the memory usage of
> the other processes seen in the screenshot.  This impossible situation
> means that something is being misreported somewhere.  If I deduct that
> 11GB of SHR from the RES value, then all the numbers work.
> 
> The screenshot was almost 3 years ago, so I do not know what machine it
> came from, and therefore I can't be sure what the actual heap size was. 
> I think it was about 6GB -- the difference between RES and SHR.  I have
> used a 6GB heap on some of my production servers in the past.  The
> server where I got this screenshot was not having any noticeable
> performance or memory problems, so I think that I can trust that the
> main numbers above the process list (which only come from the OS) are
> correct.
> 
> Thanks,
> Shawn
> 
> 


Re: Get results in multiple orders (multiple boosts)

2017-08-23 Thread Susheel Kumar
This is just dummy code to show you how you can add a request handler in
solrconfig and use it to sort by a custom field based on your criteria. You
can do a lot here, like using the pow function, creating more local params,
etc., based on your needs.



<!-- NOTE: the mail archive stripped the XML tags from this snippet. The
     handler, element, and parameter names below are a best-guess
     reconstruction; only the values are from the original mail. -->
<requestHandler name="/sortdemo" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">*:*</str>
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">_text_</str>
    <str name="q.alt">*:*</str>
    <str name="sort">{!query defType=func v=$f_category} desc</str>
    <str name="fq">category:{!query defType=func v=$f_category}</str>
    <str name="f_category">{!func}if(exists(query(category_i:9500)),$f_9500,$f_1100)</str>
    <str name="f_9500">{!func}if(exists(query(source_i:5)),100,if(exists(query(source_i:9)),90,if(exists(query(source_i:7)),80)))</str>
    <str name="f_1100">{!func}if(exists(query(source_i:5)),70,if(exists(query(source_i:9)),60,if(exists(query(source_i:7)),50)))</str>
  </lst>
</requestHandler>
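
For illustration, querying such a handler from SolrJ could look like the
sketch below (the handler name, collection, and host are assumptions matching
the reconstruction above, not a known setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CustomSortDemo {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder()
        .withBaseSolrUrl("http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setRequestHandler("/sortdemo"); // a leading '/' makes SolrJ use it as the path
      // local params defined in the handler's defaults can be overridden per request:
      q.set("f_category", "{!func}if(exists(query(category_i:9500)),$f_9500,$f_1100)");
      QueryResponse rsp = client.query(q);
      System.out.println("hits: " + rsp.getResults().getNumFound());
    }
  }
}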





On Wed, Aug 23, 2017 at 9:10 AM, Susheel Kumar 
wrote:

> Hi Luca,
>
> Sorry, I was out.  Let me try to put some dummy code as an example. Will
> be putting it shortly.
>
> Thnx
>
> On Tue, Aug 22, 2017 at 6:08 AM, Rick Leir  wrote:
>
>> Luca,
>> Did you say _slower_ mySQL? It is blazing fast, I used it with over 10m
>> records and no appreciable latency. The underlying InnoDB is excellent.
>> Design your schema using mySQLworkbench. Cheers -- Rick
>>
>> On August 22, 2017 2:16:07 AM EDT, Luca Dall'Osto
>>  wrote:
>> >Hello,
>> >thank you for your responses.
>> >Ok, therefore I have to archive this problem with no appropriate
>> >solution in Solr, and try to do it with a relation-based DB such as
>> >mySQL or Postgres.
>> >Could building a custom sort function be a valid solution, instead of
>> >using the slower mySQL or trying Postgres (I have never used Postgres),
>> >or should I forget it?
>> >Thanks!
>> >
>> >
>> >Luca
>> >
>> >
>> >On Saturday, August 19, 2017 1:02 AM, Rick Leir 
>> >wrote:
>> >
>> >
>> > Luca
>> >Walter has got the best word on this, you should use SQL for sorting
>> >(maybe mySQL or Postgres). If you also need searching, you can create a
>> >Solr index by ingesting from the SQL database. The Solr index would be
>> >just used for searching. Cheers -- Rick
>> >--
>> >Sorry for being brief. Alternate email is rickleir at yahoo dot com
>> >
>> >
>>
>> --
>> Sorry for being brief. Alternate email is rickleir at yahoo dot com
>>
>
>


Re: Get results in multiple orders (multiple boosts)

2017-08-23 Thread Susheel Kumar
Hi Luca,

Sorry, I was out.  Let me try to put some dummy code as an example. Will be
putting it shortly.

Thnx

On Tue, Aug 22, 2017 at 6:08 AM, Rick Leir  wrote:

> Luca,
> Did you say _slower_ mySQL? It is blazing fast, I used it with over 10m
> records and no appreciable latency. The underlying InnoDB is excellent.
> Design your schema using mySQLworkbench. Cheers -- Rick
>
> On August 22, 2017 2:16:07 AM EDT, Luca Dall'Osto
>  wrote:
> >Hello,
> >thank you for your responses.
> >Ok, therefore I have to archive this problem with no appropriate
> >solution in Solr, and try to do it with a relation-based DB such as
> >mySQL or Postgres.
> >Could building a custom sort function be a valid solution, instead of
> >using the slower mySQL or trying Postgres (I have never used Postgres),
> >or should I forget it?
> >Thanks!
> >
> >
> >Luca
> >
> >
> >On Saturday, August 19, 2017 1:02 AM, Rick Leir 
> >wrote:
> >
> >
> > Luca
> >Walter has got the best word on this, you should use SQL for sorting
> >(maybe mySQL or Postgres). If you also need searching, you can create a
> >Solr index by ingesting from the SQL database. The Solr index would be
> >just used for searching. Cheers -- Rick
> >--
> >Sorry for being brief. Alternate email is rickleir at yahoo dot com
> >
> >
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
>


Re: Excessive resources consumption migrating from Solr 6.6.0 Master/Slave to SolrCloud 6.6.0 (dozen times more resources)

2017-08-23 Thread Scott Stults
Hi Daniel,

Great background information about your setup! I've got just a few more
questions:

- Can you describe the process that queries the DB and sends records to
Solr?
- Is it a SolrJ-based application?
- If it is, which client package are you using?
- How many documents do you send at once?
- Are you sending your indexing or query traffic through a load balancer?

If you're sending documents to each replica as fast as they can take them,
you might be seeing a bottleneck at the shard leaders. The SolrJ
CloudSolrClient finds out from Zookeeper which nodes are the shard leaders
and sends docs directly to them.
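
A minimal sketch of that pattern (the ZooKeeper ensemble and collection name
below are placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class LeaderAwareIndexer {
  public static void main(String[] args) throws Exception {
    // CloudSolrClient watches cluster state in ZooKeeper and routes each
    // batch straight to the shard leaders instead of via arbitrary replicas.
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr").build()) {
      client.setDefaultCollection("adverts");
      List<SolrInputDocument> batch = new ArrayList<>();
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "advert-1");
      batch.add(doc);
      client.add(batch); // batching documents amortizes per-request overhead
      // no explicit commit: autoCommit/autoSoftCommit control visibility
    }
  }
}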


-Scott

On Tue, Aug 22, 2017 at 2:16 PM, Daniel Ortega 
wrote:

> *Main Problems*
>
>
> We are involved in a migration from Solr Master/Slave infrastructure to
> SolrCloud infrastructure.
>
>
>
> The main problems that we have now are:
>
>
>
>- Excessive resource consumption: Currently we have 5 instances with 80
>  processors/768 GB RAM each, using SSD hard disk drives, and they don't
>  support the load that we have in the other architecture. In our
>  Master-Slave architecture we have only 7 virtual machines with lower
>  specs (4 processors and 16 GB each, using SSD hard disk drives too). So,
>  at the moment our SolrCloud infrastructure is wasting several dozen
>  times more resources than our Solr Master/Slave infrastructure.
>- Despite spending more resources we have worse query times (compared to
>  Solr in master/slave architecture)
>
>
> *Search infrastructure (SolrCloud infrastructure)*
>
>
>
> As we cannot use DIH Handler (which is what we use in Solr Master/Slave),
> we
> have developed an application which reads every transaction from Oracle,
> builds a document collection searching in the database and sends the result
> to the */update* handler every 200 milliseconds using SolrJ client. This
> application tries to delete the possible duplicates in each update window,
> but we are using solr’s de-duplication techniques
> <https://cwiki.apache.org/confluence/display/solr/De-Duplication> too.
>
>
>
> We are indexing ~100 documents per second (with peaks of ~1000 documents
> per second).
>
>
>
> Every search query is centralized in another application which exposes a DSL
> behind a REST API and also uses the SolrJ client to perform queries. We have
> peaks of 2000 QPS.
>
> *Cluster structure **(SolrCloud infrastructure)*
>
>
>
> At the moment, the cluster has 30 SolrCloud instances with the same specs
> (Same physical hosts, same JVM Settings, etc.).
>
>
>
> *Main collection*
>
>
>
> In our use case we are using this collection as a NoSQL database basically.
> Our document is composed of about 300 fields that represent an advert, and
> is a denormalization of its relational representation in Oracle.
>
>
> We are using all our nodes to store the collection in 3 shards. So, each
> shard has 10 replicas.
>
>
> At the moment, we are only indexing a subset of the adverts stored in
> Oracle, but our goal is to store all the ads that we have in the DB (a few
> tens of millions of documents). We have NRT requirements, so we need to
> index every document as soon as possible once it’s changed in Oracle.
>
>
>
> We have defined the properties of each field (whether it’s stored/indexed or
> not, whether it should be defined as docValues, etc.) considering the use of
> that field.
>
>
>
> *Index size **(SolrCloud infrastructure)*
>
>
>
> The index size is currently above 6 GB per shard, storing 1,300,000 documents
> in each shard. So, we are storing 3,900,000 documents in total and the whole
> index size is 18 GB.
>
>
>
> *Indexation **(SolrCloud infrastructure)*
>
>
>
> The commits *aren’t* triggered by the application described before. The
> hardcommit/softcommit intervals are configured in Solr:
>
>
>
>- *HardCommit:* every 15 minutes (with opensearcher = false)
>- *SoftCommit:* every 5 seconds
>
>
>
> *Apache Solr Version*
>
>
>
> We are currently using the latest version of Solr (6.6.0) under an Oracle VM
> (Java(TM) SE Runtime Environment (build 1.8.0_131-b11) Oracle (64 bits)) in
> both deployments.
>
>
> The question is... What is wrong here?!?!?!
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Re: solr jetty based auth and distributed solr requests

2017-08-23 Thread Scott Stults
Radhakrishnan,

I'm not sure offhand whether or not that's possible. It sounds like you've
done enough analysis to write a good Jira ticket, so if nobody speaks up on
the mailing list, go ahead and create one.
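
For the client-to-node hop, SolrJ does let you attach basic-auth credentials
to an individual request, as in the sketch below (the URL and credentials are
placeholders). Note this does not cover the node-to-node requests that
HttpShardHandler makes, which is the gap you found:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;

public class BasicAuthQuery {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder()
        .withBaseSolrUrl("http://localhost:8983/solr/mycore").build()) {
      QueryRequest req = new QueryRequest(new SolrQuery("*:*"));
      // adds an HTTP Basic Authorization header to this one request
      req.setBasicAuthCredentials("solruser", "solrpass");
      System.out.println(req.process(client).getResults().getNumFound());
    }
  }
}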


Cheers,
Scott

On Tue, Aug 22, 2017 at 7:15 PM, radha krishnan 
wrote:

> Hi,
>
> I enabled Jetty basic auth for Solr by making changes to jetty.xml and
> adding a 'realm.properties'.
>
> While basic queries are working, queries involving more than one shard are
> not working. I went through the code and figured out that in
> HttpShardHandler, there is no provision to specify a username:password.
>
> I went through a lot of JIRAs/posts and was not able to figure out whether
> it is really possible to do.
>
> Can we do a distributed operation with Jetty-based basic auth? Can you
> please share the relevant links so that I can try it out?
>
>
> Thanks,
> Radhakrishnan
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com