Re: Mapping Solr Exceptions to Error Code while using SolrJ

2016-02-24 Thread Debraj Manna
Hi,

Any help or pointers on this issue?

Thanks,

On Wed, Feb 24, 2016 at 12:44 PM, Debraj Manna 
wrote:

> Hi,
>
> I am using SolrJ 5.1 to add & delete docs from Solr. Whenever there
> is some exception while doing addition or deletion, Solr throws a
> SolrServerException with the error message in the exception.
>
> I am trying to map each error to an error code. For example, if I am
> getting an exception message "Error from server at
> http://abc-solr4:8585/solr/discovery: user version is not high enough:
> 1456297688" then I want to map this to a more generic error with the message
> "Low User Version" & some error code, let's say XXX, so that later on it
> becomes easier to know what kind of error we are getting.
>
> The only way I can think of right now is catching the SolrServerException
> & then doing a string match on e.getMessage(), since Solr is not
> sending any other error code.
>
> Can someone let me know if there is a better way of doing this & how
> people generally handle errors while using SolrJ?
>
> Thanks,
> Debraj
>
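For illustration, a minimal sketch of the string-matching approach described above, assuming SolrJ 5.x. The IndexErrorCode enum, its numeric codes, and every message fragment except the "user version is not high enough" one are made up for the example; they are not anything SolrJ provides:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;

// Illustrative error codes; these are not part of SolrJ.
enum IndexErrorCode {
    LOW_USER_VERSION(101, "Low User Version"),
    VERSION_CONFLICT(102, "Version Conflict"),
    UNKNOWN(999, "Unknown Error");

    final int code;
    final String label;
    IndexErrorCode(int code, String label) { this.code = code; this.label = label; }

    // Map the raw exception message onto a generic code by substring match.
    static IndexErrorCode fromMessage(String msg) {
        if (msg == null) return UNKNOWN;
        if (msg.contains("user version is not high enough")) return LOW_USER_VERSION;
        if (msg.contains("version conflict")) return VERSION_CONFLICT;
        return UNKNOWN;
    }
}

class IndexingClient {
    void addDoc(SolrClient client, SolrInputDocument doc) {
        try {
            client.add("discovery", doc);
        } catch (SolrServerException | IOException e) {
            IndexErrorCode code = IndexErrorCode.fromMessage(e.getMessage());
            System.err.println("Indexing failed: " + code.code + " / " + code.label);
        }
    }
}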


Re: Get one fragment of text of field

2016-02-24 Thread Anil
One way I see is:

Store a display snippet in a separate field and fetch that instead.

Please let me know if you see any other ways or issues with the approach.

Regards,
Anil

On 25 February 2016 at 11:30, Anil  wrote:

> Hi,
>
> We are indexing and storing 2 MB of text data in a text field. We need to
> display partial content of the field in the UI every time this
> document is part of the response.
>
> Is there any way to get only a few characters of the field? Fetching 2 MB of
> data and truncating it in the application is a fair bit of overhead.
>
> Thanks for your help.
>
> Regards,
> Anil
>
> On 18 February 2016 at 12:51, Anil  wrote:
>
>> Thanks Binoy.
>>
>> Indexing should happen on everything.
>>
>> But retrieval/fetch should limit the characters. Is that possible?
>>
>> Regards,
>> Anil
>>
>> On 18 February 2016 at 12:24, Binoy Dalal  wrote:
>>
>>> If you are not particular about what part of the field is returned you
>>> can
>>> create copy fields and set a limit on those to store only the number of
>>> characters you want.
>>>
>>> <copyField source="SRC" dest="dest" maxChars="500"/>
>>>
>>> This will copy over the first 500 chars of the contents of your SRC field
>>> to your dest field.
>>> Anything beyond this will be truncated.
>>>
>>> On Thu, 18 Feb 2016, 12:00 Anil  wrote:
>>>
>>> > Hi,
>>> >
>>> > We have around 30 fields in a Solr document, and we search for text in all
>>> > fields (by creating a record field with copyField).
>>> >
>>> > A few fields have huge text, on the order of MBs. How can I get only a
>>> > fragment of those fields in a configurable way?
>>> >
>>> > We have to display each field's content in the UI, so it is a must to get
>>> > the content of each field.
>>> >
>>> > For now, I am fetching the content from Solr and truncating it in my
>>> > code, but it has poor performance.
>>> >
>>> > Is there any way to achieve fragmentation (not highlight fragmentation)
>>> > in Solr? Please advise.
>>> >
>>> > Regards,
>>> > Anil
>>> >
>>> --
>>> Regards,
>>> Binoy Dalal
>>>
>>
>>
>
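For illustration, a small SolrJ sketch of the retrieval side of Binoy's copyField-with-maxChars suggestion, assuming the truncated copy lands in a field called display_snippet (the field name, collection and URL are made up for the example):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SnippetQueryExample {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
        // Ask only for the small copy field, never the 2 MB source field.
        SolrQuery q = new SolrQuery("some search terms");
        q.setFields("id", "display_snippet");
        QueryResponse rsp = client.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("display_snippet"));
        }
        client.close();
    }
}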


Re: numFound in facet results

2016-02-24 Thread Anil
Can someone share your ideas?

On 24 February 2016 at 08:14, Anil  wrote:

> Yes, Yonik. I could not find numBuckets true/false in the Solr reference
> documentation, nor in the SolrJ facet params.
>
> Could you please point me to the documentation? Thank you.
>
> On 23 February 2016 at 20:53, Yonik Seeley  wrote:
>
>> On Mon, Feb 22, 2016 at 2:34 AM, Anil  wrote:
>> > Can we get a numFound for the number of facet results for a query, like in
>> > the main results?
>>
>> Do you mean the number of facet buckets?
>>
>> You can add "numBuckets":true to a JSON facet request to get this info.
>> http://yonik.com/json-facet-api/
>>
>> -Yonik
>>
>
>
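For illustration, a rough SolrJ sketch of sending a JSON facet request with numBuckets:true. This assumes Solr 5.x with the JSON Facet API; SolrJ has no dedicated wrapper here, so the facet is passed as a plain json.facet parameter, and the field name "category" and collection URL are made up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

public class NumBucketsExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        // Terms facet on "category", asking for the total bucket count as well.
        q.add("json.facet",
              "{ categories : { type : terms, field : category, limit : 10, numBuckets : true } }");
        QueryResponse rsp = client.query(q);
        // The JSON Facet API answer comes back under the top-level "facets" key.
        NamedList<?> facets = (NamedList<?>) rsp.getResponse().get("facets");
        System.out.println(facets);
        client.close();
    }
}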


Re: Get one fragment of text of field

2016-02-24 Thread Anil
Hi,

We are indexing and storing 2 MB of text data in a text field. We need to
display partial content of the field in the UI every time this
document is part of the response.

Is there any way to get only a few characters of the field? Fetching 2 MB of
data and truncating it in the application is a fair bit of overhead.

Thanks for your help.

Regards,
Anil

On 18 February 2016 at 12:51, Anil  wrote:

> Thanks Binoy.
>
> Indexing should happen on everything.
>
> But retrieval/fetch should limit the characters. Is that possible?
>
> Regards,
> Anil
>
> On 18 February 2016 at 12:24, Binoy Dalal  wrote:
>
>> If you are not particular about what part of the field is returned you can
>> create copy fields and set a limit on those to store only the number of
>> characters you want.
>>
>> <copyField source="SRC" dest="dest" maxChars="500"/>
>>
>> This will copy over the first 500 chars of the contents of your SRC field
>> to your dest field.
>> Anything beyond this will be truncated.
>>
>> On Thu, 18 Feb 2016, 12:00 Anil  wrote:
>>
>> > Hi,
>> >
>> > We have around 30 fields in a Solr document, and we search for text in all
>> > fields (by creating a record field with copyField).
>> >
>> > A few fields have huge text, on the order of MBs. How can I get only a
>> > fragment of those fields in a configurable way?
>> >
>> > We have to display each field's content in the UI, so it is a must to get
>> > the content of each field.
>> >
>> > For now, I am fetching the content from Solr and truncating it in my
>> > code, but it has poor performance.
>> >
>> > Is there any way to achieve fragmentation (not highlight fragmentation)
>> > in Solr? Please advise.
>> >
>> > Regards,
>> > Anil
>> >
>> --
>> Regards,
>> Binoy Dalal
>>
>
>


Re: WhitespaceTokenizerFactory and PathHierarchyTokenizerFactory

2016-02-24 Thread Anil
Sorry Jack for the confusion.

I have a field which holds free text. The text can contain a path, an IP, or
any free text.

I would like to tokenize the text of the field using whitespace. If a token
matches a path or IP pattern, it should be tokenized in the path-hierarchy
way.


Regards,
Anil

On 24 February 2016 at 21:59, Jack Krupansky 
wrote:

> Your statement makes no sense. Please clarify. Express your requirement(s)
> in plain English first before dragging in possible solutions. Technically,
> path elements can have embedded spaces.
>
> -- Jack Krupansky
>
> On Wed, Feb 24, 2016 at 6:53 AM, Anil  wrote:
>
> > Hi,
> >
> > I need to use both WhitespaceTokenizerFactory and
> > PathHierarchyTokenizerFactory for a use case.
> >
> > Solr supports only one tokenizer. Is there any way we can achieve
> > PathHierarchyTokenizerFactory functionality with filters?
> >
> > Please advise.
> >
> > Regards,
> > Anil
> >
>


(Solr 5.5) How do beginners modify dynamic schema now that it is default?

2016-02-24 Thread Alexandre Rafalovitch
Hi,

In Solr 5.5, all the shipped examples now use dynamic schema. So, how
are they expected to add new types? We have "add/delete fields" UI in
the new Admin UI, but not "add/delete types".

Do we expect them to use REST endpoints and curl? Or to not modify
types at all? Or edit the "do not edit" managed schema?

I admit being a bit confused about the beginner's path now. Could
somebody else - more familiar with the context - comment, please!

Thank you,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/
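For reference, the Schema API route would look roughly like this; a hedged sketch assuming a Solr 5.5 managed schema, with the field type definition and collection name made up for the example:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AddFieldTypeExample {
    public static void main(String[] args) throws Exception {
        // JSON body for the Schema API "add-field-type" command.
        String body = "{ \"add-field-type\" : {"
                + "   \"name\" : \"text_lower\","
                + "   \"class\" : \"solr.TextField\","
                + "   \"analyzer\" : {"
                + "     \"tokenizer\" : { \"class\" : \"solr.StandardTokenizerFactory\" },"
                + "     \"filters\" : [ { \"class\" : \"solr.LowerCaseFilterFactory\" } ] } } }";

        URL url = new URL("http://localhost:8983/solr/gettingstarted/schema");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Schema API response code: " + conn.getResponseCode());
    }
}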


RE: Hitting complex multilevel pivot queries in solr

2016-02-24 Thread Lewin Joy (TMS)
Hi Alvaro, 

We had thought about this. But our requirement is dynamic. 
The 4 fields to pivot on would change as per the many requirements.
So, this will need to be handled at query time.

Just considering the Endeca equivalent, it looks easy there.
If this feature is not available in Solr, would it be much effort to build
it?

P.S.
The Endeca equivalent query is below:
RETURN Results as SELECT Count(1) as "Total" GROUP BY "Country", "State", 
"part_num", "part_code" ORDER BY "Total" desc PAGE(0,100)

-Lewin

-Original Message-
From: Alvaro Cabrerizo [mailto:topor...@gmail.com] 
Sent: Friday, February 19, 2016 1:02 AM
To: solr-user@lucene.apache.org
Subject: Re: Hitting complex multilevel pivot queries in solr

Hi,

The only way I can imagine is to create that auxiliary field and perform the 
facet on it. It means that you have to know "a priori" the kind of report 
(facet field) you need.

For example, if your current data (SolrDocument) is:

{
   "id": 3757,
   "country": "CountryX",
   "state": "StateY",
   "part_num: "part_numZ",
   "part_code": "part_codeW"
}

It should be changed at index time to:

{
   "id": 3757,
   "country": "CountryX",
   "state": "StateY",
   "part_num: "part_numZ",
   "part_code": "part_codeW",
   "auxField": "CountryX StateY part_numZ part_codeW"
}

And then perform the query faceting by auxField.


Regards.

On Fri, Feb 19, 2016 at 1:15 AM, Lewin Joy (TMS) 
wrote:

> Hi,
>
> The fields are single valued. But the requirement will be at query 
> time rather than index time. This is because we will be having many 
> such scenarios with different fields.
> I hoped we could concatenate at query time. I just need the top 100 counts 
> from the leaf level of the pivot.
> I'm also looking at facet.threads, which could speed up responses to an
> extent, but it does not solve my issue.
>
> However, the Endeca equivalent of this application seems to be working 
> well.
> Example Endeca Query:
>
> RETURN Results as SELECT Count(1) as "Total" GROUP BY "Country", 
> "State", "part_num", "part_code" ORDER BY "Total" desc PAGE(0,100)
>
>
> -Lewin
>
>
> -Original Message-
> From: Alvaro Cabrerizo [mailto:topor...@gmail.com]
> Sent: Thursday, February 18, 2016 3:06 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Hitting complex multilevel pivot queries in solr
>
> Hi,
>
> The idea of copying fields into a new one (or several) during indexing 
> and then faceting on the new field (or fields) looks promising. More 
> information about the data would be helpful (for example whether the 
> fields country, state, etc. are single- or multi-valued). If all 
> of the fields are single valued, then the combination of 
> country,state,part_num,part_code looks like a file path 
> country/state/part_num/part_code and maybe (I don't know your business 
> rules) the solr.PathHierarchyTokenizerFactory could 
> be an option to research instead of facet pivoting. On the other hand, 
> I don't think that the copy field feature 
> (https://cwiki.apache.org/confluence/display/solr/Copying+Fields) 
> can help you to build that auxiliary field. I think that 
> configuring an updateRequestProcessorChain 
> (https://wiki.apache.org/solr/UpdateRequestProcessor) and building your 
> own UpdateRequestProcessorFactory to concat the 
> country,state,part_num,part_code values can be a better way.
>
> Hope it helps.
>
> On Thu, Feb 18, 2016 at 8:47 PM, Lewin Joy (TMS) 
> 
> wrote:
>
> > Still splitting my head over this one.
> > Let me know if anyone has any idea I could try.
> >
> > Or, is there a way to concatenate these 4 fields onto a dynamic 
> > field and do a facet.field on top of this one?
> >
> > Thanks. Any idea is helpful to try.
> >
> > -Lewin
> >
> > -Original Message-
> > From: Lewin Joy (TMS) [mailto:lewin@toyota.com]
> > Sent: Wednesday, February 17, 2016 4:29 PM
> > To: solr-user@lucene.apache.org
> > Subject: Hitting complex multilevel pivot queries in solr
> >
> > Hi,
> >
> > Is there an efficient way to hit solr for complex time consuming queries?
> > I have a requirement where I need to pivot on 4 fields. Two fields 
> > contain facet values close to 50. And the other 2 fields have 5000 
> > and
> 8000 values.
> > Pivoting on the 4 fields would crash the server.
> >
> > Is there a better way to get the data?
> >
> > Example Query Params looks like this:
> > &facet.pivot=country,state,part_num,part_code
> >
> > Thanks,
> > Lewin
> >
> >
> >
> >
>
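For illustration, a rough sketch of the custom update processor Alvaro describes, concatenating the four values into auxField at index time. Class and field names are made up, and the factory would still have to be registered in an updateRequestProcessorChain in solrconfig.xml:

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

import java.io.IOException;

public class ConcatPivotFieldsProcessorFactory extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                // Build "CountryX StateY part_numZ part_codeW" and store it in auxField.
                StringBuilder sb = new StringBuilder();
                for (String f : new String[] {"country", "state", "part_num", "part_code"}) {
                    Object v = doc.getFieldValue(f);
                    if (v != null) {
                        if (sb.length() > 0) sb.append(' ');
                        sb.append(v);
                    }
                }
                doc.setField("auxField", sb.toString());
                super.processAdd(cmd);
            }
        };
    }
}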


Re: I have one small question that always intrigue me

2016-02-24 Thread Zara Parst
Very well explained, thanks Daniel, really thanks. I read your email
thoroughly and I enjoyed it while I was reading. Though at some points my
thinking departs from your view, I still got a good sense of how you see
the architectural world of the Lucene ecosystem. I will try to make more
sense out of it when I implement a few of your suggestions. For now,
thanks from my side.

Note: To justify my email, I am sorry for sending it to so many lists.
Actually I had been chasing this question for a month or more. I invested an
adequate number of hours trying to figure it out myself, and when I failed to
gain insight, I started asking individuals from the user list, but none
answered. After a while I sent the same email to the Lucene and ZooKeeper user
lists and waited for a week or so, but that too was in vain. Finally my
frustration grew and I sent the email to as many lists as I could, because I
kept wondering how developers could not have faced this problem at least once,
and if they did, what does ignoring a genuine concern mean?

Anyway really thanks.
Have a nice time.

On Wed, Feb 24, 2016 at 8:33 PM, Davis, Daniel (NIH/NLM) [C] <
daniel.da...@nih.gov> wrote:

> I've wondered about this as well.   Recall that the proper architecture
> for Solr as well as ZooKeeper is as a back-end service, part of a tiered
> architecture, with web application servers in front.   Solr and other
> search engines should fit in at the same layer as RDBMS and  NoSQL, with
> the web applications in front of them.   In some larger systems, there is
> even an Enterprise SOA layer in between, but I've never worked on a project
> where I felt that was truly justified.   It is probably a matter of scale
> however.
>
> The common-case solution relies on this architecture - Solr and Zookeeper
> can be protected by IP address firewalls both off system and on system.
> The network firewalls (AWS security policy) allow only certain ip
> addresses/networks to connect to Solr and Zookeeper, and the local system
> firewalls act as a back-up to this system.   The SHA1 checksum within
> ZooKeeper and the Basic Authentication within SolrCloud then act as a way
> to fine tune access control, but they are not so much to protect Solr and
> Zookeeper but to allow a division of privileges.
>
> Some sites will find this insufficient:
> - Solr supports SSL -
> https://cwiki.apache.org/confluence/display/solr/Enabling+SSL
> - ZooKeeper supports SSL -
> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide
>
> Both also at this point support custom authentication providers.
>
> My Solr is less protected than it should be, but I have mod_auth_cas
> protecting the solr admin interface, and certain request handlers can be
> accessed without this security through hand-built Apache httpd conf.d files
> for each core.There is a load-balancer (like Amazon Elastic Load
> Balancer (ELB)) in front of all Solr nodes, and since fault-tolerance is
> needed only for search, not for indexing, this is adequate. In other
> words, my Solr clients would not operate in SolrCloud mode, even if I made
> the Solr instance itself SolrCloud for ease of management.I'm having a
> little bit of a problem justifying this setup - the Role Based
> Authorization Plugin for Solr Basic Auth only scales to Enterprise use if
> you have a web front-end to manage the users, passwords, groups, and roles.
>
> Does this help?
>
> P.S. - Generally, one cross-posts to another list only when one does not
> receive a good reply on the first list.   I can see how both
> u...@zookeeper.apache.org and solr-user@lucene.apache.org may be
> justified, but I don't see how you can justify more lists than this.
>
> -Original Message-
> From: Zara Parst [mailto:edotserv...@gmail.com]
> Sent: Wednesday, February 24, 2016 3:27 AM
> To: zookeeper-u...@hadoop.apache.org; f...@apache.org; AALSIHE <
> aali...@gmail.com>; u...@zookeeper.apache.org; solr-user@lucene.apache.org;
> d...@nutch.apache.org; u...@nutch.apache.org; comm...@lucene.apache.org;
> u...@lucene.apache.org
> Subject: I have one small question that always intrigue me
>
> Hi everyone,
>
> I really need your help, please read below.
>
>
> If we have to run Solr in cloud mode, we are going to use ZooKeeper. Now
> any ZooKeeper client can connect to the ZooKeeper server. ZooKeeper has a
> facility to protect znodes, however anyone can see a znode ACL; the
> password may be encrypted, but decrypting or guessing the password is
> not a big deal. As we know the password is SHA encrypted, and there is no
> limit on the number of tries to authorize against an ACL. So my point is
> how to safeguard ZooKeeper.
>
> I can guess a few things:
>
> a. Don't reveal the IP of your ZooKeeper (security through obscurity)
> b. iptables, which is also not a very good idea
> c. what else??
>
> My guess was that somehow we can protect the ZooKeeper server itself by asking
> the client to authorize itself before it can make a connection to the ensemble,
> even at root 

Re: Null Pointer Exception on distributed search

2016-02-24 Thread Shawn Heisey
On 2/24/2016 9:58 AM, Lokesh Chhaparwal wrote:
> Can someone please update on this exception trace while we are using
> distributed search using shards parameter (solr-master-slave).

The line of code where the NPE happened (from the 4.7.2 source) is in
XMLWriter.java, at line 190:

for (String fname : doc.getFieldNames()) {

This means that "doc" is null.  This variable is pulled out of a
SolrDocumentList object, which means that whatever created that
SolrDocumentList managed to add a null entry.

I ran into a nearly identical stacktrace last year:

https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201509.mbox/%3c55f9e5df.1050...@elyograg.org%3E

I never did get a response to that message on the mailing list, and I
can no longer recall exactly what I did to fix it.  What little I do
remember suggests that this was caused by a situation where the field I
was grouping on was not present in at least one document.  This
situation should have been impossible in our data set, but occasionally
a badly formatted document will result in bad data in the database used
to populate Solr, which causes unexpected behavior.

Is this perhaps a query that uses grouping?  A similar problem might
happen with facets, but I am unsure about that.

Thanks,
Shawn



Re:5.5.0 SOLR-8621 deprecation warnings without maxMergeDocs or mergeFactor

2016-02-24 Thread Christine Poerschke (BLOOMBERG/ LONDON)
https://issues.apache.org/jira/browse/SOLR-8734 created for follow-up.

- Original Message -
From: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
At: Feb 24 2016 22:41:14

Hi Markus - thank you for the question.

Could you advise if/that the solrconfig.xml has a  element (for 
which deprecated warnings would appear separately) or that the solrconfig.xml 
has no  element?

If either is the case then yes based on the code (SolrIndexConfig.java#L153) 
the warnings would be expected-and-harmless though admittedly are confusing, 
and fixable.

Thanks,
Christine

- Original Message -
From: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
At: Feb 24 2016 17:24:45

Hi - i see lots of:

o.a.s.c.Config Beginning with Solr 5.5,  is deprecated, configure 
it on the relevant  instead.

On my development machine for all cores. None of the cores has either parameter 
configured. Is this expected?

Thanks,
Markus




Re:5.5.0 SOLR-8621 deprecation warnings without maxMergeDocs or mergeFactor

2016-02-24 Thread Christine Poerschke (BLOOMBERG/ LONDON)
Hi Markus - thank you for the question.

Could you advise if/that the solrconfig.xml has a  element (for 
which deprecated warnings would appear separately) or that the solrconfig.xml 
has no  element?

If either is the case then yes based on the code (SolrIndexConfig.java#L153) 
the warnings would be expected-and-harmless though admittedly are confusing, 
and fixable.

Thanks,
Christine

- Original Message -
From: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
At: Feb 24 2016 17:24:45

Hi - i see lots of:

o.a.s.c.Config Beginning with Solr 5.5,  is deprecated, configure 
it on the relevant  instead.

On my development machine for all cores. None of the cores has either parameter 
configured. Is this expected?

Thanks,
Markus



Re: Index writer addIndexes method not working

2016-02-24 Thread Erick Erickson
Look at the core admin API in Solr, the MERGEINDEXES action...
On Feb 22, 2016 17:41, "jeba earnest"  wrote:

> My requirement is to add an index folder to the Solr data directory. I am
> generating a Lucene index with a MapReduce program, and later I would like to
> merge that index with the Solr index without bringing Solr down.
>
> I actually tried the index merger tool, but that tool only works when Solr is
> down.
>
> Is there a possibility to merge the segments? Will that solve my problem?
>
>
> What does this API do?
>
>
> https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/index/IndexWriter.html#addIndexes(org.apache.lucene.store.Directory
> ..
> .)
>
> Jeba
>
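For reference, the MERGEINDEXES call Erick mentions is a core admin request along these lines; a hedged sketch with made-up core name and index path, issued from plain Java. The target core stays online, and a commit on it afterwards makes the merged documents visible:

import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class MergeIndexesExample {
    public static void main(String[] args) throws Exception {
        // Merge an externally built Lucene index directory into the live core "collection1".
        String indexDir = URLEncoder.encode("/data/mapreduce-output/index", "UTF-8");
        URL url = new URL("http://localhost:8983/solr/admin/cores"
                + "?action=MERGEINDEXES&core=collection1&indexDir=" + indexDir);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        System.out.println("MERGEINDEXES response code: " + conn.getResponseCode());

        // A commit on the target core is still needed to make the merged docs visible.
        URL commit = new URL("http://localhost:8983/solr/collection1/update?commit=true");
        HttpURLConnection commitConn = (HttpURLConnection) commit.openConnection();
        System.out.println("Commit response code: " + commitConn.getResponseCode());
    }
}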


Solr 5.5.0, connection resets in abstract distributed test persist

2016-02-24 Thread Markus Jelsma
Hi,

We have quite a few unit tests that inherit from the abstract distributed test class
(I haven't got the FQCN at hand). On Solr 5.4.x we had a lot of issues with
connection resets, which I assumed, judging from resolved tickets, had been
resolved with 5.5.0. Did I miss something? Can someone point me to an open
ticket if available?

Many thanks!
Markus

NOTE: reproduce with: ant test  -Dtestcase=TestRecommendSearchHandler 
-Dtests.method=testSearchRecommendations -Dtests.seed=D64C367D450082A8 
-Dtests.locale=en-SG -Dtests.timezone=Europe/Skopje -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
org.apache.solr.client.solrj.SolrServerException: IOException occured when 
talking to server at: https://127.0.0.1:34761/_sn/e/collection1
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:589)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
at 
org.apache.solr.cloud.AbstractFullDistribZkTestBase.commit(AbstractFullDistribZkTestBase.java:1503)
at 
org.apache.solr.cloud.AbstractFullDistribZkTestBase.waitForThingsToLevelOut(AbstractFullDistribZkTestBase.java:1395)
at 
io.openindex.solr.recommend.TestRecommendSearchHandler.testSOLR25(TestRecommendSearchHandler.java:78)



Re: SOLR cloud startup poniting to zookeeper ensemble

2016-02-24 Thread Susheel Kumar
I see your point. Didn't realize that you are using Windows. If it works
using double quotes, please go ahead and launch that way.

Thanks,

Susheel

On Wed, Feb 24, 2016 at 12:44 PM, bbarani  wrote:

> It's still throwing an error without quotes.
>
> solr start -e cloud -noprompt -z
> localhost:2181,localhost:2182,localhost:2183
>
> Invalid command-line option: localhost:2182
>
> Usage: solr start [-f] [-c] [-h hostname] [-p port] [-d directory] [-z
> zkHost] [
> -m memory] [-e example] [-s solr.solr.home] [-a "additional-options"] [-V]
>
>   -fStart Solr in foreground; default starts Solr in the
> background
>   and sends stdout / stderr to solr-PORT-console.log
>
>   -c or -cloud  Start Solr in SolrCloud mode; if -z not supplied, an
> embedded Zo
>
> *Info on using double quotes:*
>
>
> http://lucene.472066.n3.nabble.com/Solr-5-2-1-setup-zookeeper-ensemble-problem-td4215823.html#a4215877
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SOLR-cloud-startup-error-zookeeper-ensemble-windows-tp4259023p4259567.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr stops working...randomly

2016-02-24 Thread Shawn Heisey
On 2/24/2016 11:19 AM, Michael Beccaria wrote:
> We're running solr 4.4.0 running in this software 
> (https://github.com/CDRH/nebnews - Django based newspaper site). Solr is 
> running on Ubuntu 12.04 in Jetty. The site occasionally (once a day) goes 
> down with a Connection Refused error. I’m having a hard time troubleshooting 
> the issue and was looking for help in next steps in trying to find out why it 
> is failing.
>
> After debugging it turns out that it is solr that is refusing the connection 
> (restarting Jetty fixes it every time). It randomly fails.

The immediate possibility for the cause of this problem that comes to
mind is the maxThreads parameter in Jetty.  Beyond that, there is also
the OS process limit.

The maxThreads parameter in the Jetty config defaults to 200, and it is
quite easy to exceed this.  In the Jetty that comes packaged with Solr,
this setting has been changed to 10000, which effectively removes the
limit for a typical Solr install.  Because you are running 4.4 and your
message indicates you are using "service jetty" commands, chances are
that you are NOT using the jetty that came with Solr.  The first thing I
would try is increasing the maxThreads parameter to 10000.

The process limit is increased in /etc/security/limits.conf.  Here are
the additions that I make to this file on my Solr servers, to increase
the limits on the number of processes/threads and open files, both of
which default to 1024:

solr    hard    nproc   6144
solr    soft    nproc   4096

solr    hard    nofile  65535
solr    soft    nofile  49151

Thanks,
Shawn



Solr stops working...randomly

2016-02-24 Thread Michael Beccaria
We're running solr 4.4.0 running in this software 
(https://github.com/CDRH/nebnews - Django based newspaper site). Solr is 
running on Ubuntu 12.04 in Jetty. The site occasionally (once a day) goes down 
with a Connection Refused error. I’m having a hard time troubleshooting the 
issue and was looking for help in next steps in trying to find out why it is 
failing.

After debugging it turns out that it is solr that is refusing the connection 
(restarting Jetty fixes it every time). It randomly fails.

things I've tried:

running sudo service jetty check
Says the service is running

Opened up the port on the server and tried going to the solr admin page. This 
failed until I restarted jetty, then it works.

Checked the solr.log files and no errors are found. The jetty log level is set 
to INFO and I'm hesitant to put it to Debug because of file size growth and the 
long time between failures. The time between failures in the logs simply has a 
normal query at one time followed by a startup log sequence when I restart 
jetty.

Apache logs show tons of traffic (it's still running) from Google bots and 
maybe this is causing issues but I would still expect to find some sort of 
error. There is a mix of 200, 500 and 404 codes. Here's a small sample:

GET /lccn/sn85053037/1981-09-15/ed-1/seq-13/ocr/ HTTP/1.1  500  14814  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/sn86075296/1910-10-27/ed-1/seq-1/ HTTP/1.1  500  14884  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/sn84036028/1925-05-22/ed-1/seq-6/ocr/ HTTP/1.1  500  14791  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/sn84036028/1917-10-28/ed-1/seq-1/ocr.xml HTTP/1.1  200  400827  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/TheRetort/2011-10-07/ed-1/seq-10/ocr/ HTTP/1.1  500  14798  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/TheRetort/1979-05-10/ed-1/seq-8/ocr.xml HTTP/1.1  200  193883  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/sn84036124/1977-02-23/ed-1/seq-12/ocr/ HTTP/1.1  500  14790  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/Emcoe/1958-11-21/ed-1/seq-3/ocr/ HTTP/1.1  500  14760  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
GET /lccn/sn85053252/1909-10-08/ed-1/.rdf HTTP/1.1  404  3051  -  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

I could simply restart jetty nightly I guess but that seems to be putting a 
bandaid on the issue and I'm not sure how to proceed on this one. Any ideas?

Mike


Mike Beccaria
Director of Library Services
Paul Smith’s College
7833 New York 30
Paul Smiths, NY 12970
518.327.6376
mbecca...@paulsmiths.edu
www.paulsmiths.edu



Re: very slow frequent updates

2016-02-24 Thread Szűcs Roland
Thanks again, Jeff. I will check the documentation of join queries because I
have never used it before.

Regards

Roland

2016-02-24 19:07 GMT+01:00 Jeff Wartes :

>
> I suspect your problem is the intersection of “very large document” and
> “high rate of change”. Either of those alone would be fine.
>
> You’re correct, if the thing you need to search or sort by is the thing
> with a high change rate, you probably aren’t going to be able to peel those
> things out of your index.
>
> Perhaps you could work something out with join queries? So you have two
> kinds of documents - book content and book price - and your high-frequency
> change is limited to documents with very little data.
>
>
>
>
>
> On 2/24/16, 4:01 AM, "roland.sz...@booknwalk.com on behalf of Szűcs
> Roland"  szucs.rol...@bookandwalk.hu> wrote:
>
> >I have checked it already in the ref. guide. It is stated that you can not
> >search in external fields:
> >
> https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes
> >
> >Really I am very curios that my problem is not a usual one or the case is
> >that SOLR mainly focuses on search and not a kind of end-to-end support.
> >How this approach works with 1 million documents with frequently changing
> >prices?
> >
> >Thanks your time,
> >
> >Roland
> >
> >2016-02-24 12:39 GMT+01:00 Stefan Matheis :
> >
> >> Depending of what features you do actually need, might be worth a look
> >> on "External File Fields" Roland?
> >>
> >> -Stefan
> >>
> >> On Wed, Feb 24, 2016 at 12:24 PM, Szűcs Roland
> >>  wrote:
> >> > Thanks Jeff your help,
> >> >
> >> > Can it work in production environment? Imagine when my customer
> initiate
> >> a
> >> > query having 1 000 docs in the result set. I can not use the
> pagination
> >> of
> >> > SOLR as the field which is the basis of the sort is not included in
> the
> >> > schema for example the price. The customer wants the list in
> descending
> >> > order of the price.
> >> >
> >> > So I have to get all the 1000 docids from solr and find the metadata
> of
> >> > them in a sql database or in cache in best case. This is the way you
> >> > suggested? Is it not too slow?
> >> >
> >> > Regards,
> >> > Roland
> >> >
> >> > 2016-02-23 19:29 GMT+01:00 Jeff Wartes :
> >> >
> >> >>
> >> >> My suggestion would be to split your problem domain. Use Solr
> >> exclusively
> >> >> for search - index the id and only those fields you need to search
> on.
> >> Then
> >> >> use some other data store for retrieval. Get the id’s from the solr
> >> >> results, and look them up in the data store to get the rest of your
> >> fields.
> >> >> This allows you to keep your solr docs as small as possible, and you
> >> only
> >> >> need to update them when a *searchable* field changes.
> >> >>
> >> >> Every “update" in solr is a delete/insert. Even the "atomic update”
> >> >> feature is just a shortcut for that. It requires stored fields
> because
> >> the
> >> >> data from the stored fields gets copied into the new insert.
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On 2/22/16, 12:21 PM, "Roland Szűcs" 
> >> wrote:
> >> >>
> >> >> >Hi folks,
> >> >> >
> >> >> >We use SOLR 5.2.1. We have ebooks stored in SOLR. The majority of
> the
> >> >> >fields do not change at all like content, author, publisher Only
> >> the
> >> >> >price field changes frequently.
> >> >> >
> >> >> >We let the customers to make full text search so we indexed the
> content
> >> >> >filed. Due to the frequency of the price updates we use the atomic
> >> update
> >> >> >feature. As a requirement of the atomic updates we have to store all
> >> the
> >> >> >fields even the content field which is 1MB/document and we did not
> >> want to
> >> >> >store it just index it.
> >> >> >
> >> >> >As we wanted to update 100 documents with atomic update it took
> about 3
> >> >> >minutes. Taking into account that our metadata /document is 1 Kb and
> >> our
> >> >> >content field / document is 1MB we use 1000 more memory to
> accelerate
> >> the
> >> >> >update process.
> >> >> >
> >> >> >I am almost 100% sure that we make something wrong.
> >> >> >
> >> >> >What is the best practice of the frequent updates when 99% part of a
> >> given
> >> >> >document is constant forever?
> >> >> >
> >> >> >Thank in advance
> >> >> >
> >> >> >--
> >> >> >
> Roland
> >> >> Szűcs
> >> >> >
> Connect
> >> >> with
> >> >> >me on Linkedin <
> >> >> https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> >> >> >
> >> >> >CEO Phone: +36 1 210 81 13
> >> >> >Bookandwalk.hu 
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> >  Szűcs
> >> Roland
> >> > 
> >> Ismerkedjünk
> >> > meg a Linkedin <
> >> https://www.linkedin.com/pub/roland-sz%C5%B1cs

Re: very slow frequent updates

2016-02-24 Thread Jeff Wartes

I suspect your problem is the intersection of “very large document” and “high 
rate of change”. Either of those alone would be fine.

You’re correct, if the thing you need to search or sort by is the thing with a 
high change rate, you probably aren’t going to be able to peel those things out 
of your index. 

Perhaps you could work something out with join queries? So you have two kinds 
of documents - book content and book price - and your high-frequency change is 
limited to documents with very little data.
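For illustration, a hedged sketch of what such a join could look like with SolrJ, assuming both document kinds live in one collection, distinguished by a type field, with price docs pointing at their book through a book_id field. All names are made up, and note that a plain {!join} filter restricts the result set but does not by itself sort books by price:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PriceJoinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/books");
        // Full-text search against the big, rarely-changing book documents...
        SolrQuery q = new SolrQuery("content:gardening");
        q.addFilterQuery("type:book");
        // ...restricted to books whose small, frequently-updated price doc matches a range.
        q.addFilterQuery("{!join from=book_id to=id}type:price AND price:[0 TO 10]");
        QueryResponse rsp = client.query(q);
        System.out.println("Matching books: " + rsp.getResults().getNumFound());
        client.close();
    }
}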





On 2/24/16, 4:01 AM, "roland.sz...@booknwalk.com on behalf of Szűcs Roland" 
 wrote:

>I have checked it already in the ref. guide. It is stated that you can not
>search in external fields:
>https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes
>
>Really I am very curios that my problem is not a usual one or the case is
>that SOLR mainly focuses on search and not a kind of end-to-end support.
>How this approach works with 1 million documents with frequently changing
>prices?
>
>Thanks your time,
>
>Roland
>
>2016-02-24 12:39 GMT+01:00 Stefan Matheis :
>
>> Depending of what features you do actually need, might be worth a look
>> on "External File Fields" Roland?
>>
>> -Stefan
>>
>> On Wed, Feb 24, 2016 at 12:24 PM, Szűcs Roland
>>  wrote:
>> > Thanks Jeff your help,
>> >
>> > Can it work in production environment? Imagine when my customer initiate
>> a
>> > query having 1 000 docs in the result set. I can not use the pagination
>> of
>> > SOLR as the field which is the basis of the sort is not included in the
>> > schema for example the price. The customer wants the list in descending
>> > order of the price.
>> >
>> > So I have to get all the 1000 docids from solr and find the metadata of
>> > them in a sql database or in cache in best case. This is the way you
>> > suggested? Is it not too slow?
>> >
>> > Regards,
>> > Roland
>> >
>> > 2016-02-23 19:29 GMT+01:00 Jeff Wartes :
>> >
>> >>
>> >> My suggestion would be to split your problem domain. Use Solr
>> exclusively
>> >> for search - index the id and only those fields you need to search on.
>> Then
>> >> use some other data store for retrieval. Get the id’s from the solr
>> >> results, and look them up in the data store to get the rest of your
>> fields.
>> >> This allows you to keep your solr docs as small as possible, and you
>> only
>> >> need to update them when a *searchable* field changes.
>> >>
>> >> Every “update" in solr is a delete/insert. Even the "atomic update”
>> >> feature is just a shortcut for that. It requires stored fields because
>> the
>> >> data from the stored fields gets copied into the new insert.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On 2/22/16, 12:21 PM, "Roland Szűcs" 
>> wrote:
>> >>
>> >> >Hi folks,
>> >> >
>> >> >We use SOLR 5.2.1. We have ebooks stored in SOLR. The majority of the
>> >> >fields do not change at all like content, author, publisher Only
>> the
>> >> >price field changes frequently.
>> >> >
>> >> >We let the customers to make full text search so we indexed the content
>> >> >filed. Due to the frequency of the price updates we use the atomic
>> update
>> >> >feature. As a requirement of the atomic updates we have to store all
>> the
>> >> >fields even the content field which is 1MB/document and we did not
>> want to
>> >> >store it just index it.
>> >> >
>> >> >As we wanted to update 100 documents with atomic update it took about 3
>> >> >minutes. Taking into account that our metadata /document is 1 Kb and
>> our
>> >> >content field / document is 1MB we use 1000 more memory to accelerate
>> the
>> >> >update process.
>> >> >
>> >> >I am almost 100% sure that we make something wrong.
>> >> >
>> >> >What is the best practice of the frequent updates when 99% part of a
>> given
>> >> >document is constant forever?
>> >> >
>> >> >Thank in advance
>> >> >
>> >> >--
>> >> > Roland
>> >> Szűcs
>> >> > Connect
>> >> with
>> >> >me on Linkedin <
>> >> https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
>> >> >
>> >> >CEO Phone: +36 1 210 81 13
>> >> >Bookandwalk.hu 
>> >>
>> >
>> >
>> >
>> > --
>> >  Szűcs
>> Roland
>> > 
>> Ismerkedjünk
>> > meg a Linkedin <
>> https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
>> > -en 
>> > Ügyvezető Telefon: +36 1 210 81 13
>> > Bookandwalk.hu 
>>
>
>
>
>-- 
> Szűcs Roland
> Ismerkedjünk
>meg a Linkedin 
>-en 
>Ügyvezető Telefon: +36 1 210 81 13
>Bookandwalk.hu 

Re: /select changes between 4 and 5

2016-02-24 Thread Yonik Seeley
On Wed, Feb 24, 2016 at 12:51 PM, Shawn Heisey  wrote:
> On 2/24/2016 9:09 AM, Mike Thomsen wrote:
>> Yeah, it was a problem on my end. Not just the content-type as you
>> suggested, but I had to wrap that whole JSON body so it looked like this:
>>
>> {
>> "params": { ///That block pasted here }
>> }
>
> I'm surprised you can get JSON to work at all.

Support for a JSON query body was added in 5.x.
Still kind of experimental... I haven't had the time to flesh it out
more, but you can just put all the normal query params in the "params"
block.

http://yonik.com/solr-json-request-api/

-Yonik
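
For illustration, a quick sketch of such a request from Java, assuming a Solr 5.x /select endpoint as described at the link above. The collection and params are made up; the key points are the application/json content type and the "params" block:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class JsonQueryBodyExample {
    public static void main(String[] args) throws Exception {
        // Normal query params go inside the "params" block of the JSON body.
        String body = "{ \"query\" : \"*:*\", \"params\" : { \"fl\" : \"id\", \"rows\" : 5 } }";

        URL url = new URL("http://localhost:8983/solr/techproducts/select");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Response code: " + conn.getResponseCode());
    }
}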


Re: What search metrics are useful?

2016-02-24 Thread Doug Turnbull
I would also point you at many of Mr. Underwood's blog posts, as they have
helped me quite a bit :)

http://techblog.chegg.com/2012/12/12/measuring-search-relevance-with-mrr/

On Wed, Feb 24, 2016 at 11:37 AM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> For relevance, I would also look at retention metrics. Harder to tie back
> to a specific search. But what happens after the conversion? Did they
> purchase the product and hate it? Or did they come back for more? Retention
> metrics say a lot about the whole experience. But for many search-heavy
> applications search is 90% of the user experience. Was it really relevant
> if they purchased a product, but were dissatisfied? Did search make a
> promise that wasn't delivered on? This is something I personally noodle
> about and not something I have a canned solution for.
>
> There's an obsession with what I think of as "engagement metrics" or
> "session metrics". Engagement metrics like CTR and are handy because
> they're easy to tie to a search. Search, search, search , search,
> search,  .
>
> I'm always cautious of click-thru metrics. Beware of the biases in your
> clickthru metrics
>
> http://opensourceconnections.com/blog/2014/10/08/when-click-scoring-can-hurt-search-relevance-a-roadmap-to-better-signals-processing-in-search/
>
> Another reason to be cautious is user behavioral data can require
> domain-specific interpretation. A good book on recommendor can talk more
> about interpreting user behavior to see if an item was relevant. For
> examples Pratical Recommender Systems by Kim Falk (a Manning MEAP) spends a
> great deal of time talking through gathering evidence whether a user liked
> the thing they clicked on or not. For example, did the user click a movie
> and go back immediately? Start watching a movie and go back in 5 minutes
> indicating they hated it? Or watch a movie all the way through?
>
> Related to interpreting behavior -- understand the kinds of searchers out
> there. Understand what sort of user experience you've built. Informational
> searchers doing research will look at every item and evaluate them. For
> example, a paralegal searching a legal application may need to examine
> every result carefully. Navigational searchers want to hunt for one thing.
> Everyday e-commerce searchers clicking on every result is probably
> disastrous. However, the purchasing dept of an organization MIGHT look at
> every result and that might be ok.
>
> Beware of search's long tail. You can gather metrics on all your searches
> and where users are clicking, but search has a notorious long tail. Many of
> my clients have meaningful metrics over perhaps the top 50 results before
> quickly going off into obscurity of statistical insignificance per search.
> This depends entirely on the type of search application you're developing.
> Some kind of niche product with a handful of searches per day? Or giant
> e-commerce site?
>
> Sometimes what's simpler is to do usability testing or to sit with an
> expert user and gather relevance judgments--grades on what's relevant and
> what's not. (this is what we do with Quepid). This works particularly well
> for these niche, expert search subjects
>
> Anyway, there's still quite a bit of art to interpreting search metrics. I
> would argue to keep the human and domain expert in the loop understanding
> and interpreting metrics. But its a yin-and-yang. You also need to be able
> to tell that supposed domain expert when they're wrong.
>
> Sorry for long winded email, but these topics dominate my
> dreams/nightmares these days :)
>
> Best
> -Doug
>
>
>
>
> On Wed, Feb 24, 2016 at 11:20 AM, Walter Underwood 
> wrote:
>
>> Click through rate (CTR) is fundamental. That is easy to understand and
>> integrates well with other business metrics like conversion. CTR is at
>> least one click anywhere in the result set (first page, second page, …).
>> Count multiple clicks as a single success. The metric is, “at least one
>> click”.
>>
>> No hit rate is sort of useful, but you need to know which queries are
>> getting no hits, so you can fix it.
>>
>> For latency metrics, look at 90th percentile or 95th percentile. Average
>> is useless because response time is a one-sided distribution, so it will be
>> thrown off by outliers. Percentiles have a direct customer satisfaction
>> interpretation. 90% of searches were under one second, for example. Median
>> response time should be very, very fast because of caching in Solr. During
>> busy periods, our median response time is about 1.5 ms.
>>
>> Number of different queries per conversion is a good way to look how
>> query assistance is working. Things like autosuggest, fuzzy, etc.
>>
>> About 10% of queries will be misspelled, so you do need to deal with that.
>>
>> Finding underperforming queries is trickier. I really need to write an
>> article on that.
>>
>> “Search Analytics for Your Site” by Lou Rosenfeld is a good introduction.
>>
>> http://rosenfeldmedia.com/

Re: /select changes between 4 and 5

2016-02-24 Thread Shawn Heisey
On 2/24/2016 9:09 AM, Mike Thomsen wrote:
> Yeah, it was a problem on my end. Not just the content-type as you
> suggested, but I had to wrap that whole JSON body so it looked like this:
>
> {
> "params": { ///That block pasted here }
> }

I'm surprised you can get JSON to work at all.  I would expect the
needed format to be what browsers send when they do a post from an html
form, like the example Yonik tried.

Whatever is parsing those parameters, whether it is Solr or the
servlet container, probably got more restrictive about the format in
5.x.  Solr 5.x uses Jetty 9 for running servlets.  There's no way for me
to know what container you were using with the 4.10 version.

Is there documentation somewhere for using a JSON post body with Solr?

Thanks,
Shawn



Re: SOLR cloud startup poniting to zookeeper ensemble

2016-02-24 Thread bbarani
It's still throwing an error without quotes.

solr start -e cloud -noprompt -z
localhost:2181,localhost:2182,localhost:2183

Invalid command-line option: localhost:2182

Usage: solr start [-f] [-c] [-h hostname] [-p port] [-d directory] [-z
zkHost] [
-m memory] [-e example] [-s solr.solr.home] [-a "additional-options"] [-V]

  -fStart Solr in foreground; default starts Solr in the
background
  and sends stdout / stderr to solr-PORT-console.log

  -c or -cloud  Start Solr in SolrCloud mode; if -z not supplied, an
embedded Zo

*Info on using double quotes:*

http://lucene.472066.n3.nabble.com/Solr-5-2-1-setup-zookeeper-ensemble-problem-td4215823.html#a4215877



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-cloud-startup-error-zookeeper-ensemble-windows-tp4259023p4259567.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: /select changes between 4 and 5

2016-02-24 Thread Markus Jelsma
Great! Thanks!

 
 
-Original message-
> From:Yonik Seeley 
> Sent: Wednesday 24th February 2016 18:04
> To: solr-user@lucene.apache.org
> Subject: Re: /select changes between 4 and 5
> 
> On Wed, Feb 24, 2016 at 11:21 AM, Markus Jelsma
>  wrote:
> > Re: POST in general still works for queries... I just verified it:
> >
> > This is not supposed to change i hope? We rely on POST for some huge 
> > automated queries. Instead of constantly increasing URL length limit, we 
> > rely on POST.
> 
> Yep, and IIRC Solr even uses POST internally when doing distributed search.
> 
> -Yonik
> 


5.5.0 SOLR-8621 deprecation warnings without maxMergeDocs or mergeFactor

2016-02-24 Thread Markus Jelsma
Hi - i see lots of:

o.a.s.c.Config Beginning with Solr 5.5,  is deprecated, configure 
it on the relevant  instead.

On my development machine for all cores. None of the cores has either parameter 
configured. Is this expected?

Thanks,
Markus


Re: /select changes between 4 and 5

2016-02-24 Thread Yonik Seeley
On Wed, Feb 24, 2016 at 11:21 AM, Markus Jelsma
 wrote:
> Re: POST in general still works for queries... I just verified it:
>
> This is not supposed to change i hope? We rely on POST for some huge 
> automated queries. Instead of constantly increasing URL length limit, we rely 
> on POST.

Yep, and IIRC Solr even uses POST internally when doing distributed search.

-Yonik


Re: Null Pointer Exception on distributed search

2016-02-24 Thread Lokesh Chhaparwal
Hi,

Can someone please update on this exception trace while we are using
distributed search using shards parameter (solr-master-slave).

Thanks,
Lokesh


On Wed, Feb 17, 2016 at 5:33 PM, Lokesh Chhaparwal 
wrote:

> Hi,
>
> We are facing NPE while using distributed search (Solr version 4.7.2)
> (using *shards* parameter in solr query)
>
> Exception Trace:
> ERROR - 2016-02-17 16:44:26.616; org.apache.solr.common.SolrException;
> null:java.lang.NullPointerException
> at org.apache.solr.response.XMLWriter.writeSolrDocument(XMLWriter.java:190)
> at
> org.apache.solr.response.TextResponseWriter.writeSolrDocumentList(TextResponseWriter.java:222)
> at
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:184)
> at org.apache.solr.response.XMLWriter.writeNamedList(XMLWriter.java:227)
> at
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)
> at org.apache.solr.response.XMLWriter.writeArray(XMLWriter.java:273)
> at
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:190)
> at org.apache.solr.response.XMLWriter.writeNamedList(XMLWriter.java:227)
> at
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)
> at org.apache.solr.response.XMLWriter.writeNamedList(XMLWriter.java:227)
> at
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188)
> at org.apache.solr.response.XMLWriter.writeResponse(XMLWriter.java:111)
> at
> org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:40)
> at
> org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:756)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:428)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
> at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
> at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
> at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
> at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
> at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> at
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
> at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
> at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
> at
> org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1040)
> at
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
> at
> org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:314)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at
> org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Can somebody help us in finding the root cause of this exception?
>
> FYI, we have documents split across 8 shards (num docs ~ 20 million) with
> index size ~ 4 GB per node. We are using c3.2xlarge amazon ec2 machines
> with solr running in apache tomcat (memory config 8 to 10 gb). Request
> count ~ 200/sec.
>
> Thanks,
> Lokesh
>
>


SolrJ + JSON Facet API

2016-02-24 Thread Georg Sorst
Hi list!

Does SolrJ already wrap the new JSON Facet API? I couldn't find any info
about this.

If not, what's the best way for a Java client to build and send requests
when you want to use the JSON Facets?

On a side note, since the JSON Facet API uses POST I will not be able to
see the requested facets in my Solr logs anymore, right?

Thanks!
Georg
-- 
*Georg M. Sorst I CTO*
FINDOLOGIC GmbH

Jakob-Haringer-Str. 5a | 5020 Salzburg I T.: +43 662 456708
E.: g.so...@findologic.com
www.findologic.com Follow us on: XING, facebook, Twitter


See you at *Internet World on 01.03 and 02.03.2016 - Hall B6, Stand D182!*
Make an appointment here!


Re: What search metrics are useful?

2016-02-24 Thread Doug Turnbull
For relevance, I would also look at retention metrics. Harder to tie back
to a specific search. But what happens after the conversion? Did they
purchase the product and hate it? Or did they come back for more? Retention
metrics say a lot about the whole experience. But for many search-heavy
applications search is 90% of the user experience. Was it really relevant
if they purchased a product, but were dissatisfied? Did search make a
promise that wasn't delivered on? This is something I personally noodle
about and not something I have a canned solution for.

There's an obsession with what I think of as "engagement metrics" or
"session metrics". Engagement metrics like CTR and are handy because
they're easy to tie to a search. Search, search, search , search,
search,  .

I'm always cautious of click-thru metrics. Beware of the biases in your
clickthru metrics
http://opensourceconnections.com/blog/2014/10/08/when-click-scoring-can-hurt-search-relevance-a-roadmap-to-better-signals-processing-in-search/

Another reason to be cautious is user behavioral data can require
domain-specific interpretation. A good book on recommender systems can talk more
about interpreting user behavior to see if an item was relevant. For
example, Practical Recommender Systems by Kim Falk (a Manning MEAP) spends a
great deal of time talking through gathering evidence whether a user liked
the thing they clicked on or not. For example, did the user click a movie
and go back immediately? Start watching a movie and go back in 5 minutes
indicating they hated it? Or watch a movie all the way through?

Related to interpreting behavior -- understand the kinds of searchers out
there. Understand what sort of user experience you've built. Informational
searchers doing research will look at every item and evaluate them. For
example, a paralegal searching a legal application may need to examine
every result carefully. Navigational searchers want to hunt for one thing.
Everyday e-commerce searchers clicking on every result is probably
disastrous. However, the purchasing dept of an organization MIGHT look at
every result and that might be ok.

Beware of search's long tail. You can gather metrics on all your searches
and where users are clicking, but search has a notorious long tail. Many of
my clients have meaningful metrics over perhaps the top 50 results before
quickly going off into obscurity of statistical insignificance per search.
This depends entirely on the type of search application you're developing.
Some kind of niche product with a handful of searches per day? Or giant
e-commerce site?

Sometimes what's simpler is to do usability testing or to sit with an
expert user and gather relevance judgments--grades on what's relevant and
what's not. (this is what we do with Quepid). This works particularly well
for these niche, expert search subjects

Anyway, there's still quite a bit of art to interpreting search metrics. I
would argue to keep the human and domain expert in the loop understanding
and interpreting metrics. But its a yin-and-yang. You also need to be able
to tell that supposed domain expert when they're wrong.

Sorry for long winded email, but these topics dominate my dreams/nightmares
these days :)

Best
-Doug




On Wed, Feb 24, 2016 at 11:20 AM, Walter Underwood 
wrote:

> Click through rate (CTR) is fundamental. That is easy to understand and
> integrates well with other business metrics like conversion. CTR is at
> least one click anywhere in the result set (first page, second page, …).
> Count multiple clicks as a single success. The metric is, “at least one
> click”.
>
> No hit rate is sort of useful, but you need to know which queries are
> getting no hits, so you can fix it.
>
> For latency metrics, look at 90th percentile or 95th percentile. Average
> is useless because response time is a one-sided distribution, so it will be
> thrown off by outliers. Percentiles have a direct customer satisfaction
> interpretation. 90% of searches were under one second, for example. Median
> response time should be very, very fast because of caching in Solr. During
> busy periods, our median response time is about 1.5 ms.
>
> Number of different queries per conversion is a good way to look how query
> assistance is working. Things like autosuggest, fuzzy, etc.
>
> About 10% of queries will be misspelled, so you do need to deal with that.
>
> Finding underperforming queries is trickier. I really need to write an
> article on that.
>
> “Search Analytics for Your Site” by Lou Rosenfeld is a good introduction.
>
> http://rosenfeldmedia.com/books/search-analytics-for-your-site/ <
> http://rosenfeldmedia.com/books/search-analytics-for-your-site/>
>
> Sea Urchin is doing some good work in search metrics:
> https://seaurchin.io/ 
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> Search Guy, Chegg
>
> > On Feb 24, 2016, at 2:38 AM, Emir Arnautovic <
> emir.arnauto...@sematext.com> wrote:

Re: WhitespaceTokenizerFactory and PathHierarchyTokenizerFactory

2016-02-24 Thread Jack Krupansky
Your statement makes no sense. Please clarify. Express your requirement(s)
in plain English first before dragging in possible solutions. Technically,
path elements can have embedded spaces.

-- Jack Krupansky

On Wed, Feb 24, 2016 at 6:53 AM, Anil  wrote:

> HI,
>
> i need to use both WhitespaceTokenizerFactory and
> PathHierarchyTokenizerFactory for use case.
>
> Solr supports only one tokenizer. is there any way we can achieve
> PathHierarchyTokenizerFactory  functionality with filters ?
>
> Please advice.
>
> Regards,
> Anil
>


Re: Query time de-boost

2016-02-24 Thread shamik
Binoy, 0.1 is still a positive boost. With title getting the highest weight,
this won't make any difference. I've tried this as well.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-time-de-boost-tp4259309p4259552.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query time de-boost

2016-02-24 Thread shamik
Hi Emir,

I have a bunch of ContentGroup values, so boosting them individually is
cumbersome. I have boosts on the query fields:

qf=text^6 title^15 IndexTerm^8

and 

bq=Source:simplecontent^10 Source:Help^20
(-ContentGroup-local:("Developer"))^99

I was hoping *(-ContentGroup-local:("Developer"))^99* would implicitly boost
the rest, but that didn't happen.

I'm using edismax.
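
(A commonly suggested pattern for this - sketched here untested, using the
field names from above - is to anchor the negative clause to a match-all
query, since a purely negative clause inside bq matches nothing by itself:)

  bq=Source:simplecontent^10 Source:Help^20 (*:* -ContentGroup-local:"Developer")^99

With edismax this boosts every document that is NOT tagged Developer, which
has the effect of pushing the Developer-tagged documents down without
filtering them out.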





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-time-de-boost-tp4259309p4259551.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: /select changes between 4 and 5

2016-02-24 Thread Markus Jelsma
Re: POST in general still works for queries... I just verified it: 

This is not supposed to change, I hope? We rely on POST for some huge automated 
queries. Instead of constantly increasing the URL length limit, we rely on POST.

Regards,
Markus

-Original message-
> From:Yonik Seeley 
> Sent: Wednesday 24th February 2016 17:06
> To: solr-user@lucene.apache.org
> Subject: Re: /select changes between 4 and 5
> 
> POST in general still works for queries... I just verified it:
> 
> curl -XPOST "http://localhost:8983/solr/techproducts/select"; -d "q=*:*"
> 
> Maybe it's your content-type (since it seems like you are posting
> Python)... Were you using some sort of custom code that could
> read/accept other content types?
> 
> -Yonik
> 
> 
> On Wed, Feb 24, 2016 at 8:48 AM, Mike Thomsen  wrote:
> > With 4.10, we used to post JSON like this example (part of it is Python) to
> > /select:
> >
> > {
> > "q": "LONG_QUERY_HERE",
> > "fq": fq,
> > "fl": ["id", "title", "date_of_information", "link", "search_text"],
> > "rows": 100,
> > "wt": "json",
> > "indent": "true",
> > "_": int(time.time())
> > }
> >
> > We just upgraded to 5.4.1, and now we can't seem to POST anything to
> > /select. I tried it out in the admin tool, and it only does GET operations
> > against /select (tried changing it to POST and moving query string to the
> > body with Firefox dev tools, but that failed).
> >
> > Is there a way to keep doing something like what we were doing or do we
> > need to limit ourselves to GETs? I think our queries are all small enough
> > now for that, but it would helpful to know for planning.
> >
> > Thanks,
> >
> > Mike
> 


Re: What search metrics are useful?

2016-02-24 Thread Walter Underwood
Click through rate (CTR) is fundamental. That is easy to understand and 
integrates well with other business metrics like conversion. CTR is at least 
one click anywhere in the result set (first page, second page, …). Count 
multiple clicks as a single success. The metric is, “at least one click”.

No hit rate is sort of useful, but you need to know which queries are getting 
no hits, so you can fix it.

For latency metrics, look at 90th percentile or 95th percentile. Average is 
useless because response time is a one-sided distribution, so it will be thrown 
off by outliers. Percentiles have a direct customer satisfaction 
interpretation. 90% of searches were under one second, for example. Median 
response time should be very, very fast because of caching in Solr. During busy 
periods, our median response time is about 1.5 ms.

Number of different queries per conversion is a good way to look at how query 
assistance is working. Things like autosuggest, fuzzy, etc.

About 10% of queries will be misspelled, so you do need to deal with that.

Finding underperforming queries is trickier. I really need to write an article 
on that.

“Search Analytics for Your Site” by Lou Rosenfeld is a good introduction.

http://rosenfeldmedia.com/books/search-analytics-for-your-site/ 


Sea Urchin is doing some good work in search metrics: https://seaurchin.io/ 


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
Search Guy, Chegg

> On Feb 24, 2016, at 2:38 AM, Emir Arnautovic  
> wrote:
> 
> Hi Bill,
> You can take a look at Sematext's search analytics 
> (https://sematext.com/search-analytics). It provides some of metrics you 
> mentioned, plus some additional (top queries, CTR, click stats, paging stats 
> etc.). In combination with Sematext's performance metrics 
> (https://sematext.com/spm) you can have full picture of your search 
> infrastructure.
> 
> Regards,
> Emir
> 
> -- 
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
> 
> 
> On 24.02.2016 04:07, William Bell wrote:
>> How do others look at search metrics?
>> 
>> 1. Search conversion? Do you look at searches and if the user does not
>> click on a result, and reruns the search that would be a failure?
>> 
>> 2. How to measure auto complete success metrics?
>> 
>> 3. Facets/filters could be considered negative, since we did not find the
>> results that the user wanted, and now they are filtering - who to measure?
>> 
>> 4. One easy metric is searches with 0 results. We could auto expand the geo
>> distance or ask the user "did you mean" ?
>> 
>> 5. Another easy one would be tech performance: "time it takes in seconds to
>> get a result".
>> 
>> 6. How to measure fuzzy? How do you know you need more synonyms? How to
>> measure?
>> 
>> 7. How many searches it takes before the user clicks on a result?
>> 
>> Other ideas? Is there a video or presentation on search metrics that would
>> be useful?
>> 
> 



Re: /select changes between 4 and 5

2016-02-24 Thread Mike Thomsen
Yeah, it was a problem on my end. Not just the content-type as you
suggested, but I had to wrap that whole JSON body so it looked like this:

{
"params": { ///That block pasted here }
}
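
For reference, a fuller sketch of what that ends up looking like (the
collection name and parameter values here are just placeholders, assuming the
default /select handler on 5.4.1):

  curl -XPOST "http://localhost:8983/solr/techproducts/select" \
    -H 'Content-Type: application/json' \
    -d '{
      "params": {
        "q": "*:*",
        "fq": "title:foo",
        "fl": "id,title",
        "rows": 100,
        "wt": "json",
        "indent": "true"
      }
    }'

The old request body is simply nested one level down under "params", and the
Content-Type has to be application/json so Solr routes it through the JSON
request API.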

On Wed, Feb 24, 2016 at 11:05 AM, Yonik Seeley  wrote:

> POST in general still works for queries... I just verified it:
>
> curl -XPOST "http://localhost:8983/solr/techproducts/select"; -d "q=*:*"
>
> Maybe it's your content-type (since it seems like you are posting
> Python)... Were you using some sort of custom code that could
> read/accept other content types?
>
> -Yonik
>
>
> On Wed, Feb 24, 2016 at 8:48 AM, Mike Thomsen 
> wrote:
> > With 4.10, we used to post JSON like this example (part of it is Python)
> to
> > /select:
> >
> > {
> > "q": "LONG_QUERY_HERE",
> > "fq": fq,
> > "fl": ["id", "title", "date_of_information", "link", "search_text"],
> > "rows": 100,
> > "wt": "json",
> > "indent": "true",
> > "_": int(time.time())
> > }
> >
> > We just upgraded to 5.4.1, and now we can't seem to POST anything to
> > /select. I tried it out in the admin tool, and it only does GET
> operations
> > against /select (tried changing it to POST and moving query string to the
> > body with Firefox dev tools, but that failed).
> >
> > Is there a way to keep doing something like what we were doing or do we
> > need to limit ourselves to GETs? I think our queries are all small enough
> > now for that, but it would helpful to know for planning.
> >
> > Thanks,
> >
> > Mike
>


Re: /select changes between 4 and 5

2016-02-24 Thread Yonik Seeley
POST in general still works for queries... I just verified it:

curl -XPOST "http://localhost:8983/solr/techproducts/select"; -d "q=*:*"

Maybe it's your content-type (since it seems like you are posting
Python)... Were you using some sort of custom code that could
read/accept other content types?

-Yonik


On Wed, Feb 24, 2016 at 8:48 AM, Mike Thomsen  wrote:
> With 4.10, we used to post JSON like this example (part of it is Python) to
> /select:
>
> {
> "q": "LONG_QUERY_HERE",
> "fq": fq,
> "fl": ["id", "title", "date_of_information", "link", "search_text"],
> "rows": 100,
> "wt": "json",
> "indent": "true",
> "_": int(time.time())
> }
>
> We just upgraded to 5.4.1, and now we can't seem to POST anything to
> /select. I tried it out in the admin tool, and it only does GET operations
> against /select (tried changing it to POST and moving query string to the
> body with Firefox dev tools, but that failed).
>
> Is there a way to keep doing something like what we were doing or do we
> need to limit ourselves to GETs? I think our queries are all small enough
> now for that, but it would helpful to know for planning.
>
> Thanks,
>
> Mike


RE: I have one small question that always intrigue me

2016-02-24 Thread Davis, Daniel (NIH/NLM) [C]
I've wondered about this as well. Recall that the proper architecture for 
Solr, as well as ZooKeeper, is as a back-end service, part of a tiered 
architecture with web application servers in front. Solr and other search 
engines should fit in at the same layer as RDBMS and NoSQL, with the web 
applications in front of them. In some larger systems there is even an 
Enterprise SOA layer in between, but I've never worked on a project where I 
felt that was truly justified; it is probably a matter of scale, however.

The common-case solution relies on this architecture - Solr and ZooKeeper can 
be protected by IP address firewalls both off-system and on-system. The 
network firewalls (AWS security policy) allow only certain IP 
addresses/networks to connect to Solr and ZooKeeper, and the local system 
firewalls act as a backup to them. The SHA1 checksum within ZooKeeper and the 
Basic Authentication within SolrCloud then act as a way to fine-tune access 
control; they are not so much there to protect Solr and ZooKeeper as to 
allow a division of privileges.
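
As a rough sketch of that firewall layer (the subnets and ports here are
assumptions for illustration, not anything from this thread), the on-system
rules might look like:

  # Allow only the app-server subnet to reach Solr (8983) and ZooKeeper clients (2181)
  iptables -A INPUT -p tcp -s 10.0.1.0/24 --dport 8983 -j ACCEPT
  iptables -A INPUT -p tcp -s 10.0.1.0/24 --dport 2181 -j ACCEPT
  # Allow ZooKeeper ensemble peers to reach the quorum/leader-election ports
  iptables -A INPUT -p tcp -s 10.0.2.0/24 --dport 2888:3888 -j ACCEPT
  # Drop everything else aimed at those ports
  iptables -A INPUT -p tcp --dport 8983 -j DROP
  iptables -A INPUT -p tcp --dport 2181 -j DROP
  iptables -A INPUT -p tcp --dport 2888:3888 -j DROP

The network-level firewall duplicates the same allow-list, so a mistake in one
layer is caught by the other.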

Some sites will find this insufficient:
- Solr supports SSL - 
https://cwiki.apache.org/confluence/display/solr/Enabling+SSL
- ZooKeeper supports SSL - 
https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide

Both also at this point support custom authentication providers.

My Solr is less protected than it should be, but I have mod_auth_cas protecting 
the Solr admin interface, and certain request handlers can be accessed without 
this security through hand-built Apache httpd conf.d files for each core.
There is a load-balancer (like Amazon Elastic Load Balancer (ELB)) in front of 
all Solr nodes, and since fault-tolerance is needed only for search, not for 
indexing, this is adequate. In other words, my Solr clients would not 
operate in SolrCloud mode, even if I made the Solr instance itself SolrCloud 
for ease of management. I'm having a little bit of a problem justifying this 
setup - the Role-Based Authorization Plugin for Solr Basic Auth only scales to 
enterprise use if you have a web front-end to manage the users, passwords, 
groups, and roles.

Does this help?

P.S. - Generally, one cross-posts to another list only when one does not 
receive a good reply on the first list. I can see how both 
u...@zookeeper.apache.org and solr-user@lucene.apache.org may be justified, but 
I don't see how you can justify more lists than this.

-Original Message-
From: Zara Parst [mailto:edotserv...@gmail.com] 
Sent: Wednesday, February 24, 2016 3:27 AM
To: zookeeper-u...@hadoop.apache.org; f...@apache.org; AALSIHE 
; u...@zookeeper.apache.org; solr-user@lucene.apache.org; 
d...@nutch.apache.org; u...@nutch.apache.org; comm...@lucene.apache.org; 
u...@lucene.apache.org
Subject: I have one small question that always intrigue me

Hi everyone,

I am really need your help, please read below


If we have to run solr in cloud mode, we are going to use zookeeper,   now
any zookeeper client can connect to zookeeper server, Zookeeper has facility to 
protect znode however any one can see znode acl however password could be 
encrypted.  Decrypting password or guessing password is not a big deal. As we 
know password is SHA encrypted also there is no limitation of number of try to 
authorize with ACL. So my point is how to safegard zookeeper.

I can guess few things

a. Don't reveal ip of your zookeeper ( security with obscurity ) b. ip table 
which is also not a very good idea c. what else ??

My guess was if some how we can protect zookeeper server itself by asking 
client to authorize them self before it can make connection to ensemble even at 
root ( /) znode.

Please please at least comment on this , I really need your help.


Re: importing 4.10.2 solr cloud repository to 5.4.1

2016-02-24 Thread Shawn Heisey
On 2/23/2016 11:10 PM, Neeraj Bhatt wrote:
> Hello
>
> We have a solr cloud stored and indexed data of around 25 lakh documents
> We recently moved to solr 5.4.1 but are unable to move our indexed
> data. What approach we should follow
>
> 1. data import handler works in solr cloud ? what should we give in
> url like  url="http://192.168.34.218:8080/solr/client_sku_shard1_replica3";
> , this will have shard name, so all documents won't be imported

SolrEntityProcessor in DIH will only work if your index meets the
requirements for Atomic Updates.  Basically, every field must be stored,
unless it is a copyField destination:

https://wiki.apache.org/solr/Atomic_Updates#Caveats_and_Limitations

> 2. direct copying of index will work ? There are some schema changes
> like from solr.Int to solr.TrieInt etc

If the schema uses different classes, you will not be able to use the
old index directly.  The schema would need to be completely unchanged,
but it sounds like your old schema is using classes that are no longer
present in 5.x.

> 3. write code to fetch from solr 4.10.2 and push into 5.4.1 this is
> time consuming, though can be improved by using multithreading

This has the same requirements as SolrEntityProcessor.

A complete reindex in 5.x from the original data source would be the
best option, but if your index meets the Atomic Update requirements, you
could go with one of the options that you numbered 1 or 3.

Thanks,
Shawn



/select changes between 4 and 5

2016-02-24 Thread Mike Thomsen
With 4.10, we used to post JSON like this example (part of it is Python) to
/select:

{
"q": "LONG_QUERY_HERE",
"fq": fq,
"fl": ["id", "title", "date_of_information", "link", "search_text"],
"rows": 100,
"wt": "json",
"indent": "true",
"_": int(time.time())
}

We just upgraded to 5.4.1, and now we can't seem to POST anything to
/select. I tried it out in the admin tool, and it only does GET operations
against /select (tried changing it to POST and moving query string to the
body with Firefox dev tools, but that failed).

Is there a way to keep doing something like what we were doing, or do we
need to limit ourselves to GETs? I think our queries are all small enough
now for that, but it would be helpful to know for planning.

Thanks,

Mike


Re: very slow frequent updates

2016-02-24 Thread Szűcs Roland
I have already checked it in the ref. guide. It states that you cannot
search on external fields:
https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes

I am really curious: is my problem that unusual, or is it the case that Solr
mainly focuses on search rather than on end-to-end support? How does this
approach work with 1 million documents with frequently changing prices?

Thanks for your time,

Roland

2016-02-24 12:39 GMT+01:00 Stefan Matheis :

> Depending of what features you do actually need, might be worth a look
> on "External File Fields" Roland?
>
> -Stefan
>
> On Wed, Feb 24, 2016 at 12:24 PM, Szűcs Roland
>  wrote:
> > Thanks Jeff your help,
> >
> > Can it work in production environment? Imagine when my customer initiate
> a
> > query having 1 000 docs in the result set. I can not use the pagination
> of
> > SOLR as the field which is the basis of the sort is not included in the
> > schema for example the price. The customer wants the list in descending
> > order of the price.
> >
> > So I have to get all the 1000 docids from solr and find the metadata of
> > them in a sql database or in cache in best case. This is the way you
> > suggested? Is it not too slow?
> >
> > Regards,
> > Roland
> >
> > 2016-02-23 19:29 GMT+01:00 Jeff Wartes :
> >
> >>
> >> My suggestion would be to split your problem domain. Use Solr
> exclusively
> >> for search - index the id and only those fields you need to search on.
> Then
> >> use some other data store for retrieval. Get the id’s from the solr
> >> results, and look them up in the data store to get the rest of your
> fields.
> >> This allows you to keep your solr docs as small as possible, and you
> only
> >> need to update them when a *searchable* field changes.
> >>
> >> Every “update" in solr is a delete/insert. Even the "atomic update”
> >> feature is just a shortcut for that. It requires stored fields because
> the
> >> data from the stored fields gets copied into the new insert.
> >>
> >>
> >>
> >>
> >>
> >> On 2/22/16, 12:21 PM, "Roland Szűcs" 
> wrote:
> >>
> >> >Hi folks,
> >> >
> >> >We use SOLR 5.2.1. We have ebooks stored in SOLR. The majority of the
> >> >fields do not change at all like content, author, publisher Only
> the
> >> >price field changes frequently.
> >> >
> >> >We let the customers to make full text search so we indexed the content
> >> >filed. Due to the frequency of the price updates we use the atomic
> update
> >> >feature. As a requirement of the atomic updates we have to store all
> the
> >> >fields even the content field which is 1MB/document and we did not
> want to
> >> >store it just index it.
> >> >
> >> >As we wanted to update 100 documents with atomic update it took about 3
> >> >minutes. Taking into account that our metadata /document is 1 Kb and
> our
> >> >content field / document is 1MB we use 1000 more memory to accelerate
> the
> >> >update process.
> >> >
> >> >I am almost 100% sure that we make something wrong.
> >> >
> >> >What is the best practice of the frequent updates when 99% part of a
> given
> >> >document is constant forever?
> >> >
> >> >Thank in advance
> >> >
> >> >--
> >> > Roland
> >> Szűcs
> >> > Connect
> >> with
> >> >me on Linkedin <
> >> https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> >> >
> >> >CEO Phone: +36 1 210 81 13
> >> >Bookandwalk.hu 
> >>
> >
> >
> >
> > --
> >  Szűcs
> Roland
> > 
> Ismerkedjünk
> > meg a Linkedin <
> https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > -en 
> > Ügyvezető Telefon: +36 1 210 81 13
> > Bookandwalk.hu 
>



-- 
Szűcs Roland
Connect with me on LinkedIn
<https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
CEO  Phone: +36 1 210 81 13
Bookandwalk.hu


WhitespaceTokenizerFactory and PathHierarchyTokenizerFactory

2016-02-24 Thread Anil
Hi,

I need to use both WhitespaceTokenizerFactory and
PathHierarchyTokenizerFactory for a use case.

Solr supports only one tokenizer per analyzer. Is there any way we can achieve
PathHierarchyTokenizerFactory functionality with filters?

Please advise.

Regards,
Anil
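
(For reference, the workaround usually reached for - sketched here with
made-up field names, since the actual schema isn't shown - is to keep one
tokenizer per field type and index the same input twice via copyField:)

  <fieldType name="text_ws" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_path" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
    </analyzer>
  </fieldType>

  <field name="doc_path"           type="text_ws"   indexed="true" stored="true"/>
  <field name="doc_path_hierarchy" type="text_path" indexed="true" stored="false"/>
  <copyField source="doc_path" dest="doc_path_hierarchy"/>

Queries can then target whichever field matches the intent (whitespace tokens
vs. path prefixes).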


Re: very slow frequent updates

2016-02-24 Thread Stefan Matheis
Depending on what features you actually need, it might be worth a look
at "External File Fields", Roland?

-Stefan
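
(For reference, a minimal sketch of what that could look like - the field and
key names are assumptions, not from Roland's schema. The values live in a flat
file next to the index instead of in the documents, so price changes don't
require re-indexing; the field can then drive sorting, boosting and function
queries, though it can't be searched on like a normal indexed field.)

  In schema.xml:

    <fieldType name="priceFile" class="solr.ExternalFileField" keyField="id" defVal="0"/>
    <field name="price" type="priceFile" indexed="false" stored="false"/>

  In the index's data directory, a file named external_price (re-read when the
  ExternalFileFieldReloader listener in solrconfig.xml fires on newSearcher):

    book-0001=12.99
    book-0002=7.50

  At query time, sort or return it through a function:

    sort=field(price) desc
    fl=id,title,price:field(price)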

On Wed, Feb 24, 2016 at 12:24 PM, Szűcs Roland
 wrote:
> Thanks Jeff your help,
>
> Can it work in production environment? Imagine when my customer initiate a
> query having 1 000 docs in the result set. I can not use the pagination of
> SOLR as the field which is the basis of the sort is not included in the
> schema for example the price. The customer wants the list in descending
> order of the price.
>
> So I have to get all the 1000 docids from solr and find the metadata of
> them in a sql database or in cache in best case. This is the way you
> suggested? Is it not too slow?
>
> Regards,
> Roland
>
> 2016-02-23 19:29 GMT+01:00 Jeff Wartes :
>
>>
>> My suggestion would be to split your problem domain. Use Solr exclusively
>> for search - index the id and only those fields you need to search on. Then
>> use some other data store for retrieval. Get the id’s from the solr
>> results, and look them up in the data store to get the rest of your fields.
>> This allows you to keep your solr docs as small as possible, and you only
>> need to update them when a *searchable* field changes.
>>
>> Every “update" in solr is a delete/insert. Even the "atomic update”
>> feature is just a shortcut for that. It requires stored fields because the
>> data from the stored fields gets copied into the new insert.
>>
>>
>>
>>
>>
>> On 2/22/16, 12:21 PM, "Roland Szűcs"  wrote:
>>
>> >Hi folks,
>> >
>> >We use SOLR 5.2.1. We have ebooks stored in SOLR. The majority of the
>> >fields do not change at all like content, author, publisher Only the
>> >price field changes frequently.
>> >
>> >We let the customers to make full text search so we indexed the content
>> >filed. Due to the frequency of the price updates we use the atomic update
>> >feature. As a requirement of the atomic updates we have to store all the
>> >fields even the content field which is 1MB/document and we did not want to
>> >store it just index it.
>> >
>> >As we wanted to update 100 documents with atomic update it took about 3
>> >minutes. Taking into account that our metadata /document is 1 Kb and our
>> >content field / document is 1MB we use 1000 more memory to accelerate the
>> >update process.
>> >
>> >I am almost 100% sure that we make something wrong.
>> >
>> >What is the best practice of the frequent updates when 99% part of a given
>> >document is constant forever?
>> >
>> >Thank in advance
>> >
>> >--
>> > Roland
>> Szűcs
>> > Connect
>> with
>> >me on Linkedin <
>> https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
>> >
>> >CEO Phone: +36 1 210 81 13
>> >Bookandwalk.hu 
>>
>
>
>
> --
>  Szűcs Roland
>  Ismerkedjünk
> meg a Linkedin 
> -en 
> Ügyvezető Telefon: +36 1 210 81 13
> Bookandwalk.hu 


Re: very slow frequent updates

2016-02-24 Thread Szűcs Roland
Thanks Jeff for your help,

Can it work in a production environment? Imagine my customer initiates a
query with 1,000 docs in the result set. I cannot use Solr's pagination, as
the field which is the basis of the sort is not included in the schema - for
example the price. The customer wants the list in descending order of price.

So I have to get all 1,000 doc ids from Solr and find their metadata in a SQL
database, or in a cache in the best case. Is this the way you suggested? Is it
not too slow?

Regards,
Roland

2016-02-23 19:29 GMT+01:00 Jeff Wartes :

>
> My suggestion would be to split your problem domain. Use Solr exclusively
> for search - index the id and only those fields you need to search on. Then
> use some other data store for retrieval. Get the id’s from the solr
> results, and look them up in the data store to get the rest of your fields.
> This allows you to keep your solr docs as small as possible, and you only
> need to update them when a *searchable* field changes.
>
> Every “update" in solr is a delete/insert. Even the "atomic update”
> feature is just a shortcut for that. It requires stored fields because the
> data from the stored fields gets copied into the new insert.
>
>
>
>
>
> On 2/22/16, 12:21 PM, "Roland Szűcs"  wrote:
>
> >Hi folks,
> >
> >We use SOLR 5.2.1. We have ebooks stored in SOLR. The majority of the
> >fields do not change at all like content, author, publisher Only the
> >price field changes frequently.
> >
> >We let the customers to make full text search so we indexed the content
> >filed. Due to the frequency of the price updates we use the atomic update
> >feature. As a requirement of the atomic updates we have to store all the
> >fields even the content field which is 1MB/document and we did not want to
> >store it just index it.
> >
> >As we wanted to update 100 documents with atomic update it took about 3
> >minutes. Taking into account that our metadata /document is 1 Kb and our
> >content field / document is 1MB we use 1000 more memory to accelerate the
> >update process.
> >
> >I am almost 100% sure that we make something wrong.
> >
> >What is the best practice of the frequent updates when 99% part of a given
> >document is constant forever?
> >
> >Thank in advance
> >
> >--
> > Roland
> Szűcs
> > Connect
> with
> >me on Linkedin <
> https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> >
> >CEO Phone: +36 1 210 81 13
> >Bookandwalk.hu 
>



-- 
Szűcs Roland
Connect with me on LinkedIn
<https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
CEO  Phone: +36 1 210 81 13
Bookandwalk.hu


Read payload in Solr

2016-02-24 Thread Andrea Roggerone
Hi all,
I am indexing the payload in Solr as advised in
https://lucidworks.com/blog/2014/06/13/end-to-end-payload-example-in-solr/
and I am also able to search for it.
What I want to do now is get the payload within my Solr custom function to do
some calculations; however, I can only see methods to get the field values,
which I obviously want to avoid since I already have the payload, and reading
from the postings list is better performance-wise.
Can you guys please point me to a resource that explains how to read the
payload, or share a code snippet with me? Thanks!!!
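
(In case it helps anyone later, a rough sketch of reading payloads through the
Lucene 5.x postings API from inside a custom ValueSource - the field name, the
term, and the float encoding are assumptions based on the lucidworks payload
example, not verified against any particular index:)

  import java.io.IOException;
  import org.apache.lucene.analysis.payloads.PayloadHelper;
  import org.apache.lucene.index.LeafReader;
  import org.apache.lucene.index.LeafReaderContext;
  import org.apache.lucene.index.PostingsEnum;
  import org.apache.lucene.index.Terms;
  import org.apache.lucene.index.TermsEnum;
  import org.apache.lucene.util.BytesRef;

  // called with the docid that the FunctionValues implementation receives
  private float payloadValue(LeafReaderContext readerContext, int targetDoc) throws IOException {
    LeafReader reader = readerContext.reader();
    Terms terms = reader.terms("payloadField");            // assumed field name
    if (terms == null) return 0f;
    TermsEnum termsEnum = terms.iterator();
    if (!termsEnum.seekExact(new BytesRef("someTerm"))) {  // assumed term
      return 0f;
    }
    PostingsEnum postings = termsEnum.postings(null, PostingsEnum.PAYLOADS);
    if (postings.advance(targetDoc) != targetDoc) return 0f;
    float value = 0f;
    for (int i = 0; i < postings.freq(); i++) {
      postings.nextPosition();                             // required before getPayload()
      BytesRef payload = postings.getPayload();
      if (payload != null) {
        // assumes floats encoded by DelimitedPayloadTokenFilter's float encoder
        value = PayloadHelper.decodeFloat(payload.bytes, payload.offset);
      }
    }
    return value;
  }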


Re: What search metrics are useful?

2016-02-24 Thread Emir Arnautovic

Hi Bill,
You can take a look at Sematext's search analytics 
(https://sematext.com/search-analytics). It provides some of metrics you 
mentioned, plus some additional (top queries, CTR, click stats, paging 
stats etc.). In combination with Sematext's performance metrics 
(https://sematext.com/spm) you can have full picture of your search 
infrastructure.


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 24.02.2016 04:07, William Bell wrote:

How do others look at search metrics?

1. Search conversion? Do you look at searches and if the user does not
click on a result, and reruns the search that would be a failure?

2. How to measure auto complete success metrics?

3. Facets/filters could be considered negative, since we did not find the
results that the user wanted, and now they are filtering - who to measure?

4. One easy metric is searches with 0 results. We could auto expand the geo
distance or ask the user "did you mean" ?

5. Another easy one would be tech performance: "time it takes in seconds to
get a result".

6. How to measure fuzzy? How do you know you need more synonyms? How to
measure?

7. How many searches it takes before the user clicks on a result?

Other ideas? Is there a video or presentation on search metrics that would
be useful?





Re: Query time de-boost

2016-02-24 Thread Binoy Dalal
If you were to apply a boost of less than 1, so something like 0.1, that
would reduce the score of the docs you want to de-boost.

On Wed, 24 Feb 2016, 15:17 Emir Arnautovic 
wrote:

> Hi Shamik,
> Is boosting others acceptable option to you, e.g.
> ContentGroup:"NonDeveloper"^100.
> Which query parser do you use?
>
> Regards,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
> On 23.02.2016 23:42, Shamik Bandopadhyay wrote:
> > Hi,
> >
> >I'm looking into the possibility of de-boosting a set of documents
> during
> > query time. In my application, when I search for e.g. "preferences", I
> want
> > to de-boost content tagged with ContentGroup:"Developer" or in other
> words,
> > push those content back in the order. Here's the catch. I've the
> following
> > weights.
> >
> > text^1.5 title^4 IndexTerm^2
> >
> > As you can see, Title has a higher weight.
> >
> > Now, a bunch of content tagged with ContentGroup:"Developer" consists of
> a
> > title like "Preferences.material" or "Preferences Property" or
> > "Preferences.graphics". The boost on title pushes these documents at the
> > top.
> >
> > What I'm looking is to see if there's a way deboost all documents that
> are
> > tagged with ContentGroup:"Developer" irrespective of the term occurrence
> is
> > text or title.
> >
> > Any pointers will be appreciated.
> >
> > Thanks,
> > Shamik
> >
>
-- 
Regards,
Binoy Dalal


Re: Query time de-boost

2016-02-24 Thread Emir Arnautovic

Hi Shamik,
Is boosting others acceptable option to you, e.g. 
ContentGroup:"NonDeveloper"^100.

Which query parser do you use?

Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 23.02.2016 23:42, Shamik Bandopadhyay wrote:

Hi,

   I'm looking into the possibility of de-boosting a set of documents during
query time. In my application, when I search for e.g. "preferences", I want
to de-boost content tagged with ContentGroup:"Developer" or in other words,
push those content back in the order. Here's the catch. I've the following
weights.

text^1.5 title^4 IndexTerm^2

As you can see, Title has a higher weight.

Now, a bunch of content tagged with ContentGroup:"Developer" consists of a
title like "Preferences.material" or "Preferences Property" or
"Preferences.graphics". The boost on title pushes these documents at the
top.

What I'm looking is to see if there's a way deboost all documents that are
tagged with ContentGroup:"Developer" irrespective of the term occurrence is
text or title.

Any pointers will be appreciated.

Thanks,
Shamik



Re: Reindexing in SOlr

2016-02-24 Thread Neeraj Bhatt
Hi kshitij

We are using the following configuration and it is working fine:



<dataConfig>
  <document>
    <entity name="sep" processor="SolrEntityProcessor"
            url="http://11.11.11.11:8983/solr/classify" query="*:*"
            fl="id,title,content,segment" wt="javabin"/>
  </document>
</dataConfig>

Please specify processor="SolrEntityProcessor" and also give fl (the fields
which you want saved in your new instance).

thanks

On Wed, Feb 24, 2016 at 2:31 PM, kshitij tyagi
 wrote:
> hi
>
> I am using following tag
>"*:*"/>  
>
> i am able to connect but indexing is not working. My solr have same versions
>
>
> On Wed, Feb 24, 2016 at 12:48 PM, Neeraj Bhatt 
> wrote:
>
>> Hi
>>
>> Can you give your data import tag details  tag in
>> db-data-config.xml
>> Also is your previuos and new solr have different versions ?
>>
>> Thanks
>>
>>
>>
>> On Wed, Feb 24, 2016 at 12:08 PM, kshitij tyagi
>>  wrote:
>> > Hi,
>> >
>> > I am following the following article
>> > https://wiki.apache.org/solr/HowToReindex
>> > to reindex the data using Solr itself as a datasource.
>> >
>> > Means one solr instance has all fields with stored true and
>> indexed=false.
>> > When I am using this instance as a datasource and indexing it on other
>> > instance data is not indexing.
>> >
>> > Giving error of version conflict. How can i resolve it.
>>


Re: Reindexing in SOlr

2016-02-24 Thread kshitij tyagi
Hi,

I am using the following tag:


I am able to connect but indexing is not working. Both my Solr instances have the same version.


On Wed, Feb 24, 2016 at 12:48 PM, Neeraj Bhatt 
wrote:

> Hi
>
> Can you give your data import tag details  tag in
> db-data-config.xml
> Also is your previuos and new solr have different versions ?
>
> Thanks
>
>
>
> On Wed, Feb 24, 2016 at 12:08 PM, kshitij tyagi
>  wrote:
> > Hi,
> >
> > I am following the following article
> > https://wiki.apache.org/solr/HowToReindex
> > to reindex the data using Solr itself as a datasource.
> >
> > Means one solr instance has all fields with stored true and
> indexed=false.
> > When I am using this instance as a datasource and indexing it on other
> > instance data is not indexing.
> >
> > Giving error of version conflict. How can i resolve it.
>


I have one small question that always intrigue me

2016-02-24 Thread Zara Parst
Hi everyone,

I really need your help, please read below.


If we have to run Solr in cloud mode, we are going to use ZooKeeper. Now, any
ZooKeeper client can connect to the ZooKeeper server. ZooKeeper has a facility
to protect znodes, but anyone can see a znode's ACL; the password may be
encrypted, but decrypting or guessing the password is not a big deal. As we
know, the password is SHA-encrypted, and there is also no limit on the number
of attempts to authorize against an ACL. So my point is: how do we safeguard
ZooKeeper?

I can guess few things

a. Don't reveal the IP of your ZooKeeper (security through obscurity)
b. iptables, which is also not a very good idea
c. what else??

My guess was that we could somehow protect the ZooKeeper server itself by
asking clients to authenticate themselves before they can make a connection
to the ensemble, even at the root (/) znode.

Please, please at least comment on this; I really need your help.
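
(One concrete piece of the puzzle, sketched as an assumption about a typical
setup rather than a full answer: ZooKeeper itself will still accept
connections, so blocking unknown clients remains a firewall job, but Solr can
at least create and read its znodes with digest ACLs so that an
unauthenticated client cannot read or modify them. In solr.in.sh that looks
roughly like the following - the class and property names are the ones
documented on the "ZooKeeper Access Control" page of the Solr Reference
Guide; the usernames and passwords are placeholders:)

  SOLR_ZK_CREDS_AND_ACLS="-DzkACLProvider=org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider \
    -DzkCredentialsProvider=org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider \
    -DzkDigestUsername=admin-user -DzkDigestPassword=CHANGEME-ADMIN-PASSWORD \
    -DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=CHANGEME-READONLY-PASSWORD"
  SOLR_OPTS="$SOLR_OPTS $SOLR_ZK_CREDS_AND_ACLS"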