Re: Problem with numeric math types and the dataimport handler
Sounds similar to https://issues.apache.org/jira/browse/SOLR-6165 which I fixed in 4.10. Can you try a newer release?

On Wed, May 20, 2015 at 6:51 AM, Shawn Heisey apa...@elyograg.org wrote:

An unusual problem is happening with the DIH on a field that is an unsigned BIGINT in the MySQL database. This is Solr 4.9.1 without SolrCloud, running on OpenJDK 7u79. During the actual import, everything is fine. The problem comes when I restart Solr and the transaction logs are replayed. I get the following exception for every document replayed:

WARN - 2015-05-19 18:52:44.461; org.apache.solr.update.UpdateLog$LogReplayer; REYPLAY_ERR: IOException reading log org.apache.solr.common.SolrException: ERROR: [doc=getty26025060] Error adding field 'file_size'='java.math.BigInteger:5934053' msg=For input string: java.math.BigInteger:5934053

I believe I need one of two things to solve this problem:

1) A connection parameter for the MySQL JDBC driver that will force the use of java.lang.* objects and exclude the java.math.* classes.
2) Writing the actual imported value into the transaction log rather than including the class name in the string representation.

Testing shows that the toString() method on BigInteger does *NOT* include the class name, so I am confused about why the class name is being recorded in the transaction log. For the first solution, I've been looking for a MySQL connection parameter to change the Java object types that get used, but so far I haven't found one. For the second, I should probably open an issue in Jira, but I wanted to run it by everyone before taking that step.

I have another index (built from a different database) where this isn't happening, because the MySQL column is *NOT* unsigned, which causes the JDBC driver to use java.lang.Long instead of java.math.BigInteger.

Thanks, Shawn

-- 
Regards, Shalin Shekhar Mangar.
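Until a release with the SOLR-6165 fix is an option, one possible workaround (a sketch, not something from the thread; the class name and package are made up) is a small DIH transformer that converts BigInteger values to Long before they reach the update chain, attached to the entity via transformer="com.example.BigIntegerToLongTransformer":

package com.example;

import java.math.BigInteger;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

/** Replaces BigInteger values with Long so the tlog never sees the class name. */
public class BigIntegerToLongTransformer extends Transformer {
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    for (Map.Entry<String, Object> entry : row.entrySet()) {
      if (entry.getValue() instanceof BigInteger) {
        // longValue() truncates values above 2^63-1, which is fine for file sizes
        entry.setValue(((BigInteger) entry.getValue()).longValue());
      }
    }
    return row;
  }
}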
Re: Deduplication
Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code that actually does the indexing.

This sounded like the perfect option ... until I read Jack's comment:

My understanding was that the distributed update processor is near the end of the chain, so that the running of user update processors occurs before the distribution step, but is that distribution to the leader, or distribution from the leader to the replicas of a shard?

That would pose some potential problems. Would a custom update processor make the solution cloud-safe?

Thx, - Bram
Re: Looking up arrays in a sub-entity
Personally, I see this as a limit of the dataimporthandler. It gets you started, but when your needs get at all complicated, it can't help you. I would encourage you to write your own indexing code. A little bit of code that reads over your database, sorts it out in the right way, and pushes it to Solr over HTTP POST will let you achieve what you are aiming for.

Upayavira

On Tue, May 19, 2015, at 08:51 PM, rumford wrote:

I have an entity which extracts records from a MySQL data source. One of the fields is meant to be a multi-value field, except this data source does not store the values. Rather, it stores their ids in a single column as a pipe-delimited string. The values themselves are in a separate table, in an entirely different database, on a different server. I have written a transformer to make an array out of this delimited string, but after that I'm at a loss. Can I iterate over an array in a sub-entity? I need to query that second data source for each of the IDs that I find in each record of the first data source. Other people who have asked similar questions have been able to solve their issue with a join, but in my case I cannot.
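To make the suggestion concrete, here is a rough SolrJ 5.x sketch (all JDBC URLs, table and column names are invented for illustration) that reads the master rows, resolves the pipe-delimited ids against the second database, and posts the assembled documents:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CustomIndexer {
  public static void main(String[] args) throws Exception {
    try (Connection master = DriverManager.getConnection("jdbc:mysql://host1/db1");
         Connection lookup = DriverManager.getConnection("jdbc:mysql://host2/db2");
         HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
      PreparedStatement values =
          lookup.prepareStatement("SELECT value FROM value_table WHERE id = ?");
      ResultSet rows = master.createStatement()
          .executeQuery("SELECT id, value_ids FROM master_table");
      while (rows.next()) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", rows.getString("id"));
        // resolve each id from the pipe-delimited column against the second DB
        for (String valueId : rows.getString("value_ids").split("\\|")) {
          values.setString(1, valueId);
          try (ResultSet v = values.executeQuery()) {
            if (v.next()) doc.addField("values", v.getString("value")); // multi-valued
          }
        }
        solr.add(doc);
      }
      solr.commit();
    }
  }
}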
Re: Solr query which return only those docs whose all tokens are from given list
Requesting Solr experts again to suggest some solutions to my above problem, as I am not able to solve this.

On Tue, May 12, 2015 at 11:04 AM, Naresh Yadav nyadav@gmail.com wrote:

Thanks Andrew, you got my problem precisely, but the solutions you suggested may not work for me. In my API I get only the list of authorized tags, i.e. [T1, T2, T3], and based on that only I need to construct my Solr query. So the first solution with NOT (T4 OR T5) will not work. In the real case the tag ids T1, T2 are UUIDs, so the range query also will not work, as I have no control over the order of these ids. Looking for more suggestions?

Thanks, Naresh

On Mon, May 11, 2015 at 10:05 PM, Andrew Chillrud achill...@opentext.com wrote:

Based on his example, it sounds like Naresh not only wants the tags field to contain at least one of the values [T1, T2, T3] but also wants to exclude documents that contain a tag other than T1, T2, or T3 (Doc3 should not be retrieved). If the set of possible values in the tags field is limited and known, you could use a NOT (or '-') clause to accomplish this. If there were 5 possible tag values:

tags:((T1 OR T2 OR T3) NOT (T4 OR T5))

However, this doesn't seem practical if the number of possible values is large or unlimited. Perhaps something could be done with range queries:

tags:((T1 OR T2 OR T3) NOT ([* TO T1} OR {T1 TO T2} OR {T3 TO *]))

however this would require whatever is constructing the query to be aware of the lexical ordering of the terms in the index. Maybe there are more elegant solutions, but I am not aware of them.

- Andy -

-----Original Message-----
From: sujitatgt...@gmail.com [mailto:sujitatgt...@gmail.com] On Behalf Of Sujit Pal
Sent: Monday, May 11, 2015 10:40 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr query which return only those docs whose all tokens are from given list

Hi Naresh, couldn't you just model this as an OR query, since your requirement is at least one (but can be more than one), i.e.: tags:T1 tags:T2 tags:T3

-sujit

On Mon, May 11, 2015 at 4:14 AM, Naresh Yadav nyadav@gmail.com wrote:

Hi all, also asked this here: http://stackoverflow.com/questions/30166116

For example, I have Solr docs in which a tags field is indexed:

Doc1 - tags:T1 T2
Doc2 - tags:T1 T3
Doc3 - tags:T1 T4
Doc4 - tags:T1 T2 T3

Query1: get all docs with tags:T1 AND tags:T3. This works and will give Doc2 and Doc4.
Query2: get all docs whose tags must all be among [T1, T2, T3]. Expected: Doc1, Doc2, Doc4.

How do I model Query2 in Solr? Please help me with this.
Re: Problem with numeric math types and the dataimport handler
On 5/20/2015 12:06 AM, Shalin Shekhar Mangar wrote:

Sounds similar to https://issues.apache.org/jira/browse/SOLR-6165 which I fixed in 4.10. Can you try a newer release?

I can't upgrade yet. I am using a plugin that hasn't been verified against anything newer than 4.9. When a new version becomes available, I will begin testing 5.x. The patch does look like it will fix the issue perfectly ... so I am very likely to patch 4.9.1 and build a custom war.

Thanks, Shawn
Re: Deduplication
On 19/05/15 14:47, Alessandro Benedetti wrote:

Hi Bram, what do you mean with: "I would like to provide the unique value myself, without having the deduplicator create a hash of field values"? This is not de-duplication, but simple document filtering based on a constraint. In the case you want de-duplication (which seemed to be the point of the very first part of the mail), here you can find a lot of info:

Not sure whether de-duplication is the right word for what I'm after; I essentially want a unique constraint on an arbitrary field. Without overwrite semantics, because I want Solr to tell me if a duplicate is sent to Solr. I was thinking that the de-duplication feature could accomplish this somehow.

- Bram
[solr 5.1] Looking for full text + collation search field
Hello, might anyone suggest a field type with which I can do both a full text search (i.e. there is an analyzer including a tokenizer) and apply a collation?

An example of what I want to do: there is a field composer to which I passed the value "Dvořák, Antonín". I want the following queries to match:

composer:(antonín dvořák)
composer:dvorak
composer:dvorak, antonin

The latter case is possible using a solr.ICUCollationField, but that type does not support an analyzer and consequently no tokenizer; thus it is not helpful. Unlike former versions of Solr, there do not seem to be CollationKeyFilters which you may hang into the analyzer of a solr.TextField ... so I am a bit at a loss as to how I get *both* a tokenizer and a collation at the same time.

Thanks for help, Björn
Re: Deduplication
What the Solr de-duplication offers you is to calculate a hash for each input document (based on a set of fields). You can then select two options:

- index everything; documents with the same signature will be equal
- avoid the overwriting of duplicates.

How the similarity hash is calculated is something you can play with and customise if needed. Having clarified that, do you think it can fit in some way, or are you definitely not talking about dedupe?

2015-05-20 8:37 GMT+01:00 Bram Van Dam bram.van...@intix.eu:

On 19/05/15 14:47, Alessandro Benedetti wrote:

Hi Bram, what do you mean with: "I would like to provide the unique value myself, without having the deduplicator create a hash of field values"? This is not de-duplication, but simple document filtering based on a constraint. In the case you want de-duplication (which seemed to be the point of the very first part of the mail), here you can find a lot of info:

Not sure whether de-duplication is the right word for what I'm after; I essentially want a unique constraint on an arbitrary field. Without overwrite semantics, because I want Solr to tell me if a duplicate is sent to Solr. I was thinking that the de-duplication feature could accomplish this somehow.

- Bram

-- 
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience - 1794 England
Re: Term Frequency Calculation - Clarification
Thanks Jack. In my case there is only one document - "Foo Foo is in bar". As per your comment, I should expect TF to be 2, but I am getting one. Is there a check where, if one match is a subset of the other, it is counted once? My class extends DefaultSimilarity.

Cheers
Ariya Bala S

On Wed, May 20, 2015 at 2:09 PM, Jack Krupansky jack.krupan...@gmail.com wrote:

Yes. tf is both 1 and 2 - tf is per document, which is 1 for the first document and 2 for the second document. See: http://lucene.apache.org/core/5_1_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

-- Jack Krupansky

On Wed, May 20, 2015 at 6:13 AM, ariya bala ariya...@gmail.com wrote:

Hi, I have made a custom class for scoring the similarity (TermFrequencyBiasedSimilarity). The score was deduced by considering just the TF part (achieved by setting IDF=1). The question is:

*Document content:* Foo Foo is in bar
*Search query:* Foo bar
*slop:* 3

With slop 3, there are two matches to the query: "Foo is in bar" and "Foo Foo is in bar". *Should the Term Frequency be 1 or 2? Also point to the explanation of the logic implemented in Lucene/Solr.*

-- 
Cheers
*Ariya *
Term Frequency Calculation - Clarification
Hi, I have made a custom class for scoring the similarity (TermFrequencyBiasedSimilarity). The score was deduced by considering just the TF part (achieved by setting IDF=1). The question is:

*Document content:* Foo Foo is in bar
*Search query:* Foo bar
*slop:* 3

With slop 3, there are two matches to the query: "Foo is in bar" and "Foo Foo is in bar". *Should the Term Frequency be 1 or 2? Also point to the explanation of the logic implemented in Lucene/Solr.*

-- 
Cheers
*Ariya *
Re: Term Frequency Calculation - Clarification
Yes. tf is both 1 and 2 - tf is per document, which is 1 for the first document and 2 for the second document. See: http://lucene.apache.org/core/5_1_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

-- Jack Krupansky

On Wed, May 20, 2015 at 6:13 AM, ariya bala ariya...@gmail.com wrote:

Hi, I have made a custom class for scoring the similarity (TermFrequencyBiasedSimilarity). The score was deduced by considering just the TF part (achieved by setting IDF=1). The question is:

*Document content:* Foo Foo is in bar
*Search query:* Foo bar
*slop:* 3

With slop 3, there are two matches to the query: "Foo is in bar" and "Foo Foo is in bar". *Should the Term Frequency be 1 or 2? Also point to the explanation of the logic implemented in Lucene/Solr.*

-- 
Cheers
*Ariya *
Problem using a function with a multivalued field
Hi everyone, I've been reading answers around this problem but I wanted to make sure that there is another way out of my problem. The thing is that the solution shouldn't be at index time, involve indexing a new field, or involve changing this multi-valued field to a single-valued one.

Problem: I need to run a custom function with some fields, but I see that it's not possible to get the value (the first value in this case) of a multivalued field. "title" is a multi-valued field. See: if(exists(title),strdist(title,"string1"),0). This throws the "can't use FieldCache on a multivalued field" error.

Solutions that don't work for me:

- Keep a copy of the value in a non-multi-valued field, using an update processor: this involves indexing a new field.
- Change the field to multiValued=false: this involves using a single-valued field. I will be indexing new data in the future and I need some fields to be multi-valued, but I also need to work with them.

Thanks in advance, I spent a lot of time on this without a solution. I'm using Solr 4.10.
When is too many fields in qf is too many?
Hi everyone,

My solution requires that users in group-A can only search against a set of fields-A, users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields, and each field name averages 15 characters long; in effect the data passed via qf will be over 20K characters.

Given the above, besides the fact that a search for "apple" translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if anything? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf.

I have considered creating a pseudo-field for each group and then using copyField into that group field. During search, I then can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group change dynamically (at least once a month), and when they do change, I have to re-index everything (this I have to avoid) to sync that group field.

I'm using qf with edismax and my Solr version is 5.1.

Thanks, Steve
Error on grouping result set
Hi, I am having some problems while grouping the result set. I have a Solr schema like this:

<fields>
  <field name="id" type="string" indexed="false" stored="true" required="true" />
  <field name="product" type="string" indexed="true" stored="true" required="true" />
  <field name="vendor" type="string" indexed="true" stored="true" required="true" />
  <field name="language" type="string" indexed="true" stored="true" required="true" />
  <field name="TotalInvoices" type="float" indexed="true" stored="true" required="true"/>
</fields>

I am querying the schema and the result is like this:

product,Vendor,Invoice
abc,vendor1,49206.758
abc,vendor2,35654.981
abc,vendor2,94861.258
abc,vendor3,990.96012
abc,vendor3,990.96012
abc,vendor3,990.9601

I want to group the result by the vendor field, so I post a query like this:

http://localhost:8983/solr/gettingstarted_shard2_replica2/select?q=abc&fl=product%2Cvendor%2CTotalInvoices&wt=json&indent=true&debugQuery=true&group=true&group.field=vendor

I am getting an error for this in the debug field:

error:{ msg:org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://10.192.17.110:7574/solr/gettingstarted_shard2_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard1_replica2, http://10.192.17.110:7574/solr/gettingstarted_shard1_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard2_replica2];, trace:org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://10.192.17.110:7574/solr/gettingstarted_shard2_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard1_replica2, http://10.192.17.110:7574/solr/gettingstarted_shard1_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard2_replica2]\n\tat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:342)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:1984)\n\tat org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:829)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:446)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:247)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:210)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:368)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)\n\tat org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)\n\tat org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)\n\tat org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)\n\tat org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)\n\tat org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)\n\tat org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)\n\tat java.lang.Thread.run(Thread.java:745)\nCaused by: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this
Re: Term Frequency Calculation - Clarification
Please ignore.

On Wed, May 20, 2015 at 2:45 PM, ariya bala ariya...@gmail.com wrote:

Thanks Jack. In my case there is only one document - "Foo Foo is in bar". As per your comment, I should expect TF to be 2, but I am getting one. Is there a check where, if one match is a subset of the other, it is counted once? My class extends DefaultSimilarity.

Cheers
Ariya Bala S

On Wed, May 20, 2015 at 2:09 PM, Jack Krupansky jack.krupan...@gmail.com wrote:

Yes. tf is both 1 and 2 - tf is per document, which is 1 for the first document and 2 for the second document. See: http://lucene.apache.org/core/5_1_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

-- Jack Krupansky

On Wed, May 20, 2015 at 6:13 AM, ariya bala ariya...@gmail.com wrote:

Hi, I have made a custom class for scoring the similarity (TermFrequencyBiasedSimilarity). The score was deduced by considering just the TF part (achieved by setting IDF=1). The question is:

*Document content:* Foo Foo is in bar
*Search query:* Foo bar
*slop:* 3

With slop 3, there are two matches to the query: "Foo is in bar" and "Foo Foo is in bar". *Should the Term Frequency be 1 or 2? Also point to the explanation of the logic implemented in Lucene/Solr.*

-- 
Cheers
*Ariya *
Re: Solr query which return only those docs whose all tokens are from given list
Use an update processor to add the number of tags per doc, e.g. check CountFieldValuesUpdateProcessorFactory:

Doc1 - tags:T1 T2 ; tagNum: 2
Doc2 - tags:T1 T3 ; tagNum: 2
Doc3 - tags:T1 T4 ; tagNum: 2
Doc4 - tags:T1 T2 T3 ; tagNum: 3

Then when you search for tags you need to get the number of tags matched per document. It can be done with the recently implemented constant-score operator ^=, e.g. tags:(T1^=1 T2^=1 T3^=1). Then we subtract the expected number of tags per doc:

q=sub(query($tagsAct),tagNum)&tagsAct=tags:(T1^=1 T2^=1 T3^=1)

and then cut off the documents with not enough coverage:

q={!frange l=0}sub(query($tagsAct),tagNum)&tagsAct=tags:(T1^=1 T2^=1 T3^=1)

On Wed, May 20, 2015 at 10:10 AM, Naresh Yadav nyadav@gmail.com wrote:

Requesting Solr experts again to suggest some solutions to my above problem, as I am not able to solve this.

On Tue, May 12, 2015 at 11:04 AM, Naresh Yadav nyadav@gmail.com wrote:

Thanks Andrew, you got my problem precisely, but the solutions you suggested may not work for me. In my API I get only the list of authorized tags, i.e. [T1, T2, T3], and based on that only I need to construct my Solr query. So the first solution with NOT (T4 OR T5) will not work. In the real case the tag ids T1, T2 are UUIDs, so the range query also will not work, as I have no control over the order of these ids. Looking for more suggestions?

Thanks, Naresh

On Mon, May 11, 2015 at 10:05 PM, Andrew Chillrud achill...@opentext.com wrote:

Based on his example, it sounds like Naresh not only wants the tags field to contain at least one of the values [T1, T2, T3] but also wants to exclude documents that contain a tag other than T1, T2, or T3 (Doc3 should not be retrieved). If the set of possible values in the tags field is limited and known, you could use a NOT (or '-') clause to accomplish this. If there were 5 possible tag values:

tags:((T1 OR T2 OR T3) NOT (T4 OR T5))

However, this doesn't seem practical if the number of possible values is large or unlimited. Perhaps something could be done with range queries:

tags:((T1 OR T2 OR T3) NOT ([* TO T1} OR {T1 TO T2} OR {T3 TO *]))

however this would require whatever is constructing the query to be aware of the lexical ordering of the terms in the index. Maybe there are more elegant solutions, but I am not aware of them.

- Andy -

-----Original Message-----
From: sujitatgt...@gmail.com [mailto:sujitatgt...@gmail.com] On Behalf Of Sujit Pal
Sent: Monday, May 11, 2015 10:40 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr query which return only those docs whose all tokens are from given list

Hi Naresh, couldn't you just model this as an OR query, since your requirement is at least one (but can be more than one), i.e.: tags:T1 tags:T2 tags:T3

-sujit

On Mon, May 11, 2015 at 4:14 AM, Naresh Yadav nyadav@gmail.com wrote:

Hi all, also asked this here: http://stackoverflow.com/questions/30166116

For example, I have Solr docs in which a tags field is indexed:

Doc1 - tags:T1 T2
Doc2 - tags:T1 T3
Doc3 - tags:T1 T4
Doc4 - tags:T1 T2 T3

Query1: get all docs with tags:T1 AND tags:T3. This works and will give Doc2 and Doc4.
Query2: get all docs whose tags must all be among [T1, T2, T3]. Expected: Doc1, Doc2, Doc4.

How do I model Query2 in Solr? Please help me with this.

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
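For reference, the final query can also be issued from SolrJ; a minimal sketch (core name and tag values are placeholders, and it assumes tagNum was populated at index time as described above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CoveredTagsQuery {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
      // keep docs where (matched tags - total tags) >= 0, i.e. every tag matched
      SolrQuery q = new SolrQuery("{!frange l=0}sub(query($tagsAct),tagNum)");
      // ^=1 gives each clause a constant score of 1, so query($tagsAct)
      // evaluates to the number of authorized tags the document matched
      q.set("tagsAct", "tags:(T1^=1 T2^=1 T3^=1)");
      QueryResponse rsp = solr.query(q);
      System.out.println(rsp.getResults().getNumFound());
    }
  }
}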
Re: Deduplication
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam bram.van...@intix.eu wrote:

Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code that actually does the indexing.

This sounded like the perfect option ... until I read Jack's comment:

My understanding was that the distributed update processor is near the end of the chain, so that the running of user update processors occurs before the distribution step, but is that distribution to the leader, or distribution from the leader to the replicas of a shard?

That would pose some potential problems. Would a custom update processor make the solution cloud-safe?

Starting with Solr 5.1, you have the ability to specify an update processor on the fly for a request, and you can even control whether it is to be executed before any distribution happens or before it is actually indexed on the replica. E.g. you can specify processor=xyz,MyCustomUpdateProc in the request to have processor xyz run first, then MyCustomUpdateProc, and then the default update processor chain (which will also distribute the doc to the leader or from the leader to a replica). This also means that such processors will not be executed on the replicas at all. You can also specify post-processor=xyz,MyCustomUpdateProc to have xyz and MyCustomUpdateProc run on each replica (including the leader) right before the doc is indexed (i.e. just before RunUpdateProcessor).

Unfortunately, due to an oversight, this feature hasn't been documented well, which is something I'll fix. See https://issues.apache.org/jira/browse/SOLR-6892 for more details.

Thx, - Bram

-- 
Regards, Shalin Shekhar Mangar.
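A minimal SolrJ sketch of what Shalin describes (the processor name "mycustomproc" is a placeholder for whatever name the factory is registered under in solrconfig.xml):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class PreDistribUpdate {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient solr = new CloudSolrClient("localhost:9983")) {
      solr.setDefaultCollection("collection1");
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "1");
      UpdateRequest req = new UpdateRequest();
      req.add(doc);
      req.setParam("processor", "mycustomproc");        // runs once, before distribution
      // req.setParam("post-processor", "mycustomproc"); // or on every replica, pre-RunUpdate
      req.process(solr);
      solr.commit();
    }
  }
}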
Re: Block Join Query update documents, how to do it correctly?
On Thu, May 14, 2015 at 12:01 AM, Tom Devel deve...@gmail.com wrote:

I tried to repost the whole modified document (the parent and ALL of its children as one file), and it seems to work on a small toy example, but of course I cannot be sure for a larger instance with thousands of documents, and I would like to know if this is the correct way to go or not.

Absolutely. It is the only way to go so far.

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: When is too many fields in qf is too many?
Also, is this 1500 fields that are always populated, or are there really a larger number of different record types, each with a relatively small number of fields populated in a particular document?

Answer: This is a large number of different record types, each with a relatively small number of fields in a particular document. Some documents will have 5 fields, others may have 50 (that's the average).

Could you try to point to a real-world example of where your use case might apply, so we can relate to it?

I'm indexing data off a DB; all the fields of each record are indexed. The application is complex such that it has views, and users belong to 1 or more views. Users can move between views, and views can change over time. A user in view-A can see certain fields, while a user in view-B can see some other fields. So, when a user issues a search, I have to limit which fields that search is executed against. And like I said, because users can move between views, and views can change over time, the list of fields isn't static. This is why I have to pass the list of fields for each search based on the user's current view.

I hope this gives context to the problem I'm trying to solve and describes why I'm using qf and why the list of fields may be long, because there is a case in which a user may belong to N - 1 views.

Steve

On Wed, May 20, 2015 at 11:14 AM, Jack Krupansky jack.krupan...@gmail.com wrote:

The uf parameter is used to specify which fields a user may query against - the qf parameter specifies the set of fields that an unfielded query term must be queried against. The user is free to specify fielded query terms, like field1:term1 OR field2:term2. So, which use case are you really talking about? Could you try to point to a real-world example of where your use case might apply, so we can relate to it?

Generally, I would say that a Solr document/collection should have no more than low hundreds of fields. It's not that you absolutely can't have more, or absolutely can't have 5,000 or more, but simply that you will be asking for trouble, for example with the cost of comprehending, maintaining, and communicating your solution with others, including this mailing list for support. What specifically pushed you to have documents with 1500 fields? Also, is this 1500 fields that are always populated, or are there really a larger number of different record types, each with a relatively small number of fields populated in a particular document?

-- Jack Krupansky

On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote:

Hi everyone,

My solution requires that users in group-A can only search against a set of fields-A, users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields, and each field name averages 15 characters long; in effect the data passed via qf will be over 20K characters.

Given the above, besides the fact that a search for "apple" translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if anything? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf.

I have considered creating a pseudo-field for each group and then using copyField into that group field. During search, I then can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group change dynamically (at least once a month), and when they do change, I have to re-index everything (this I have to avoid) to sync that group field.

I'm using qf with edismax and my Solr version is 5.1.

Thanks, Steve
Re: Looking up arrays in a sub-entity
I was able to get what I wanted by processing the column in question as massaged text, so that it was a comma-delimited series of IDs, and then passing that to a sub-entity query that went something like: SELECT value FROM othertable WHERE id IN (${master.ids}). It's slow, but I think it's getting the job done. For better performance I would probably script something to feed Solr instead of using the DIH.
[ANN] Relevant Search -- The Book on Search Relevance
Hello fellow Solr users,

We're writing a book on applied Lucene search relevance -- Relevant Search (http://manning.com/turnbull). We want to teach you to improve the quality of your Solr search results! We're trying to bridge the academic side of Information Retrieval from books like Intro. to IR (http://www-nlp.stanford.edu/IR-book/) and Lucene-based search engines like Solr and Elasticsearch.

Manning is offering discount code *39turnbull* to the Solr mailing list readers to get 39% off all formats (http://manning.com/turnbull). You can preview parts/ideas of our book here:

http://java.dzone.com/articles/solr-and-elasticsearch
http://opensourceconnections.com/blog/2015/05/15/relevance-data-modeling/
https://medium.com/@softwaredoug/search-is-eating-the-world-1c3dbdfe9b83

Our chapters seem to be taking the form of 1/3 broad relevance-tuning philosophy, 2/3 useful examples. While we build a lot of our examples with Elasticsearch, we're also working to bridge them to Solr in the final book so it can apply to both audiences. After all, almost every idea is translatable between two search engines that share the same Lucene core. If you get into the book, we'd be open to your ideas (or even help :-P) on how to best do this from the community.

Happy Searching!
-Doug Turnbull and John Berryman

-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Relevant Search http://manning.com/turnbull from Manning Publications
Re: When is too many fields in qf is too many?
Thanks Shawn. I have already switched to using POST because I need to send a long list of data in qf. My question isn't about POST / GET; it's about Solr and Lucene having to deal with such a long list of fields. Here is the text of my question reposted:

Given the above, besides the fact that a search for "apple" translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if anything? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

Steve

On Wed, May 20, 2015 at 10:52 AM, Shawn Heisey apa...@elyograg.org wrote:

On 5/20/2015 6:27 AM, Steven White wrote:

My solution requires that users in group-A can only search against a set of fields-A, users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields, and each field name averages 15 characters long; in effect the data passed via qf will be over 20K characters. Given the above, besides the fact that a search for "apple" translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if anything? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

You have two choices when queries become that large. One is to increase the max HTTP header size in the servlet container. In most containers, webservers, and proxy servers, this defaults to 8192 bytes. This approach works very well, but will not scale to extremely large sizes. I have done this on my indexes, because I regularly have queries in the 20K range, but I do not expect them to get very much larger than this.

The other option is to switch to sending a POST instead of a GET. The default max POST size that Solr sets is 2MB, which is plenty for just about any query, and can be increased easily to much larger sizes. If you are using SolrJ, switching to POST is very easy ... you'd need to research how if you're using another framework.

Thanks, Shawn
Re: Suggestion on field type
Thank you all... You are all experts... I will go with double as this seems to be more feasible.

Regards

On Tue, May 19, 2015 at 7:26 PM, Walter Underwood wun...@wunderwood.org wrote:

A field type based on BigDecimal could be useful, but that would be a fair amount more work. Double is usually sufficient for big data analysis, especially if you are doing simple aggregates (which is most of what Solr can do). If you want to do something fancier, you'll need a database, not a search engine. As I usually do, I'll recommend MarkLogic, which is pretty awesome stuff.

Solr would not be in my top handful of solutions for big data analysis. Personally, I'd stuff it all in JSON in Amazon S3 and run map-reduce against it. If you need to do something like that, you could store a JSON blob in Solr with the exact values, and use approximate fields to narrow things down. Of course, MarkLogic has a graceful interface to Hadoop.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On May 19, 2015, at 4:09 PM, Erick Erickson erickerick...@gmail.com wrote:

Well, double is all you've got, so that's what you have to work with. _Every_ float is an approximation when you get out to some number of decimal places, so you don't really have any choice. Of course it'll affect the result. The question is whether it affects the result enough to matter, which is application-specific.

Best, Erick

On Tue, May 19, 2015 at 12:05 PM, Vishal Swaroop vishal@gmail.com wrote:

Also 10481.5711458735456*79* indexes to 10481.571145873546 using double:

<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0" omitNorms="false"/>

On Tue, May 19, 2015 at 2:57 PM, Vishal Swaroop vishal@gmail.com wrote:

Thanks Erick... I can ignore the trailing zeros. I am indexing data from a Vertica database... Though *double* is very close, SOLR indexes 14 digits after the decimal, e.g. the actual db value has 15 digits after the decimal, i.e. 249.81735425382405*2*, and SOLR indexes 14 digits after the decimal, i.e. 249.81735425382405. As these values will be used for big data analysis, I am wondering if it might impact the result.

<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0" omitNorms="false"/>

Any suggestions?

Regards

On Tue, May 19, 2015 at 1:41 PM, Erick Erickson erickerick...@gmail.com wrote:

Why do you want to keep trailing zeros? The original input is preserved in the stored portion and will be returned if you specify the field in your fl list. I'm assuming here that you're looking at the actual indexed terms, and don't really understand why the trailing zeros are important. Do not use strings.

Best, Erick

On Tue, May 19, 2015 at 10:22 AM, Vishal Swaroop vishal@gmail.com wrote:

Thank you John and Jack... Looks like double is much closer... it removes trailing zeros...

a) Is there a way to keep trailing zeros? With double, 194.846189733028000 indexes to 194.846189733028:

<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0" omitNorms="false"/>

b) If I use String, will there be an issue doing range queries? With float, 277.677836785372000 indexes to 277.67783:

<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0" omitNorms="false"/>

On Tue, May 19, 2015 at 11:56 AM, Jack Krupansky jack.krupan...@gmail.com wrote:

double (solr.TrieDoubleField) gives more precision. See: https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/schema/TrieDoubleField.html

-- Jack Krupansky

On Tue, May 19, 2015 at 11:27 AM, Vishal Swaroop vishal@gmail.com wrote:

Please suggest which numeric field type to use so that I can get the complete value, e.g. the value in the database is 194.846189733028000. If I index it as float, SOLR indexes it as 194.84619, whereas I need the complete value, i.e. 194.846189733028000. I will also be doing range queries on this field.

<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<field name="value" type="float" indexed="true" stored="true" multiValued="false" />

Regards
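The behaviour in the thread can be reproduced outside Solr, since it comes from IEEE 754 doubles (roughly 15-17 significant decimal digits) rather than from Solr itself; the printed values below match the ones reported above:

public class DoublePrecision {
  public static void main(String[] args) {
    // trailing zeros are not part of the value, so toString() drops them
    System.out.println(Double.parseDouble("194.846189733028000")); // 194.846189733028
    // an 18-digit value rounds to the nearest representable double
    System.out.println(Double.parseDouble("249.817354253824052")); // 249.81735425382405
    // float only carries ~7 significant digits
    System.out.println(Float.parseFloat("277.677836785372000"));   // 277.67783
  }
}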
Re: scoreMode ToParentBlockJoinQuery
Hello, here is the patch: https://issues.apache.org/jira/browse/SOLR-5882

On Tue, May 12, 2015 at 1:11 PM, StrW_dev r.j.bamb...@structweb.nl wrote:

Hi, is it possible to configure the scoreMode of the parent block join query parser (ToParentBlockJoinQuery)? It seems it's set to None, while I would require Max in this case. What I want is to filter on child documents, but still use the relevance/boost of these child documents in the final score.

Gr.

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: When is too many fields in qf is too many?
On 5/20/2015 9:24 AM, Steven White wrote:

I have already switched to using POST because I need to send a long list of data in qf. My question isn't about POST / GET; it's about Solr and Lucene having to deal with such a long list of fields. Here is the text of my question reposted: Given the above, besides the fact that a search for "apple" translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if anything? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

You may need to increase maxBooleanClauses beyond the default of 1024. There will be a message in the log if that is required. Note that such an increase must happen on EVERY config you have, or one of them may set it back to the 1024 default -- it's a global JVM-wide config.

Large complex queries are usually slow, requiring more memory and CPU than simple queries, but if you have the resources, Solr will handle it just fine.

Thanks, Shawn
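For SolrJ users, switching the query to POST is a one-argument change; a small sketch (field list and core name are placeholders for the real ~1500 generated names):

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class LargeQfSearch {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
      List<String> groupFields = Arrays.asList("field_one", "field_two"); // ~1500 in practice
      SolrQuery q = new SolrQuery("apple");
      q.set("defType", "edismax");
      q.set("qf", String.join(" ", groupFields)); // a 20K value is fine in a POST body
      solr.query(q, SolrRequest.METHOD.POST);     // avoids the HTTP header size limit
    }
  }
}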
Re: solr 5.x on glassfish/tomcat instead of jetty
Shawn, I agree with you, but some of the decisions in the corporate world are handed down from higher powers/pay grades, who do not always like to hear counter-arguments. For example, this is the same reason why govt/federal restrict tech folks to only using certified DBs/app servers like Oracle, WSAD, etc. (Not to say that govt teams are not using SOLR; I know the Library of Congress etc. use it.) Sometimes the decision is above my pay grade, more so when the firm is not a core technology firm. I would rather find a way than be labeled an anarchist; after all, anything is possible with software, right!!?? ;-) Hope you have already viewed "The Expert" video on YouTube :-)

Thanks, Ravi Kiran Bhaskar

On Wed, May 20, 2015 at 11:21 AM, Shawn Heisey apa...@elyograg.org wrote:

On 5/20/2015 9:07 AM, Ravi Solr wrote:

I have read that solr 5.x has moved away from a deployable WAR architecture to a runnable Java application architecture. Our infrastructure/standards folks are adamant about not running SOLR on Jetty (as we are about to upgrade from 4.7.2 to 5.1). Any ideas on how I can make it run on Glassfish, or at least on Tomcat? And do I have to watch for any gotchas regarding the different containers or the upgrade itself? Would love to hear from people who have already treaded down that path.

I really need to finish the wiki page on this topic. As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for logging jars, and server/resources for the logging configuration. Consult your container's documentation to learn where to place these things.

At some point in the future, such deployments will no longer be possible, which is why the docs say you can't do it, even though you can. The project is preparing users for the eventual reality with a documentation change.

I'm wondering ... if Jetty is good enough for the Google App Engine, why isn't it good enough for your infrastructure standards? It is the only container that gets testing ... I assure you that there are no tests in the Solr source code that make sure Glassfish works.

Thanks, Shawn
Re: Edismax
I highly recommend using boost= in edismax rather than bq=. The multiplicative boost is stable with a wide range of scores. bq is additive and has problems with high or low scores.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On May 20, 2015, at 1:04 PM, John Blythe j...@curvolabs.com wrote:

Hi all, I've been fine-tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user-generated searching is far secondary for our purposes. I've gotten our results to go from confidence scores of ~40-60 for good results to the 700s. So far so good.

Edismax seems like it has some promising features, but I'm wondering if it'll be very helpful for our purposes. The only thing that jumps out immediately to me is the bq ability, in which one of our non-primary fields is used as a means of boosting. In other words, when using our three fields—manufacturer, part number, and description—to find a part, we could bq the category or size field to help eliminate false positives from appearing.

Is there anything else that you think I should look into regarding edismax that could be helpful to our end game? Thanks for any ideas!
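For illustration, a sketch contrasting the two parameters (the field names and functions are invented, not from this thread): boost= multiplies the main relevancy score by the function's value, while bq= adds the score of an extra query on top.

import org.apache.solr.client.solrj.SolrQuery;

public class BoostParams {
  public static void main(String[] args) {
    // multiplicative: scales the main score, stable across score ranges
    SolrQuery multiplicative = new SolrQuery("drill bit");
    multiplicative.set("defType", "edismax");
    multiplicative.set("qf", "manufacturer partnumber description");
    multiplicative.set("boost", "if(exists(query($cat)),2,1)");
    multiplicative.set("cat", "category:tools");

    // additive: bolts an extra score on top, sensitive to absolute score sizes
    SolrQuery additive = new SolrQuery("drill bit");
    additive.set("defType", "edismax");
    additive.set("qf", "manufacturer partnumber description");
    additive.set("bq", "category:tools^5");
  }
}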
Re: Edismax
could i do that the same way as my mention of using bq? the docs aren't very rich in their example or explanation of boost= here: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

thanks!

-- 
*John Blythe*
Product Manager Lead Developer
251.605.3071 | j...@curvolabs.com
www.curvolabs.com
58 Adams Ave
Evansville, IN 47713

On Wed, May 20, 2015 at 3:13 PM, Walter Underwood wun...@wunderwood.org wrote:

I highly recommend using boost= in edismax rather than bq=. The multiplicative boost is stable with a wide range of scores. bq is additive and has problems with high or low scores.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On May 20, 2015, at 1:04 PM, John Blythe j...@curvolabs.com wrote:

Hi all, I've been fine-tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user-generated searching is far secondary for our purposes. I've gotten our results to go from confidence scores of ~40-60 for good results to the 700s. So far so good.

Edismax seems like it has some promising features, but I'm wondering if it'll be very helpful for our purposes. The only thing that jumps out immediately to me is the bq ability, in which one of our non-primary fields is used as a means of boosting. In other words, when using our three fields—manufacturer, part number, and description—to find a part, we could bq the category or size field to help eliminate false positives from appearing.

Is there anything else that you think I should look into regarding edismax that could be helpful to our end game? Thanks for any ideas!
Upgrading question
We've been using Solr a bit now for a year or so; 4.6 is the oldest version of Solr we've deployed. We're currently working through the process we'll use to upgrade to 5.1, an upgrade we need for the new facet.stats capabilities.

Reading the Major Changes document, it indicates that there is no longer support for Lucene/Solr 3.x and earlier indexes. It also indicates that you should use the IndexUpgrader included with Solr 4.10 if you're unsure. We've only ever deployed 4.6 and 4.9 Solr installations. Am I safe to assume that we can skip the optimize step and just upgrade to Solr 5.1, perhaps optimizing after we've done that?

Thanks,
Craig Longman
C++ Developer
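If there is any doubt about old segments, the IndexUpgrader tool mentioned in the docs can be run against each index directory before the move; a trivial wrapper (the index path is an argument, and it needs the 4.10 lucene-core jar on the classpath):

public class UpgradeIndex {
  public static void main(String[] args) throws Exception {
    // same as: java -cp lucene-core-4.10.4.jar org.apache.lucene.index.IndexUpgrader -verbose <dir>
    org.apache.lucene.index.IndexUpgrader.main(new String[] { "-verbose", args[0] });
  }
}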
Re: Error on grouping result set
Possibly you changed the field type at some point without completely blowing away your index and re-indexing from scratch? Based on:

unexpected docvalues type SORTED_SET for field 'vendor' (expected=SORTED)

because you can't group on multi-valued fields, which is I think what's going on here. Either that, or you have some replicas that aren't coming up, based on:

No live SolrServers available to handle this request

Best, Erick

On Wed, May 20, 2015 at 5:55 AM, Abhijit Deka abhijit.d...@rocketmail.com wrote:

Hi, I am having some problems while grouping the result set. I have a Solr schema like this:

<fields>
  <field name="id" type="string" indexed="false" stored="true" required="true" />
  <field name="product" type="string" indexed="true" stored="true" required="true" />
  <field name="vendor" type="string" indexed="true" stored="true" required="true" />
  <field name="language" type="string" indexed="true" stored="true" required="true" />
  <field name="TotalInvoices" type="float" indexed="true" stored="true" required="true"/>
</fields>

I am querying the schema and the result is like this:

product,Vendor,Invoice
abc,vendor1,49206.758
abc,vendor2,35654.981
abc,vendor2,94861.258
abc,vendor3,990.96012
abc,vendor3,990.96012
abc,vendor3,990.9601

I want to group the result by the vendor field, so I post a query like this:

http://localhost:8983/solr/gettingstarted_shard2_replica2/select?q=abc&fl=product%2Cvendor%2CTotalInvoices&wt=json&indent=true&debugQuery=true&group=true&group.field=vendor

I am getting an error for this in the debug field:

error:{ msg:org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://10.192.17.110:7574/solr/gettingstarted_shard2_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard1_replica2, http://10.192.17.110:7574/solr/gettingstarted_shard1_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard2_replica2];, trace:org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://10.192.17.110:7574/solr/gettingstarted_shard2_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard1_replica2, http://10.192.17.110:7574/solr/gettingstarted_shard1_replica1, http://10.192.17.110:8983/solr/gettingstarted_shard2_replica2]\n\tat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:342)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:1984)\n\tat org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:829)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:446)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:247)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:210)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:368)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)\n\tat org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)\n\tat org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)\n\tat org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)\n\tat
Re: Edismax
I believe that boost is a superset of the bq functionality.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On May 20, 2015, at 1:16 PM, John Blythe j...@curvolabs.com wrote:

could i do that the same way as my mention of using bq? the docs aren't very rich in their example or explanation of boost= here: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

thanks!

-- 
*John Blythe*
Product Manager Lead Developer
251.605.3071 | j...@curvolabs.com
www.curvolabs.com
58 Adams Ave
Evansville, IN 47713

On Wed, May 20, 2015 at 3:13 PM, Walter Underwood wun...@wunderwood.org wrote:

I highly recommend using boost= in edismax rather than bq=. The multiplicative boost is stable with a wide range of scores. bq is additive and has problems with high or low scores.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On May 20, 2015, at 1:04 PM, John Blythe j...@curvolabs.com wrote:

Hi all, I've been fine-tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user-generated searching is far secondary for our purposes. I've gotten our results to go from confidence scores of ~40-60 for good results to the 700s. So far so good.

Edismax seems like it has some promising features, but I'm wondering if it'll be very helpful for our purposes. The only thing that jumps out immediately to me is the bq ability, in which one of our non-primary fields is used as a means of boosting. In other words, when using our three fields—manufacturer, part number, and description—to find a part, we could bq the category or size field to help eliminate false positives from appearing.

Is there anything else that you think I should look into regarding edismax that could be helpful to our end game? Thanks for any ideas!
Re: ConfigSets and SolrCloud
What is it? There isn't one except zkcli and variants ;). Things are all automatic once you get the configs _to_ Zookeeper, but pushing the config sets up is a manual process. The usual process is to keep the configs in some VCS somewhere so they're safe, do the usual checkout/edit/checkin, and at some point push them to ZK. Then they will be automatically distributed to all the relevant Solr nodes whenever the cores get reloaded, often done with the collections API RELOAD command.

Of course you can cheat in dev environments, at least in IntelliJ, by downloading the Zookeeper plugin that allows you to edit the files directly on Zookeeper, but that's certainly NOT recommended for production, of course.

Best, Erick

On Wed, May 20, 2015 at 10:57 AM, Jim.Musil jim.mu...@target.com wrote:

Hi, I need a little clarification on configSets in solr 5.x. According to this page: https://cwiki.apache.org/confluence/display/solr/Config+Sets I can create named configSets to be shared by other cores. If I create them using this method AND am operating in SolrCloud mode, will it automatically upload these named config sets to zookeeper?

Thanks!
Jim Musil
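(The usual zkcli "variant" for this is upconfig; with a Solr 5.1 layout that looks something like: sh server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir /path/to/configset/conf -confname myconf, where the host and paths are examples to adapt.)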
Re: [solr 5.1] Looking for full text + collation search field
Hi Bjorn,

solr.ICUCollationField is useful for *sorting*, and you cannot sort on tokenized fields. Your example looks like diacritics-insensitive search. Please see: ASCIIFoldingFilterFactory

Ahmet

On Wednesday, May 20, 2015 2:53 PM, Björn Keil deeph...@web.de wrote:

Hello, might anyone suggest a field type with which I can do both a full text search (i.e. there is an analyzer including a tokenizer) and apply a collation?

An example of what I want to do: there is a field composer to which I passed the value "Dvořák, Antonín". I want the following queries to match:

composer:(antonín dvořák)
composer:dvorak
composer:dvorak, antonin

The latter case is possible using a solr.ICUCollationField, but that type does not support an analyzer and consequently no tokenizer; thus it is not helpful. Unlike former versions of Solr, there do not seem to be CollationKeyFilters which you may hang into the analyzer of a solr.TextField ... so I am a bit at a loss as to how I get *both* a tokenizer and a collation at the same time.

Thanks for help, Björn
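A quick Lucene 5.x demonstration of the chain Ahmet is pointing at (a sketch; in schema.xml the equivalent would be a solr.TextField analyzer with StandardTokenizerFactory, LowerCaseFilterFactory and ASCIIFoldingFilterFactory):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class FoldingDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")      // StandardTokenizerFactory
        .addTokenFilter("lowercase")    // LowerCaseFilterFactory
        .addTokenFilter("asciifolding") // ASCIIFoldingFilterFactory
        .build();
    TokenStream ts = analyzer.tokenStream("composer", "Dvořák, Antonín");
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term); // prints: dvorak, then antonin
    }
    ts.end();
    ts.close();
  }
}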
Re: Edismax
On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of that specific query. You cannot compare scores from one query to scores from another query done with different parameters, especially if it's using a different query parser, and expect those numbers to mean anything. The actual number doesn't matter ... what matters is how the documents score compared to *each other* -- what order the documents have within a single result. Thanks, Shawn
Re: Edismax
thanks guys. it doesn't depend on absolute scores, but it is leaning on the score as a confidence metric of sorts. we've found some good standard deviation info when plotting out the accuracy of the top result and the relative score with the analyzers currently in production and hope to strengthen that confidence when it's right and lower it when it's wrong with the latest fine-tuning. so far so good, too. regarding the new question itself, i'd replied to this thread w more info but had the system kick it back to me for some reason. maybe i replied too much too soon? anyway, it ended up being a result of my query still being in the primary query box instead of moving it to the q.alt box. i'd thought the alt was indicative of it being an *alternate* query strictly speaking. changed it to house the query and voila! thanks- -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 4:23 PM, Walter Underwood wun...@wunderwood.org wrote: I was going to post the same advice. If your approach depends on absolute scores, you need to change your approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 2:09 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of that specific query. You cannot compare scores from one query to scores from another query done with different parameters, especially if it's using a different query parser, and expect those numbers to mean anything. The actual number is doesn't matter ... what matters is how the documents score compared to *each other* -- what order the documents have within a single result. Thanks, Shawn
Edismax
Hi all, I've been fine tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user generated searching is far secondary for our purposes. I've gotten our results to go from confidence scores of ~40-60 for good results to the 700s. So far so good. Edismax seems like it has some promising features, but I'm wondering if it'll be very helpful for our purposes. The only thing that jumps out immediately to me is the bq ability in which one of our non-primary fields is used as a means of boosting. In other words, when using our three fields—manufacturer, part number, and description—to find a part, we could bq the category or size field to help eliminate false positives from appearing. Is there anything else that you think I should look into regarding edismax that could be helpful to our end game? Thanks for any ideas!
Re: Problem using a function with a multivalued field
bq: Keep a copy of the value into a non-multi-valued field, using an update processor: This involves indexing a new field Why can't you do this? You can't re-index the data perhaps? It's by far the easiest solution. Best, Erick On Wed, May 20, 2015 at 2:45 AM, Fernando Agüero fjagu...@gmail.com wrote: Hi everyone, I’ve been reading answers around this problem but I wanted to make sure that there is another way out of my problem. The thing is that the solution shouldn’t be on index-time, involve indexing a new field or changing this multi-valued field to a single-valued one. Problem: I need to run a custom function with some fields but I see that it’s not possible to get the value (first value in this case) of a multivalued field. “title” is a multi-valued field. See: if(exists(title),strdist(title,"string1"),0). This throws the “can’t use FieldCache on a multivalued field” error. Solutions that don't work for me: - Keep a copy of the value into a non-multi-valued field, using an update processor: This involves indexing a new field. - Change the field to multiValued=false: This involves using a single-valued field. I will be indexing new data in the future and I need some fields to be multi-valued but I also need to work with them. Thanks in advance, I spent a lot of time with this without a solution. I’m using Solr 4.10.
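For reference, the update-processor route Erick is pointing at needs no changes to the indexing client; a sketch of such a chain in solrconfig.xml (chain and field names are hypothetical) that copies only the first title value into a single-valued field:

    <updateRequestProcessorChain name="first-title">
      <processor class="solr.CloneFieldUpdateProcessorFactory">
        <str name="source">title</str>
        <str name="dest">title_first</str>
      </processor>
      <processor class="solr.FirstFieldValueUpdateProcessorFactory">
        <str name="fieldName">title_first</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

Functions such as strdist can then run against title_first. It does mean indexing one extra field, which is the trade-off Fernando is hoping to avoid, but it only requires a reindex, not client changes.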
Re: Edismax
cool, will check into it some more via testing -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 3:22 PM, Walter Underwood wun...@wunderwood.org wrote: I believe that boost is a superset of the bq functionality. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 1:16 PM, John Blythe j...@curvolabs.com wrote: could i do that the same way as my mention of using bq? the docs aren't very rich in their example or explanation of boost= here: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser thanks! -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 3:13 PM, Walter Underwood wun...@wunderwood.org wrote: I highly recommend using boost= in edismax rather than bq=. The multiplicative boost is stable with a wide range of scores. bq is additive and has problems with high or low scores. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 1:04 PM, John Blythe j...@curvolabs.com wrote: Hi all, I've been fine tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user generated searching is far secondary for our purposes. I've gotten our results to go from confidence scores of ~40-60 for good results to the 700s. So far so good. Edismax seems like it has some promising features, but I'm wondering if it'll be very helpful for our purposes. The only thing that jumps out immediately to me is the bq ability in which one of our non-primary fields is used as a means of boosting. In other words, when using our three fields—manufacturer, part number, and description—to find a part, we could bq the category or size field to help eliminate false positives from appearing. Is there anything else that you think I should look into regarding edismax that could be helpful to our end game? Thanks for any ideas!
Re: Edismax
I was going to post the same advice. If your approach depends on absolute scores, you need to change your approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 2:09 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of that specific query. You cannot compare scores from one query to scores from another query done with different parameters, especially if it's using a different query parser, and expect those numbers to mean anything. The actual number is doesn't matter ... what matters is how the documents score compared to *each other* -- what order the documents have within a single result. Thanks, Shawn
Re: Edismax
new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 3:27 PM, John Blythe j...@curvolabs.com wrote: cool, will check into it some more via testing -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 3:22 PM, Walter Underwood wun...@wunderwood.org wrote: I believe that boost is a superset of the bq functionality. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 1:16 PM, John Blythe j...@curvolabs.com wrote: could i do that the same way as my mention of using bq? the docs aren't very rich in their example or explanation of boost= here: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser thanks! -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 3:13 PM, Walter Underwood wun...@wunderwood.org wrote: I highly recommend using boost= in edismax rather than bq=. The multiplicative boost is stable with a wide range of scores. bq is additive and has problems with high or low scores. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 1:04 PM, John Blythe j...@curvolabs.com wrote: Hi all, I've been fine tuning our current Solr implementation the last week or two to get more precise results. We are trying to get our implementation accurate enough to serve as a lightweight machine learning (obviously a misnomer) implementation of sorts. Actual user generated searching is far secondary for our purposes. I've gotten our results to go from confidence scores of ~40-60 for good results to the 700s. So far so good. Edismax seems like it has some promising features, but I'm wondering if it'll be very helpful for our purposes. The only thing that jumps out immediately to me is the bq ability in which one of our non-primary fields is used as a means of boosting. In other words, when using our three fields—manufacturer, part number, and description—to find a part, we could bq the category or size field to help eliminate false positives from appearing. Is there anything else that you think I should look into regarding edismax that could be helpful to our end game? Thanks for any ideas!
Re: Need help with Nested docs situation
Data scale and request rate should decide between block joins, plain joins, and field collapsing. On Thu, Apr 30, 2015 at 1:07 PM, roySolr royrutten1...@gmail.com wrote: Hello, I have a situation and i'm a little bit stuck on the way how to fix it. For example the following data structure: *Deal* All Coca Cola 20% off *Products* Coca Cola light Coca Cola Zero 1L Coca Cola Zero 20CL Coca Cola 1L When somebody search to Cola discount i want the result of the deal with related products. Solution #1: I could index it with nested docs(solr 4.9). But the problem is when a product has some changes(let's say Zero gets a new name Extra Light) i have to re-index every deal with these products. Solution #2: I could make 2 collections, one with deals and one with products. A Product will get a parentid(dealid). Then i have to do 2 queries to get the information? When i have a resultpage with 10 deals i want to preview the first 2 products. That means a lot of queries but it's doesn't have the update problem from solution #1. Does anyone have a good solution for this? Thanks, any help is appreciated. Roy -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
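If the block-join route is chosen, the query side is compact. A hedged sketch, assuming each deal is indexed as a parent with its products nested as child documents, and a hypothetical doc_type field marks the parents:

    q={!parent which="doc_type:deal"}name:(coca cola)

The which clause must match all parent documents in the index. As Roy notes, the cost of this design is that any change to a child means reindexing the whole parent/child block.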
Re: When is too many fields in qf is too many?
Yeah a copyField into one could be a good space/time tradeoff. It can be more manageable to use an all field for both relevancy and performance, if you can handle the duplication of data. You could set tie=1.0, which effectively sums all the matches instead of picking the best match. You'll still have cases where one field's score might just happen to be far off of another, and thus dominating the summation. But something easy to try if you want to keep playing with dismax. -Doug On Wed, May 20, 2015 at 2:56 PM, Steven White swhite4...@gmail.com wrote: Hi Doug, Your blog write up on relevancy is very interesting, I didn't know this. Looks like I have to go back to my drawing board and figure out an alternative solution: somehow get those group-based-fields data into a single field using copyField. Thanks Steve On Wed, May 20, 2015 at 11:17 AM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a winner takes all point of view to search. Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here http://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/ I'm about to win the blashphemer merit badge, but ad-hoc all-field like searching over many fields is actually a good use case for Elasticsearch's cross field queries. https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fields_queries.html http://opensourceconnections.com/blog/2015/03/19/elasticsearch-cross-field-search-is-a-lie/ It wouldn't be hard (and actually a great feature for the project) to get the Lucene query associated with cross field search into Solr. You could easily write a plugin to integrate it into a query parser: https://github.com/elastic/elasticsearch/blob/master/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java Hope that helps -Doug -- *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com Author: Relevant Search http://manning.com/turnbull from Manning Publications This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such. On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields and each field name averages 15 characters long, in effect the data passed via qf will be over 20K characters. Given the above, beside the fact that a search for apple translating to a 20K characters passing over the network, what else within Solr and Lucene I should be worried about if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc. If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then use copyField into that group. During search, I than can qf against that one field. 
Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field. I'm using qf with edismax and my Solr version is 5.1. Thanks Steve -- *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com Author: Relevant Search http://manning.com/turnbull from Manning Publications This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
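A sketch of the copyField idea together with Doug's tie suggestion (field and type names are hypothetical; text_general is assumed to exist in the schema):

    <field name="allText" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <copyField source="manufacturer" dest="allText"/>
    <copyField source="partnumber" dest="allText"/>
    <copyField source="description" dest="allText"/>

Then either query the single field (qf=allText), or keep the per-field qf and add tie=1.0 so edismax sums the per-field scores instead of taking only the best match.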
Re: Help to index nested document
I'm absolutely sure that you need to group them externally in the indexer eg like a child VALUES entity in DataImportHandler. On Mon, May 11, 2015 at 9:52 PM, Vishal Swaroop vishal@gmail.com wrote: Need your valuable inputs... I am indexing data from database (one table) which is in this example format : id name value 1 Joe 102724904 2 Joe 100996643 - id is primary/ unique key - there can be same name but different value - If I try name as unique key then SOLR removes duplicate and indexes 1 document - I am getting the result in this format... Is there as way I can index data in a way so that I can value can be child for name... response: { numFound: 2, start: 0, docs: [ { id: 1, name: Joe, value: [ 102724904 ] }, { id: 2, name: Joe, value: [ 100996643 ] }... Expected format : docs: [ { name: Joe, value: [ 102724904, 100996643 ] } -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
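A minimal SolrJ sketch of the external grouping Mikhail means (URL and the choice of name as the key are hypothetical): aggregate the rows per name before indexing, then emit one document whose multivalued field carries all the values:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "Joe");         // name becomes the unique key: one doc per name
    doc.addField("name", "Joe");
    doc.addField("value", 102724904);  // repeated addField calls fill a multivalued field
    doc.addField("value", 100996643);
    server.add(doc);
    server.commit();

The value field must be declared multiValued="true" in the schema for this to index.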
Re: Solr Cloud: No live SolrServers available
Seems like the attachements get stripped off. Anyways, here is the 4.7 log on startup INFO - 2015-05-20 10:35:45.786; org.eclipse.jetty.server.Server; jetty-8.1.10.v20130312 INFO - 2015-05-20 10:35:45.804; org.eclipse.jetty.deploy.providers.ScanningAppProvider; Deployment monitor C:\apps\solr\solr-4.7.2\example\contexts at interval 0 INFO - 2015-05-20 10:35:45.811; org.eclipse.jetty.deploy.DeploymentManager; Deployable added: C:\apps\solr\solr-4.7.2\example\contexts\solr-jetty-context.xml INFO - 2015-05-20 10:35:47.405; org.eclipse.jetty.webapp.StandardDescriptorProcessor; NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet INFO - 2015-05-20 10:35:47.460; org.apache.solr.servlet.SolrDispatchFilter; SolrDispatchFilter.init() INFO - 2015-05-20 10:35:47.472; org.apache.solr.core.SolrResourceLoader; JNDI not configured for solr (NoInitialContextEx) INFO - 2015-05-20 10:35:47.473; org.apache.solr.core.SolrResourceLoader; solr home defaulted to 'solr/' (could not find system property or JNDI) INFO - 2015-05-20 10:35:47.473; org.apache.solr.core.SolrResourceLoader; new SolrResourceLoader for directory: 'solr/' INFO - 2015-05-20 10:35:47.579; org.apache.solr.core.ConfigSolr; Loading container configuration from C:\apps\solr\solr-4.7.2\example\solr\solr.xml INFO - 2015-05-20 10:35:47.674; org.apache.solr.core.CorePropertiesLocator; Config-defined core root directory: C:\apps\solr\solr-4.7.2\example\solr INFO - 2015-05-20 10:35:47.680; org.apache.solr.core.CoreContainer; New CoreContainer 1930610653 INFO - 2015-05-20 10:35:47.680; org.apache.solr.core.CoreContainer; Loading cores into CoreContainer [instanceDir=solr/] INFO - 2015-05-20 10:35:47.691; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting socketTimeout to: 0 INFO - 2015-05-20 10:35:47.691; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting urlScheme to: null INFO - 2015-05-20 10:35:47.695; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting connTimeout to: 0 INFO - 2015-05-20 10:35:47.695; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting maxConnectionsPerHost to: 20 INFO - 2015-05-20 10:35:47.696; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting corePoolSize to: 0 INFO - 2015-05-20 10:35:47.696; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting maximumPoolSize to: 2147483647 INFO - 2015-05-20 10:35:47.697; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting maxThreadIdleTime to: 5 INFO - 2015-05-20 10:35:47.697; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting sizeOfQueue to: -1 INFO - 2015-05-20 10:35:47.697; org.apache.solr.handler.component.HttpShardHandlerFactory; Setting fairnessPolicy to: false INFO - 2015-05-20 10:35:47.931; org.apache.solr.logging.LogWatcher; SLF4J impl is org.slf4j.impl.Log4jLoggerFactory INFO - 2015-05-20 10:35:47.932; org.apache.solr.logging.LogWatcher; Registering Log Listener [Log4j (org.slf4j.impl.Log4jLoggerFactory)] INFO - 2015-05-20 10:35:47.933; org.apache.solr.core.CoreContainer; Host Name: INFO - 2015-05-20 10:35:47.946; org.apache.solr.cloud.SolrZkServer; STARTING EMBEDDED STANDALONE ZOOKEEPER SERVER at port 9987 INFO - 2015-05-20 10:35:48.447; org.apache.solr.core.ZkContainer; Zookeeper client=localhost:9987 INFO - 2015-05-20 10:35:48.489; org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect to ZooKeeper INFO - 2015-05-20 10:35:48.500; org.apache.solr.common.cloud.ConnectionManager; Watcher 
org.apache.solr.common.cloud.ConnectionManager@2bc25a1d name:ZooKeeperConnection Watcher:localhost:9987 got event WatchedEvent state:SyncConnected type:None path:null path:null type:None INFO - 2015-05-20 10:35:48.500; org.apache.solr.common.cloud.ConnectionManager; Client is connected to ZooKeeper INFO - 2015-05-20 10:35:48.516; org.apache.solr.common.cloud.ZkStateReader; Updating cluster state from ZooKeeper... INFO - 2015-05-20 10:35:49.529; org.apache.solr.cloud.ZkController; Register node as live in ZooKeeper:/live_nodes/10.1.172.231:8987_solr INFO - 2015-05-20 10:35:49.536; org.apache.solr.cloud.ZkController; Found a previous node that still exists while trying to register a new live node /live_nodes/10.1.172.231:8987_solr - removing existing node to create another. INFO - 2015-05-20 10:35:49.537; org.apache.solr.common.cloud.SolrZkClient; makePath: /live_nodes/10.1.172.231:8987_solr INFO - 2015-05-20 10:35:49.537; org.apache.solr.common.cloud.ZkStateReader$3; Updating live nodes... (0) INFO - 2015-05-20 10:35:49.544; org.apache.solr.common.cloud.ZkStateReader$3; Updating live nodes... (1) INFO - 2015-05-20 10:35:49.581; org.apache.solr.common.cloud.SolrZkClient; makePath: /configs/myapp47/schema.xml INFO - 2015-05-20 10:35:49.596; org.apache.solr.common.cloud.SolrZkClient; makePath: /configs/myapp47/solrconfig.xml INFO - 2015-05-20 10:35:49.636; org.apache.solr.core.CorePropertiesLocator; Looking for core
ConfigSets and SolrCloud
Hi, I need a little clarification on configSets in solr 5.x. According to this page: https://cwiki.apache.org/confluence/display/solr/Config+Sets I can create named configSets to be shared by other cores. If I create them using this method AND am operating in SolrCloud mode, will it automatically upload these named config sets to zookeeper? Thanks! Jim Musil
Re: When is too many fields in qf is too many?
Hi Doug, Your blog write up on relevancy is very interesting, I didn't know this. Looks like I have to go back to my drawing board and figure out an alternative solution: somehow get those group-based-fields data into a single field using copyField. Thanks Steve On Wed, May 20, 2015 at 11:17 AM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a winner takes all point of view to search. Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here http://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/ I'm about to win the blashphemer merit badge, but ad-hoc all-field like searching over many fields is actually a good use case for Elasticsearch's cross field queries. https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fields_queries.html http://opensourceconnections.com/blog/2015/03/19/elasticsearch-cross-field-search-is-a-lie/ It wouldn't be hard (and actually a great feature for the project) to get the Lucene query associated with cross field search into Solr. You could easily write a plugin to integrate it into a query parser: https://github.com/elastic/elasticsearch/blob/master/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java Hope that helps -Doug -- *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com Author: Relevant Search http://manning.com/turnbull from Manning Publications This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such. On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields and each field name averages 15 characters long, in effect the data passed via qf will be over 20K characters. Given the above, beside the fact that a search for apple translating to a 20K characters passing over the network, what else within Solr and Lucene I should be worried about if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc. If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then use copyField into that group. During search, I than can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field. I'm using qf with edismax and my Solr version is 5.1. Thanks Steve
Re: When is too many fields in qf is too many?
Thanks for calling out maxBooleanClauses. The current default of 1024 has not caused me any issues (so far) in my testing. However, as you probably saw in Doug Turnbull's reply, it looks like my relevance will suffer. Steve On Wed, May 20, 2015 at 11:42 AM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 9:24 AM, Steven White wrote: I have already switched to using POST because I need to send a long list of data in qf. My question isn't about POST / GET, it's about Solr and Lucene having to deal with such long list of fields. Here is the text of my question reposted: Given the above, beside the fact that a search for apple translating to a 20K characters passing over the network, what else within Solr and Lucene I should be worried about if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc. You may need to increase maxBooleanClauses beyond the default of 1024. There will be a message in the log if that is required. Note that such an increase must happen on EVERY config you have, or one of them may set it back to the 1024 default -- it's a global JVM-wide config. Large complex queries are usually slow, requiring more memory and CPU than simple queries, but if you have the resources, Solr will handle it just fine. Thanks, Shawn
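For reference, the knob Shawn is talking about is a single element in each core's solrconfig.xml, inside the <query> section (the value here is arbitrary):

    <maxBooleanClauses>4096</maxBooleanClauses>

As he notes, it behaves as a JVM-wide setting: the last core to load effectively wins, so every config on the node must carry the same value.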
Re: solr 5.x on glassfish/tomcat instead of jetty
Shawn Heisey apa...@elyograg.org wrote: I'm wondering ... if Jetty is good enough for the Google App Engine, why isn't it good enough for your infrastructure standards? Replace Jetty vs. Glassfish with Linux vs. Windows, Eclipse vs. Idea, emacs vs. vi, Java vs. C#... There are many reasons for a corporation to prefer one product over another. One common one is the wish to support as few different platforms as possible: Better the devil you know. We're still on Solr 4.x and deploy it in a tomcat, as that is what Operations prefer to use. From their perspective, Solr is just another thing to run among all the other WARs we throw at them. We will switch away from tomcat when upgrading to Solr 5, but our upgrade has been delayed so far (partly) because of that change. This is a recurring discussion. A list of the merits drawbacks of going WAR-less (or more to the point: Require Solr to be run as an application instead of in a generic container) might be an idea? - Toke Eskildsen
Re: Edismax
John: The spam filter is very aggressive. Try changing the type to plain text rather than rich text or html... Best, Erick On Wed, May 20, 2015 at 2:35 PM, John Blythe j...@curvolabs.com wrote: thanks guys. it doesn't depend on absolute scores, but it is leaning on the score as a confident metric of sorts. we've found some good standard deviation info when plotting out the accuracy of the top result and the relative score with the analyzers currently in production and hope to strengthen that confidence when it's right and lower it when it's wrong with the latest fine-tuning. so far so good, too. regarding the new question itself, i'd replied to this thread w more info but had the system kick it back to me for some reason. maybe i replied too much too soon? anyway, it ended up being a result of my query still being in the primary query box instead of moving it to the q.alt box. i'd thought the alt was indicative of it being an *alternate* query strictly speaking. changed it to house the query and voila! thanks- -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 4:23 PM, Walter Underwood wun...@wunderwood.org wrote: I was going to post the same advice. If your approach depends on absolute scores, you need to change your approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 2:09 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of that specific query. You cannot compare scores from one query to scores from another query done with different parameters, especially if it's using a different query parser, and expect those numbers to mean anything. The actual number is doesn't matter ... what matters is how the documents score compared to *each other* -- what order the documents have within a single result. Thanks, Shawn
Re: Edismax
Good call thank you On Wed, May 20, 2015 at 5:15 PM, Erick Erickson erickerick...@gmail.com wrote: John: The spam filter is very aggressive. Try changing the type to plain text rather than rich text or html... Best, Erick On Wed, May 20, 2015 at 2:35 PM, John Blythe j...@curvolabs.com wrote: thanks guys. it doesn't depend on absolute scores, but it is leaning on the score as a confident metric of sorts. we've found some good standard deviation info when plotting out the accuracy of the top result and the relative score with the analyzers currently in production and hope to strengthen that confidence when it's right and lower it when it's wrong with the latest fine-tuning. so far so good, too. regarding the new question itself, i'd replied to this thread w more info but had the system kick it back to me for some reason. maybe i replied too much too soon? anyway, it ended up being a result of my query still being in the primary query box instead of moving it to the q.alt box. i'd thought the alt was indicative of it being an *alternate* query strictly speaking. changed it to house the query and voila! thanks- -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 4:23 PM, Walter Underwood wun...@wunderwood.org wrote: I was going to post the same advice. If your approach depends on absolute scores, you need to change your approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 2:09 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of that specific query. You cannot compare scores from one query to scores from another query done with different parameters, especially if it's using a different query parser, and expect those numbers to mean anything. The actual number is doesn't matter ... what matters is how the documents score compared to *each other* -- what order the documents have within a single result. Thanks, Shawn
Re: Edismax
On 5/20/2015 3:35 PM, John Blythe wrote: regarding the new question itself, i'd replied to this thread w more info but had the system kick it back to me for some reason. maybe i replied too much too soon? anyway, it ended up being a result of my query still being in the primary query box instead of moving it to the q.alt box. i'd thought the alt was indicative of it being an *alternate* query strictly speaking. changed it to house the query and voila! As Erick said, it may have been classified as spam and discarded. His advice of using plain text instead of HTML or rich text is one of the top things to try. If you actually received a bounce message, that bounce should have information about why it was rejected. The q.alt parameter is an alternate query, in *lucene* parser syntax, that dismax or edismax will execute when the q parameter is empty or missing. It is quite common to use q.alt=*:* in the handler defaults so that if you omit the q parameter or send an empty string, you get all docs. If there is a non-empty q parameter, q.alt is ignored. Thanks, Shawn
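The handler-defaults pattern Shawn describes, as it would sit in solrconfig.xml (using the usual /select handler as the example):

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="q.alt">*:*</str>
      </lst>
    </requestHandler>

With this in place, a request with no q at all returns every document, while any non-empty q is parsed by edismax and q.alt is ignored.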
Re: Reindex of document leaves old fields behind
Well, let's see the code. Standard updates should replace the previous docs; reindexing the same unique ID with fewer fields should show fewer fields. So something's weird here. Just for yucks, though, do issue a query on some of the unique ids in question; I'd be curious whether you get more than one back, which would tell us something. Did you push your schema up to Zookeeper and reload (or restart) your collection before re-indexing things? And are you sure the documents are actually getting indexed and that the update is succeeding? (check your Solr logs probably here). Best, Erick On Wed, May 20, 2015 at 5:12 PM, tuxedomoon dancolem...@yahoo.com wrote: The uniqueKey value is the same. The new documents contain fewer fields than the already indexed ones. Could this cause the updates to be treated as atomic? With the persisting fields treated as un-updated? Routing should be implicit since the collection was created using numShards. Many request for the same document with cache busting produce the same unwanted fields, so I doubt the correct one is hiding somewhere. I can also see the timestamp going up with each reindex.
Reindex of document leaves old fields behind
I'm reindexing Mongo docs into SolrCloud. The new docs have had a few fields removed so upon reindexing those fields should be gone in Solr. They are not. So the result is a new doc merged with an old doc rather than a replacement which is what I need. I do not know whether the issue is with my SolrJ client, Solr config or something else.
Re: Reindex of document leaves old fields behind
The uniqueKey value is the same. The new documents contain fewer fields than the already indexed ones. Could this cause the updates to be treated as atomic? With the persisting fields treated as un-updated? Routing should be implicit since the collection was created using numShards. Many requests for the same document with cache busting produce the same unwanted fields, so I doubt the correct one is hiding somewhere. I can also see the timestamp going up with each reindex.
Re: solr 5.x on glassfish/tomcat instead of jetty
On 5/20/15, 8:21 AM, Shawn Heisey wrote: As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for logging jars, and server/resources for the logging configuration. Consult your container's documentation to learn where to place these things. At some point in the future, such deployments will no longer be possible, While we are still on this subject, I am aware there has been an anti-WAR movement in tech, but I don't quite understand where it is coming from. Can someone point me to some website summarizing why WARs are bad? Thanks.
Re: solr 5.x on glassfish/tomcat instead of jetty
Never mind. I found that thread. Sorry for the noise. On 5/20/15, 5:56 PM, TK Solr wrote: On 5/20/15, 8:21 AM, Shawn Heisey wrote: As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for logging jars, and server/resources for the logging configuration. Consult your container's documentation to learn where to place these things. At some point in the future, such deployments will no longer be possible, While we are still at this subject, I have been aware there has been an anti-WAR movement in the tech but I don't quite understand where this movement is coming from. Can someone point me to some website summarizing why WARs are bad? Thanks.
SolrCloud Leader Election
My SolrCloud cluster isn't reassigning the collection leaders from downed cores--the downed cores are still listed as the leaders. The cluster has been in this state for a few hours, and the logs continue to report "No registered leader was found after waiting for 4000ms". Is there a way to force it to reassign the leader? I'm running SolrCloud 5.0. I have 7 Solr nodes, 3 Zookeeper nodes, and 3739 collections. Thanks, Ryan
Re: SolrCloud delete by query performance
GC is operating the way I think it should but I am lacking memory. I am just surprised because indexing is performing fine (documents going in) but deletions are really bad (documents coming out). Is it possible these deletes are hitting many segments, each of which I assume must be re-built? And if there isn't much slack memory laying around to begin with, there's a bunch of contention/swap? Thanks Shawn! On Wed, May 20, 2015 at 4:50 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 5:41 PM, Ryan Cutter wrote: I have a collection with 1 billion documents and I want to delete 500 of them. The collection has a dozen shards and a couple replicas. Using Solr 4.4. Sent the delete query via HTTP: http://hostname:8983/solr/my_collection/update?stream.body= deletequerysource:foo/query/delete Took a couple minutes and several replicas got knocked into Recovery mode. They eventually came back and the desired docs were deleted but the cluster wasn't thrilled (high load, etc). Is this expected behavior? Is there a better way to delete documents that I'm missing? That's the correct way to do the delete. Before you'll see the change, a commit must happen in one way or another. Hopefully you already knew that. I believe that your setup has some performance issues that are making it very slow and knocking out your Solr nodes temporarily. The most common root problems with SolrCloud and indexes going into recovery are: 1) Your heap is enormous but your garbage collection is not tuned. 2) You don't have enough RAM, separate from your Java heap, for adequate index caching. With a billion documents in your collection, you might even be having problems with both. Here's a wiki page that includes some info on both of these problems, plus a few others: http://wiki.apache.org/solr/SolrPerformanceProblems Thanks, Shawn
Re: Upgrading question
Yep. Solr/Lucene strives for one major revision backwards compatibility. So any 5x should be able to read any index produced with 4x, but no index produced with 3x. Best, Erick On Wed, May 20, 2015 at 2:44 PM, Craig Longman clong...@iconect.com wrote: We've been using Solr a bit now for a year or so, 4.6 is the oldest version of Solr we've deployed. We're currently working through the process we'll use to upgrade to 5.1, an upgrade we need for the new facet.stats capabilities. Reading the Major Changes document, it indicates that there is no longer support for Lucene/Solr 3.x and earlier indexes. It also indicates that you should use the IndexUpgrader included with Solr 4.10 if you're unsure. We've only ever deployed 4.6, and 4.9 Solr installations. Am I safe to assume that we can skip the optimize step and just upgrade to Solr 5.1, perhaps optimizing after we've done that? Thanks, Craig Longman C++ Developer This message and any attachments are intended only for the use of the addressee and may contain information that is privileged and confidential. If the reader of the message is not the intended recipient or an authorized representative of the intended recipient, you are hereby notified that any dissemination of this communication is strictly prohibited. If you have received this communication in error, notify the sender immediately by return email and delete the message and any attachments from your system.
Re: Edismax
A few things: Scores aren't confidence metrics, they are relative rankings, in relation to a single resultset, that's all. Secondly for edismax, boost does multiplicative boosting (whatever function you provide, the score is multiplied by that), whereas bf does additive boosting. Upayavira On Wed, May 20, 2015, at 11:15 PM, Erick Erickson wrote: John: The spam filter is very aggressive. Try changing the type to plain text rather than rich text or html... Best, Erick On Wed, May 20, 2015 at 2:35 PM, John Blythe j...@curvolabs.com wrote: thanks guys. it doesn't depend on absolute scores, but it is leaning on the score as a confident metric of sorts. we've found some good standard deviation info when plotting out the accuracy of the top result and the relative score with the analyzers currently in production and hope to strengthen that confidence when it's right and lower it when it's wrong with the latest fine-tuning. so far so good, too. regarding the new question itself, i'd replied to this thread w more info but had the system kick it back to me for some reason. maybe i replied too much too soon? anyway, it ended up being a result of my query still being in the primary query box instead of moving it to the q.alt box. i'd thought the alt was indicative of it being an *alternate* query strictly speaking. changed it to house the query and voila! thanks- -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 20, 2015 at 4:23 PM, Walter Underwood wun...@wunderwood.org wrote: I was going to post the same advice. If your approach depends on absolute scores, you need to change your approach. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On May 20, 2015, at 2:09 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 2:54 PM, John Blythe wrote: new question re edismax: when i turn it on (in solr admin) my score goes wayy down. from 772 to 4.9. what in the edismax query parser would account for that huge nosedive? Scores are 100% relative, and the number only has meaning in the context of that specific query. You cannot compare scores from one query to scores from another query done with different parameters, especially if it's using a different query parser, and expect those numbers to mean anything. The actual number is doesn't matter ... what matters is how the documents score compared to *each other* -- what order the documents have within a single result. Thanks, Shawn
Re: Reindex of document leaves old fields behind
On 5/20/2015 4:43 PM, tuxedomoon wrote: I'm reindexing Mongo docs into SolrCloud. The new docs have had a few fields removed so upon reindexing those fields should be gone in Solr. They are not. So the result is a new doc merged with an old doc rather than a replacement which is what I need. I do not know whether the issue is with my SolrJ client, Solr config or something else. Do those documents have the same value in the uniqueKey field? It must be an exact match -- a deviation in upper/lower case will be treated as a new document. If they do have identical information in the uniqueKey field, then there are a few possible problems: Are you indexing full documents, or are you doing Atomic Updates? An atomic update is by definition a change, not a replacement ... so unless that change includes deleting fields, they would not be affected. https://wiki.apache.org/solr/Atomic_Updates If your collection has multiple shards, then this paragraph may apply: What is the router set to on the collection? If it is implicit then you may have indexed the new document to a different shard, which means that it is now in your index more than once, and which one gets returned may not be predictable. The same thing may be true if you are using composite routing and including information in the key that sends the document to a different shard from where it was originally indexed. Thanks, Shawn
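The distinction in SolrJ terms, as a sketch (field names hypothetical): a plain document add replaces the stored document wholesale, while giving a field a Map value turns the add into an atomic update that leaves unmentioned fields untouched:

    // full replacement: any field absent here disappears from the indexed doc
    SolrInputDocument full = new SolrInputDocument();
    full.addField("id", "doc1");
    full.addField("title", "new title");
    server.add(full);

    // atomic update: only title changes; all other stored fields survive
    SolrInputDocument partial = new SolrInputDocument();
    partial.addField("id", "doc1");
    partial.addField("title", java.util.Collections.singletonMap("set", "new title"));
    server.add(partial);

If old fields are surviving a supposed full reindex, it is worth checking that the indexing code is not accidentally producing the second form.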
SolrCloud delete by query performance
I have a collection with 1 billion documents and I want to delete 500 of them. The collection has a dozen shards and a couple replicas. Using Solr 4.4. Sent the delete query via HTTP: http://hostname:8983/solr/my_collection/update?stream.body=<delete><query>source:foo</query></delete> Took a couple minutes and several replicas got knocked into Recovery mode. They eventually came back and the desired docs were deleted but the cluster wasn't thrilled (high load, etc). Is this expected behavior? Is there a better way to delete documents that I'm missing? Thanks, Ryan
Re: SolrCloud delete by query performance
On 5/20/2015 5:41 PM, Ryan Cutter wrote: I have a collection with 1 billion documents and I want to delete 500 of them. The collection has a dozen shards and a couple replicas. Using Solr 4.4. Sent the delete query via HTTP: http://hostname:8983/solr/my_collection/update?stream.body= deletequerysource:foo/query/delete Took a couple minutes and several replicas got knocked into Recovery mode. They eventually came back and the desired docs were deleted but the cluster wasn't thrilled (high load, etc). Is this expected behavior? Is there a better way to delete documents that I'm missing? That's the correct way to do the delete. Before you'll see the change, a commit must happen in one way or another. Hopefully you already knew that. I believe that your setup has some performance issues that are making it very slow and knocking out your Solr nodes temporarily. The most common root problems with SolrCloud and indexes going into recovery are: 1) Your heap is enormous but your garbage collection is not tuned. 2) You don't have enough RAM, separate from your Java heap, for adequate index caching. With a billion documents in your collection, you might even be having problems with both. Here's a wiki page that includes some info on both of these problems, plus a few others: http://wiki.apache.org/solr/SolrPerformanceProblems Thanks, Shawn
Re: SolrCloud delete by query performance
On 5/20/2015 5:57 PM, Ryan Cutter wrote: GC is operating the way I think it should but I am lacking memory. I am just surprised because indexing is performing fine (documents going in) but deletions are really bad (documents coming out). Is it possible these deletes are hitting many segments, each of which I assume must be re-built? And if there isn't much slack memory laying around to begin with, there's a bunch of contention/swap? A deleteByQuery must first query the entire index to determine which IDs to delete. That's going to hit every segment. In the case of SolrCloud, it will also hit at least one replica of every single shard in the collection. If the data required to satisfy the query is not already sitting in the OS disk cache, then the actual disk must be read. When RAM is extremely tight, any disk operation will erase relevant data out of the OS disk cache, so the next time it is needed, it must be read off the disk again. Disks are SLOW. What I am describing is not swap, but the performance impact is similar to swapping. The actual delete operation (once the IDs are known) doesn't touch any segments ... it writes Lucene document identifiers to a .del file, and that file is consulted on all queries. Any deleted documents found in the query results are removed. Thanks, Shawn
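One way to soften the impact when the matching IDs are enumerable, sketched in SolrJ (whether it is actually gentler depends on the same cache pressure Shawn describes; exception handling omitted): run the query once yourself, then delete by id, which avoids re-running the query against every shard:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    SolrQuery query = new SolrQuery("source:foo");
    query.setFields("id");
    query.setRows(1000);  // comfortably above the ~500 expected matches
    QueryResponse rsp = server.query(query);
    List<String> ids = new ArrayList<String>();
    for (SolrDocument d : rsp.getResults()) {
        ids.add((String) d.getFieldValue("id"));
    }
    server.deleteById(ids);
    server.commit();

Delete-by-id requests route straight to the owning shard instead of fanning out to all of them.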
Re: SolrCloud delete by query performance
Shawn, thank you very much for that explanation. It helps a lot. Cheers, Ryan On Wed, May 20, 2015 at 5:07 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 5:57 PM, Ryan Cutter wrote: GC is operating the way I think it should but I am lacking memory. I am just surprised because indexing is performing fine (documents going in) but deletions are really bad (documents coming out). Is it possible these deletes are hitting many segments, each of which I assume must be re-built? And if there isn't much slack memory laying around to begin with, there's a bunch of contention/swap? A deleteByQuery must first query the entire index to determine which IDs to delete. That's going to hit every segment. In the case of SolrCloud, it will also hit at least one replica of every single shard in the collection. If the data required to satisfy the query is not already sitting in the OS disk cache, then the actual disk must be read. When RAM is extremely tight, any disk operation will erase relevant data out of the OS disk cache, so the next time it is needed, it must be read off the disk again. Disks are SLOW. What I am describing is not swap, but the performance impact is similar to swapping. The actual delete operation (once the IDs are known) doesn't touch any segments ... it writes Lucene document identifiers to a .del file, and that file is consulted on all queries. Any deleted documents found in the query results are removed. Thanks, Shawn
solr 5.x on glassfish/tomcat instead of jetty
I have read that solr 5.x has moved away from deployable WAR architecture to a runnable Java Application architecture. Our infrastructure/standards folks are adamant about not running SOLR on Jetty (as we are about to upgrade from 4.7.2 to 5.1), any ideas on how I can make it run on Glassfish or at least on Tomcat ?? And do I have to watch for any gotchas regarding the different containers or the upgrade itself ? Would love to hear from people who have already treaded down that path. Thanks Ravi Kiran Bhaskar
Re: When is too many fields in qf is too many?
The uf parameter is used to specify which fields a user may query against - the qf parameter specifies the set of fields that an unfielded query term must be queried against. The user is free to specify fielded query terms, like field1:term1 OR field2:term2. So, which use case are you really talking about? Could you try to point to a real-world example of where your use case might apply, so we can relate to it? Generally, I would say that a Solr document/collection should have no more than low hundreds of fields. It's not that you absolutely can't have more or absolutely can't have 5,000 or more, but simply that you will be asking for trouble, for example, with the cost of comprehending and maintaining and communicating your solution with others, including this mailing list for support. What specifically pushed you to have documents with 1500 fields? Also, is this 1500 fields that are always populated, or are there really a larger number of different record types, each with a relatively small number of fields populated in a particular document? -- Jack Krupansky On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields and each field name averages 15 characters long, in effect the data passed via qf will be over 20K characters. Given the above, beside the fact that a search for apple translating to a 20K characters passing over the network, what else within Solr and Lucene I should be worried about if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc. If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then use copyField into that group. During search, I than can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field. I'm using qf with edismax and my Solr version is 5.1. Thanks Steve
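For the per-group restriction angle, the two edismax parameters combine roughly like this (field names hypothetical):

    defType=edismax
    qf=title description     (fields searched when the user types bare terms)
    uf=title description     (fields the user may name explicitly, e.g. title:apple)

A fielded term naming a field outside uf is treated as plain text and searched against the qf fields rather than as a field reference, so uf is the parameter that actually fences users in.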
Re: When is too many fields in qf is too many?
Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a winner takes all point of view to search. Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here http://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/ I'm about to win the blasphemer merit badge, but ad-hoc all-field like searching over many fields is actually a good use case for Elasticsearch's cross field queries. https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fields_queries.html http://opensourceconnections.com/blog/2015/03/19/elasticsearch-cross-field-search-is-a-lie/ It wouldn't be hard (and actually a great feature for the project) to get the Lucene query associated with cross field search into Solr. You could easily write a plugin to integrate it into a query parser: https://github.com/elastic/elasticsearch/blob/master/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java Hope that helps -Doug -- *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com Author: Relevant Search http://manning.com/turnbull from Manning Publications This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such. On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields and each field name averages 15 characters long, in effect the data passed via qf will be over 20K characters. Given the above, beside the fact that a search for apple translating to a 20K characters passing over the network, what else within Solr and Lucene I should be worried about if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc. If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then use copyField into that group. During search, I than can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field. I'm using qf with edismax and my Solr version is 5.1. Thanks Steve
Re: solr 5.x on glassfish/tomcat instead of jetty
On 5/20/2015 9:07 AM, Ravi Solr wrote: I have read that Solr 5.x has moved away from a deployable WAR architecture to a runnable Java application architecture. Our infrastructure/standards folks are adamant about not running Solr on Jetty (we are about to upgrade from 4.7.2 to 5.1). Any ideas on how I can make it run on Glassfish, or at least on Tomcat? And do I have to watch for any gotchas regarding the different containers or the upgrade itself? I would love to hear from people who have already treaded down that path.

I really need to finish the wiki page on this topic. As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for the logging jars, and server/resources for the logging configuration. Consult your container's documentation to learn where to place these things.

At some point in the future, such deployments will no longer be possible, which is why the docs say you can't do it, even though you can. The project is preparing users for the eventual reality with a documentation change.

I'm wondering: if Jetty is good enough for the Google App Engine, why isn't it good enough for your infrastructure standards? Jetty is the only container that gets tested; I assure you that there are no tests in the Solr source code that make sure Glassfish works.

Thanks, Shawn
Re: Solr Cloud: No live SolrServers available
Erick, thanks for your response. The logs don't seem to show any explicit errors (I have the log level at INFO). I am attaching the logs from a 4.7 start and a 5.1 start here. Note that both logs seem to show the shards as Down initially, but for 5.1 the state changes to Active later on. Also, note that all the config files, libraries, jar files, etc. are the same for both Solr instances. Regards

On Tue, May 19, 2015 at 11:57 AM, Erick Erickson erickerick...@gmail.com wrote: What you've done _looks_ correct at a glance. Take a look at the Solr logs. Don't bother trying to index things unless and until your nodes are active; it won't happen. My first guess is that you have some error in your schema or solrconfig.xml files: syntax errors, typos, class names that are mis-typed, jars that are missing, whatever. If that's true, the Solr log (or the screen, if you're just running from the command line) will show big ugly stack traces. If nothing shows up in the logs then I'm puzzled, but what you describe is consistent with what I've seen in terms of having bad configs and trying to create a collection. Best, Erick

On Tue, May 19, 2015 at 4:33 AM, Chetan Vora chetanv...@gmail.com wrote: Hi all. We have a cluster of standalone Solr cores (Solr 4.3) for which we had built some custom plugins. I'm now trying to prototype converting the cluster to a SolrCloud cluster. This is how I am trying to deploy the cores (in 4.7.2):

1. Start Solr with ZooKeeper embedded:

  java -DzkRun -Djetty.port=8985 -jar start.jar

2. Upload a config into ZooKeeper (the same config as the standalone cores):

  zkcli.bat -zkhost localhost:9985 -cmd upconfig -confdir myconfig -confname myconfig

3. Create a new collection (mycollection) of 2 shards using the Collections API:

  http://localhost:8985/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=1&maxShardsPerNode=2&collection.configName=myconfig

So at this point I have two shards under my solr directory with the appropriate core.properties. But when I go to http://localhost:8985/solr/#/~cloud, I see that the two shards' status is Down, when they are supposed to be Active by default. And when I try to index documents in them using SolrJ (via the CloudSolrServer API), I get the error "No live SolrServers available to handle this request". I restarted Solr, but the same issue remains.

  private CloudSolrServer cloudSolr;
  cloudSolr = new CloudSolrServer(zkHOST);
  cloudSolr.setZkClientTimeout(zkClientTimeout);
  cloudSolr.setDefaultCollection(collectionName);
  cloudSolr.connect();
  cloudSolr.add(doc);

What am I doing wrong? I did a lot of digging around and saw an old Jira bug saying that SolrCloud shards won't be active until there are some documents in the index. If that is the reason, that's kind of a catch-22, isn't it? So anyway, I also tried adding some test documents manually and committed to see if things improved. Now, on the shard statistics page, it correctly gives me the numDocs count, but when I try to query it says "no servers hosting shard". I next tried passing in shards.tolerant=true as a query parameter and searched, but no cigar: it says 0 documents found. Any help would be appreciated. My main objective is to rebuild the old standalone cores using SolrCloud and test whether our custom request handlers still work as expected. And at this point, I can't index documents inside the 4.7 SolrCloud collection I have created.
I am trying to use a 4.x SolrCloud release because the internal APIs seem to have changed quite a bit in the 5.x releases, and our custom request handlers no longer work as expected. Thanks and Regards
Re: When is too many fields in qf too many?
On 5/20/2015 6:27 AM, Steven White wrote: My solution requires that users in group-A can only search against a set of fields-A, users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1,500 fields, and each field name averages 15 characters, so the data passed via qf will be over 20K characters. Given the above, besides the fact that a search for apple translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if anything? Will I hit some kind of limit? Will each search now require more CPU cycles? Memory? Etc.

You have two choices when queries become that large. One is to increase the max HTTP header size in the servlet container; in most containers, webservers, and proxy servers, this defaults to 8192 bytes. This approach works very well, but will not scale to extremely large sizes. I have done this on my indexes, because I regularly have queries in the 20K range, but I do not expect them to get very much larger than that.

The other option is to switch to sending a POST instead of a GET. The default max POST size that Solr sets is 2MB, which is plenty for just about any query, and it can easily be increased to much larger sizes. If you are using SolrJ, switching to POST is very easy; you'd need to do some research to figure out how if you're using another framework.

Thanks, Shawn
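For the SolrJ route Shawn mentions, the switch is a one-argument change on the query call. A minimal sketch, assuming SolrJ 5.x and a hypothetical core URL:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PostQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL; point this at your own core or collection.
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("apple");
        q.set("defType", "edismax");
        q.set("qf", "field_one field_two"); // in Steve's case, the ~20K field list
        // POST carries the parameters in the request body, so the servlet
        // container's HTTP header size limit no longer constrains qf.
        QueryResponse rsp = client.query(q, SolrRequest.METHOD.POST);
        System.out.println("hits: " + rsp.getResults().getNumFound());
        client.close();
    }
}

The two-argument query method is defined on the SolrClient base class, so the same change works with the other client implementations, including the cloud client.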