Re: Solr Heap, MMaps and Garbage Collection

2014-03-02 Thread Walter Underwood
New gen should be big enough to handle all allocations that have a lifetime of 
a single request, considering that you'll have multiple concurrent requests. If 
new gen routinely overflows, short-lived objects get promoted into the old gen.

Yes, you need to go to CMS.

I have usually seen the hit rates on the query result and doc caches be fairly 
similar, with the doc cache somewhat higher.

Cache hit rates depend on the number of queries between updates. If you update 
once per day and get a million queries or so, your hit rates can get pretty 
good.

70-80% seems typical for doc cache on an infrequently updated index. We stay 
around 75% on our busiest 4m doc index. 

The query result cache is the most important, because it saves the most work. 
Ours stays around 20%, but I should spend some time improving that.

Your perm gen size is very big. I think we run with 128 MB.
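
For reference, moving to CMS with a capped perm gen maps to JVM options along 
these lines (illustrative values, not a tuning recommendation):

-XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly \
-XX:MaxPermSize=128m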

wunder

On Mar 2, 2014, at 10:54 PM, KNitin  wrote:

> Thanks, Walter
> 
> Hit rate on the document caches is close to 70-80% and the filter caches
> are at 100% (since most of our queries filter on the same fields but
> have a different q parameter). The query result cache is not of great
> importance to me since its hit rate is almost negligible.
>
> Does this mean I need to increase the size of my filter and document caches
> for large indices?
>
> My 25 GB heap usage splits as follows:
>
> 1. 19 GB - Old Gen (100% pool utilization)
> 2. 3 GB - New Gen (50% pool utilization)
> 3. 2.8 GB - Perm Gen (I am guessing this is because of interned strings)
> 4. Survivor space is on the order of 300-400 MB and is almost always 100%
> full. (Is this a major issue?)
>
> We are also currently using the Parallel GC collector but planning to move to
> CMS for shorter stop-the-world GC times. If I increase the filter cache and
> document cache entry sizes, they would also go to the old gen, right?
>
> A very naive question: how is increasing the young gen going to help if we
> know that Solr is already pushing major caches and other objects to the old
> gen because of their nature? My young gen pool utilization is still well
> under 50%.
> 
> 
> Thanks
> Nitin
> 
> 
> On Sun, Mar 2, 2014 at 9:31 PM, Walter Underwood wrote:
> 
>> An LRU cache will always fill up the old generation. Old objects are
>> ejected, and those are usually in the old generation.
>> 
>> Increasing the heap size will not eliminate this. It will make major,
>> stop-the-world collections longer.
>> 
>> Increase the new generation size until the rate of old gen increase slows
>> down. Then choose a total heap size to control the frequency (and duration)
>> of major collections.
>> 
>> We run with the new generation at about 25% of the heap, so 8GB total and
>> a 2GB newgen.
>> 
>> A 512 entry cache is very small for query results or docs. We run with 10K
>> or more entries for those. The filter cache size depends on your usage. We
>> have only a handful of different filter queries, so a tiny cache is fine.
>> 
>> What is your hit rate on the caches?
>> 
>> wunder
>> 
>> On Mar 2, 2014, at 7:42 PM, KNitin  wrote:
>> 
>>> Hi
>>> 
>>> I have a very large index for a few collections, and when they are being
>>> queried, I see the old gen space close to 100% usage all the time. The
>>> system becomes extremely slow due to GC activity right after that, and it
>>> gets into this cycle very often.
>>>
>>> I have given Solr close to 30 GB of heap on a 65 GB RAM machine, and the
>>> rest is left to the OS. I have a lot of hits in the filter, query result
>>> and document caches, and the size of all the caches is around 512 entries
>>> per collection. Are all the caches used by Solr on or off heap?
>>>
>>>
>>> Given this scenario, where GC is the primary bottleneck, what are good
>>> recommended memory settings for Solr? Should I increase the heap memory
>>> (that will only postpone the problem until the heap becomes full again
>>> after a while)? Will memory maps help at all in this scenario?
>>> 
>>> 
>>> Kindly advise on the best practices
>>> Thanks
>>> Nitin
>> 
>> 
>> 

--
Walter Underwood
wun...@wunderwood.org





Re: Solr Heap, MMaps and Garbage Collection

2014-03-02 Thread Bernd Fehling
Actually, I haven't ever seen a PermGen of 2.8 GB,
so you must have a very special use case with Solr.

For my little index with 60 million docs and a 170 GB index size I gave
PermGen 82 MB, and it is only using 50.6 MB for a single VM.

Permanent Generation (PermGen) is completely separate from the heap.

Permanent Generation (non-heap):
The pool containing all the reflective data of the virtual machine itself,
such as class and method objects. With Java VMs that use class data sharing,
this generation is divided into read-only and read-write areas.
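
Since it lives outside the heap, PermGen is sized with its own JVM flags; an 
illustrative line (matching the 82 MB mentioned above, not a recommendation):

-XX:PermSize=82m -XX:MaxPermSize=82m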

Regards
Bernd


Am 03.03.2014 07:54, schrieb KNitin:
> Thanks, Walter
> 
> Hit rate on the document caches is close to 70-80% and the filter caches
> are at 100% (since most of our queries filter on the same fields but
> have a different q parameter). The query result cache is not of great
> importance to me since its hit rate is almost negligible.
>
> Does this mean I need to increase the size of my filter and document caches
> for large indices?
>
> My 25 GB heap usage splits as follows:
>
> 1. 19 GB - Old Gen (100% pool utilization)
> 2. 3 GB - New Gen (50% pool utilization)
> 3. 2.8 GB - Perm Gen (I am guessing this is because of interned strings)
> 4. Survivor space is on the order of 300-400 MB and is almost always 100%
> full. (Is this a major issue?)
>
> We are also currently using the Parallel GC collector but planning to move to
> CMS for shorter stop-the-world GC times. If I increase the filter cache and
> document cache entry sizes, they would also go to the old gen, right?
>
> A very naive question: how is increasing the young gen going to help if we
> know that Solr is already pushing major caches and other objects to the old
> gen because of their nature? My young gen pool utilization is still well
> under 50%.
> 
> 
> Thanks
> Nitin
> 
> 
> On Sun, Mar 2, 2014 at 9:31 PM, Walter Underwood wrote:
> 
>> An LRU cache will always fill up the old generation. Old objects are
>> ejected, and those are usually in the old generation.
>>
>> Increasing the heap size will not eliminate this. It will make major,
>> stop-the-world collections longer.
>>
>> Increase the new generation size until the rate of old gen increase slows
>> down. Then choose a total heap size to control the frequency (and duration)
>> of major collections.
>>
>> We run with the new generation at about 25% of the heap, so 8GB total and
>> a 2GB newgen.
>>
>> A 512 entry cache is very small for query results or docs. We run with 10K
>> or more entries for those. The filter cache size depends on your usage. We
>> have only a handful of different filter queries, so a tiny cache is fine.
>>
>> What is your hit rate on the caches?
>>
>> wunder
>>
>> On Mar 2, 2014, at 7:42 PM, KNitin  wrote:
>>
>>> Hi
>>>
>>> I have a very large index for a few collections, and when they are being
>>> queried, I see the old gen space close to 100% usage all the time. The
>>> system becomes extremely slow due to GC activity right after that, and it
>>> gets into this cycle very often.
>>>
>>> I have given Solr close to 30 GB of heap on a 65 GB RAM machine, and the
>>> rest is left to the OS. I have a lot of hits in the filter, query result
>>> and document caches, and the size of all the caches is around 512 entries
>>> per collection. Are all the caches used by Solr on or off heap?
>>>
>>>
>>> Given this scenario, where GC is the primary bottleneck, what are good
>>> recommended memory settings for Solr? Should I increase the heap memory
>>> (that will only postpone the problem until the heap becomes full again
>>> after a while)? Will memory maps help at all in this scenario?
>>>
>>>
>>> Kindly advise on the best practices
>>> Thanks
>>> Nitin
>>
>>
>>
> 

-- 
*
Bernd Fehling           Bielefeld University Library
Dipl.-Inform. (FH)      LibTec - Library Technology
Universitätsstr. 25     and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060   bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*


Re: Solr Heap, MMaps and Garbage Collection

2014-03-02 Thread KNitin
Thanks, Walter

Hit rate on the document caches is close to 70-80% and the filter caches
are at 100% (since most of our queries filter on the same fields but
have a different q parameter). The query result cache is not of great
importance to me since its hit rate is almost negligible.

Does this mean I need to increase the size of my filter and document caches
for large indices?

My 25 GB heap usage splits as follows:

1. 19 GB - Old Gen (100% pool utilization)
2. 3 GB - New Gen (50% pool utilization)
3. 2.8 GB - Perm Gen (I am guessing this is because of interned strings)
4. Survivor space is on the order of 300-400 MB and is almost always 100%
full. (Is this a major issue?)

We are also currently using the Parallel GC collector but planning to move to
CMS for shorter stop-the-world GC times. If I increase the filter cache and
document cache entry sizes, they would also go to the old gen, right?

A very naive question: how is increasing the young gen going to help if we
know that Solr is already pushing major caches and other objects to the old
gen because of their nature? My young gen pool utilization is still well
under 50%.


Thanks
Nitin


On Sun, Mar 2, 2014 at 9:31 PM, Walter Underwood wrote:

> An LRU cache will always fill up the old generation. Old objects are
> ejected, and those are usually in the old generation.
>
> Increasing the heap size will not eliminate this. It will make major,
> stop-the-world collections longer.
>
> Increase the new generation size until the rate of old gen increase slows
> down. Then choose a total heap size to control the frequency (and duration)
> of major collections.
>
> We run with the new generation at about 25% of the heap, so 8GB total and
> a 2GB newgen.
>
> A 512 entry cache is very small for query results or docs. We run with 10K
> or more entries for those. The filter cache size depends on your usage. We
> have only a handful of different filter queries, so a tiny cache is fine.
>
> What is your hit rate on the caches?
>
> wunder
>
> On Mar 2, 2014, at 7:42 PM, KNitin  wrote:
>
> > Hi
> >
> > I have a very large index for a few collections, and when they are being
> > queried, I see the old gen space close to 100% usage all the time. The
> > system becomes extremely slow due to GC activity right after that, and it
> > gets into this cycle very often.
> >
> > I have given Solr close to 30 GB of heap on a 65 GB RAM machine, and the
> > rest is left to the OS. I have a lot of hits in the filter, query result
> > and document caches, and the size of all the caches is around 512 entries
> > per collection. Are all the caches used by Solr on or off heap?
> >
> >
> > Given this scenario, where GC is the primary bottleneck, what are good
> > recommended memory settings for Solr? Should I increase the heap memory
> > (that will only postpone the problem until the heap becomes full again
> > after a while)? Will memory maps help at all in this scenario?
> >
> >
> > Kindly advise on the best practices
> > Thanks
> > Nitin
>
>
>


Re: stopwords issue with edismax

2014-03-02 Thread Jack Krupansky
As I suggested, you have a couple of fields that do not ignore stop words, so 
the stop word must be present in at least one of those fields:


(number:of^3.0 | all_code:of^2.0)

The solution would be to remove the "number" and "all_code" fields from qf.
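
A sketch of the request with those fields dropped (boosts taken from the 
parsed query above; the other parameter values are illustrative):

defType=edismax
q=bank of america
q.op=AND
qf=ent_name^7.0 all_text party^3.0 name^5.0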

-- Jack Krupansky

-Original Message- 
From: sureshrk19

Sent: Monday, March 3, 2014 1:05 AM
To: solr-user@lucene.apache.org
Subject: Re: stopwords issue with edismax

Jack,

Thanks for the reply.

Yes, your observation is right. I see stopwords are not being ignored at
query time.
Say I'm searching for 'bank of america'. I'm expecting 'of' should not be
part of the search.
But here I see 'of' is being sent. The query syntax is the same for the 'OR'
and 'AND' operators, and 'OR' returns results as expected. But in my case, I
want to use 'AND'.

Here is debug query information...

"parsedquery":"(+((DisjunctionMaxQuery((ent_name:bank^7.0 | all_text:bank |
number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0))
DisjunctionMaxQuery((number:of^3.0 | all_code:of^2.0))
DisjunctionMaxQuery((ent_name:america^7.0 | all_text:america |
number:america^3.0 | party:america^3.0 | all_code:america^2.0 |
name:america^5.0)))~3))/no_coord",
   "parsedquery_toString":"+(((ent_name:bank^7.0 | all_text:bank |
number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0)
(number:of^3.0 | all_code:of^2.0) (ent_name:america^7.0 | all_text:america |
number:america^3.0 | party:america^3.0 | all_code:america^2.0 |
name:america^5.0))~3)"

Is there any reason why stopwords are not being ignored? I checked
schema.xml and the stopword filter is present (attributes approximate):

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/stopwords-issue-with-edismax-tp4120339p4120815.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: stopwords issue with edismax

2014-03-02 Thread sureshrk19
Jack,

Thanks for the reply.

Yes, your observation is right. I see stopwords are not being ignored at
query time.
Say I'm searching for 'bank of america'. I'm expecting 'of' should not be
part of the search.
But here I see 'of' is being sent. The query syntax is the same for the 'OR'
and 'AND' operators, and 'OR' returns results as expected. But in my case, I
want to use 'AND'.

Here is debug query information...

"parsedquery":"(+((DisjunctionMaxQuery((ent_name:bank^7.0 | all_text:bank |
number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0))
DisjunctionMaxQuery((number:of^3.0 | all_code:of^2.0))
DisjunctionMaxQuery((ent_name:america^7.0 | all_text:america |
number:america^3.0 | party:america^3.0 | all_code:america^2.0 |
name:america^5.0)))~3))/no_coord",
"parsedquery_toString":"+(((ent_name:bank^7.0 | all_text:bank |
number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0)
(number:of^3.0 | all_code:of^2.0) (ent_name:america^7.0 | all_text:america |
number:america^3.0 | party:america^3.0 | all_code:america^2.0 |
name:america^5.0))~3)"

Is there any reason why stopwords are not being ignored? I checked
schema.xml and the stopword filter is present (attributes approximate):

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/stopwords-issue-with-edismax-tp4120339p4120815.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Heap, MMaps and Garbage Collection

2014-03-02 Thread Walter Underwood
An LRU cache will always fill up the old generation. Old objects are ejected, 
and those are usually in the old generation.

Increasing the heap size will not eliminate this. It will make major, 
stop-the-world collections longer.

Increase the new generation size until the rate of old gen increase slows down. 
Then choose a total heap size to control the frequency (and duration) of major 
collections.

We run with the new generation at about 25% of the heap, so 8GB total and a 2GB 
newgen.
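
As JVM options, that sizing would look something like this (illustrative 
values, not a recommendation for other hardware):

-Xms8g -Xmx8g -Xmn2g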

A 512 entry cache is very small for query results or docs. We run with 10K or 
more entries for those. The filter cache size depends on your usage. We have 
only a handful of different filter queries, so a tiny cache is fine.

What is your hit rate on the caches?

wunder

On Mar 2, 2014, at 7:42 PM, KNitin  wrote:

> Hi
> 
> I have a very large index for a few collections, and when they are being
> queried, I see the old gen space close to 100% usage all the time. The
> system becomes extremely slow due to GC activity right after that, and it
> gets into this cycle very often.
>
> I have given Solr close to 30 GB of heap on a 65 GB RAM machine, and the
> rest is left to the OS. I have a lot of hits in the filter, query result
> and document caches, and the size of all the caches is around 512 entries
> per collection. Are all the caches used by Solr on or off heap?
>
>
> Given this scenario, where GC is the primary bottleneck, what are good
> recommended memory settings for Solr? Should I increase the heap memory
> (that will only postpone the problem until the heap becomes full again
> after a while)? Will memory maps help at all in this scenario?
> 
> 
> Kindly advise on the best practices
> Thanks
> Nitin




Solr Heap, MMaps and Garbage Collection

2014-03-02 Thread KNitin
Hi

I have a very large index for a few collections, and when they are being
queried, I see the old gen space close to 100% usage all the time. The
system becomes extremely slow due to GC activity right after that, and it
gets into this cycle very often.

I have given Solr close to 30 GB of heap on a 65 GB RAM machine, and the
rest is left to the OS. I have a lot of hits in the filter, query result
and document caches, and the size of all the caches is around 512 entries
per collection. Are all the caches used by Solr on or off heap?


Given this scenario, where GC is the primary bottleneck, what are good
recommended memory settings for Solr? Should I increase the heap memory
(that will only postpone the problem until the heap becomes full again
after a while)? Will memory maps help at all in this scenario?


Kindly advise on the best practices
Thanks
Nitin


Re: SolrCloud: heartbeat succeeding while node has failing SSD?

2014-03-02 Thread Mark Miller
The heartbeat that keeps the node alive is the connection it maintains with 
ZooKeeper.

We don’t currently have anything built in that will actively make sure each 
node can serve queries and remove it from clusterstate.json if it cannot. If a 
replica is maintaining its connection with ZooKeeper and, in most cases, if it 
is accepting updates, it will appear up. Load balancing should handle the 
failures, but I guess it depends on how sticky the request failures are.

In the past, I’ve seen this handled on a different search engine by having a 
variety of external agent scripts that would occasionally attempt to do a 
query and, if things did not go right, kill the process to cause it to try to 
start up again (supervised process).
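
A rough sketch of that kind of agent, assuming a 4.x-era SolrJ client and a 
hypothetical supervisor restart command:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.SolrPing;

public class QueryWatchdog {
  public static void main(String[] args) throws Exception {
    // Ping the local replica directly, bypassing any load balancer.
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    while (true) {
      try {
        if (new SolrPing().process(solr).getStatus() != 0) {
          restart();
        }
      } catch (Exception e) {
        restart(); // timeouts and I/O errors count as failures too
      }
      Thread.sleep(30000);
    }
  }

  // Kill/restart via the supervising process; the command is a placeholder.
  static void restart() throws Exception {
    Runtime.getRuntime().exec(new String[] {"supervisorctl", "restart", "solr"});
  }
}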

I’m not sure what the right long term feature for Solr is here, but feel free 
to start a JIRA issue around it.

One simple improvement might even be a background thread that periodically 
checks some local readings and, depending on the results, pulls itself out of 
the mix as best it can (removes itself from clusterstate.json or simply closes 
its ZK connection).

- Mark

http://about.me/markrmiller

On Mar 2, 2014, at 3:42 PM, Gregg Donovan  wrote:

> We had a brief SolrCloud outage this weekend when a node's SSD began to
> fail but the node still appeared to be up to the rest of the SolrCloud
> cluster (i.e. still green in clusterstate.json). Distributed queries that
> reached this node would fail but whatever heartbeat keeps the node in the
> clusterstate.json must have continued to succeed.
> 
> We eventually had to power the node down to get it to be removed from
> clusterstate.json.
> 
> This is our first foray into SolrCloud, so I'm still somewhat fuzzy on what
> the default heartbeat mechanism is and how we may augment it to be sure
> that the disk is checked as part of the heartbeat and/or we verify that it
> can serve queries.
> 
> Any pointers would be appreciated.
> 
> Thanks!
> 
> --Gregg



SEVERE: org.apache.solr.common.SolrException: no field name specified in query and no default specified via 'df' param

2014-03-02 Thread eShard
Hi,
I'm using Solr 4.0 Final (yes, I know I need to upgrade)

I'm getting this error:
SEVERE: org.apache.solr.common.SolrException: no field name specified in
query and no default specified via 'df' param

And I applied this fix: https://issues.apache.org/jira/browse/SOLR-3646 
And unfortunately, the error persists.
I'm using a multi shard environment and the error is only happening on one
of the shards.
I've already updated about half of the other shards with the missing default
'df' entry in /browse, but the error persists on that one shard.
Can anyone tell me how to make the error go away?
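
For reference, the entry being added is along these lines (the "text" field 
name is an illustration; use your schema's default field):

<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">text</str>
  </lst>
</requestHandler>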

Thanks,



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SEVERE-org-apache-solr-common-SolrException-no-field-name-specified-in-query-and-no-default-specifiem-tp4120789.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr is NoSQL database or not?

2014-03-02 Thread Michael Sokolov

On 3/1/2014 6:53 PM, Jack Krupansky wrote:

NoSQL? To me it's just a marketing term, like Big Data.

Data store? That does imply support for persistence, as opposed to 
mere caching, but mere persistence doesn't assure that the store is 
suitable for use as a System of Record, which in my view is a 
requirement for a true database. So, I wouldn't assert that a data store is a 
database.

I agree, Jack.

Our experience has been that we don't actually need everything a true 
ACID "database" has to offer. In particular, we don't care all that much 
about the I (isolation) part, since we don't use Solr to store 
transactional data, just documents, which are loaded by a small number 
of writers that we coordinate. If I had to pick one thing, though, that 
would make you say "well, um, not really a database," it would be 
the transactional model: anyone commits, everyone sees the updates.


-Mike


SolrCloud: heartbeat succeeding while node has failing SSD?

2014-03-02 Thread Gregg Donovan
We had a brief SolrCloud outage this weekend when a node's SSD began to
fail but the node still appeared to be up to the rest of the SolrCloud
cluster (i.e. still green in clusterstate.json). Distributed queries that
reached this node would fail but whatever heartbeat keeps the node in the
clusterstate.json must have continued to succeed.

We eventually had to power the node down to get it to be removed from
clusterstate.json.

This is our first foray into SolrCloud, so I'm still somewhat fuzzy on what
the default heartbeat mechanism is and how we may augment it to be sure
that the disk is checked as part of the heartbeat and/or we verify that it
can serve queries.

Any pointers would be appreciated.

Thanks!

--Gregg


Re: Cluster state ranges are all null after reboot

2014-03-02 Thread Greg Pendlebury
Thanks again for the info. Hopefully we find some more clues if it
continues to occur. The ops team are looking at alternative deployment
methods as well, so we might end up avoiding the issue altogether.

Ta,
Greg


On 28 February 2014 02:42, Shalin Shekhar Mangar wrote:

> I think it is just a side-effect of the current implementation that
> the ranges are assigned linearly. You can also verify this by choosing
> a document from each shard and running its uniqueKey against the
> CompositeIdRouter's sliceHash method and verifying that it is included
> in the range.
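>
> A rough sketch of that check in Java (assuming the Solr 4.x solrj/core jars
> on the classpath; the document id is a placeholder):
>
> import org.apache.solr.common.cloud.CompositeIdRouter;
> import org.apache.solr.common.cloud.DocRouter;
>
> public class RangeCheck {
>   public static void main(String[] args) {
>     CompositeIdRouter router = new CompositeIdRouter();
>     // Recompute the 15 linear partitions of the full hash range.
>     for (DocRouter.Range r : router.partitionRange(15, router.fullRange())) {
>       System.out.println(r);
>     }
>     // Hash a uniqueKey and check which range it falls into.
>     int hash = router.sliceHash("some-doc-id", null, null, null);
>     System.out.println(Integer.toHexString(hash));
>   }
> }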
>
> I couldn't reproduce this but I didn't try too hard either. If you are
> able to isolate a reproducible example then please do report back.
> I'll spend some time to review the related code again to see if I can
> spot the problem.
>
> On Thu, Feb 27, 2014 at 2:19 AM, Greg Pendlebury
>  wrote:
> > Thanks Shalin, that code might be helpful... do you know if there is a
> > reliable way to line up the ranges with the shard numbers? When the
> problem
> > occurred we had 80 million documents already in the index, and could not
> > issue even a basic 'deleteById' call. I'm tempted to assume they are just
> > assigned linearly since our Test and Prod clusters both look to work that
> > way now, but I can't be sure whether that is by design or just
> happenstance
> > of boot order.
> >
> > And no, unfortunately we have not been able to reproduce this issue
> > consistently despite trying a number of different things such as
> graceless
> > stop/start and screwing with the underlying WAR file (which is what we
> > thought puppet might be doing). The problem has occurred twice since, but
> > always in our Test environment. The fact that Test has only a single
> > replica per shard is the most likely culprit for me, but as mentioned,
> even
> > gracelessly killing the last replica in the cluster seems to leave the
> > range set correctly in clusterstate when we test it in isolation.
> >
> > In production (45 JVMs, 15 shards with 3 replicas each) we've never seen
> > the problem, despite a similar number of rollouts for version changes
> etc.
> >
> > Ta,
> > Greg
> >
> >
> >
> >
> > On 26 February 2014 23:46, Shalin Shekhar Mangar  >wrote:
> >
> >> If you have 15 shards and assuming that you've never used shard
> >> splitting, you can calculate the shard ranges by using new
> >> CompositeIdRouter().partitionRange(15, new
> >> CompositeIdRouter().fullRange())
> >>
> >> This gives me:
> >> [8000-9110, 9111-a221, a222-b332,
> >> b333-c443, c444-d554, d555-e665,
> >> e666-f776, f777-887, 888-1998,
> >> 1999-2aa9, 2aaa-3bba, 3bbb-4ccb,
> >> 4ccc-5ddc, 5ddd-6eed, 6eee-7fff]
> >>
> >> Have you done any more investigation into why this happened? Anything
> >> strange in the logs? Are you able to reproduce this in a test
> >> environment?
> >>
> >> On Wed, Feb 19, 2014 at 5:16 AM, Greg Pendlebury
> >>  wrote:
> >> > We've got a 15 shard cluster spread across 3 hosts. This morning our
> >> puppet
> >> > software rebooted them all and afterwards the 'range' for each shard
> has
> >> > become null in zookeeper. Is there any way to restore this value
> short of
> >> > rebuilding a fresh index?
> >> >
> >> > I've read various questions from people with a similar problem,
> although
> >> in
> >> > those cases it is usually a single shard that has become null allowing
> >> them
> >> > to infer what the value should be and manually fix it in ZK. In this
> >> case I
> >> > have no idea what the ranges should be. This is our test cluster, and
> >> > checking production I can see that the ranges don't appear to be
> >> > predictable based on the shard number.
> >> >
> >> > I'm also not certain why it even occurred. Our test cluster only has a
> >> > single replica per shard, so when a JVM is rebooted the cluster is
> >> > unavailable... would that cause this? Production has 3 replicas so we
> can
> >> > do rolling reboots.
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Shalin Shekhar Mangar.
> >>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: How to best handle search like Dave & David

2014-03-02 Thread Arun Rangarajan
If you are trying to serve results as users are typing, then you can use
EdgeNGramFilter (see
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
).

Let's say you configure your field like this, as shown in the Solr wiki
(gram sizes here follow the token example below):

<fieldType name="text_edge_ngram" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
   </analyzer>
</fieldType>

Then this is what happens at index time for your tokens:

David ---> | LowerCaseTokenizerFactory | ---> david ---> | EdgeNGramFilterFactory | ---> da dav davi david
Dave  ---> | LowerCaseTokenizerFactory | ---> dave  ---> | EdgeNGramFilterFactory | ---> da dav dave

And at query time, when your user enters 'Dav', it will match both those
tokens. Note that the moment your user types more, say 'davi', it won't
match 'Dave', since you are doing edge n-gramming only at index time and not
at query time. You can also do edge n-gramming at query time if you want
'Dave' to match 'David', probably keeping a larger minGramSize (in this case
3) to avoid noise (like, say, 'Dave' matching 'Dana', though with a lower
score), but it will be expensive to do n-gramming at query time.
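
A sketch of a query analyzer with gramming enabled as well, per the note above 
(minGramSize raised to 3; the maxGramSize value is an assumption):

<analyzer type="query">
   <tokenizer class="solr.LowerCaseTokenizerFactory"/>
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
</analyzer>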




On Fri, Feb 28, 2014 at 3:22 PM, Susheel Kumar <
susheel.ku...@thedigitalgroup.net> wrote:

> Hi,
>
> We have name searches on Solr for millions of documents. One user may search
> like "Morrison Dave" while another may search like "Morrison David". What's
> the best way to handle this so that both bring similar results? Adding
> synonyms is the option we are using right now.
>
> But we may need to add around 50,000+ synonyms for different names; for each
> specific name there can be a couple of synonyms, e.g. for Richard it can be
> Rich, Rick, Richie etc.
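>
> An illustrative synonyms.txt entry for this (names as examples):
>
> Richard,Rich,Rick,Richie
> David,Dave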
>
> Any experience adding so many synonyms, or any other thoughts? Stemming may
> help in a few situations, but not for cases like Dave and David.
>
> Thanks,
> Susheel
>


Re: Date query not returning results only some time

2014-03-02 Thread Arun Rangarajan
Erick,
Thanks a lot for the detailed explanation. That made things clearer for me.


On Sun, Mar 2, 2014 at 10:04 AM, Erick Erickson wrote:

> Well, in M/S setups the master shouldn't be searching at all,
> but that's a nit.
>
> That aside, whether the master has opened a new
> searcher or not is irrelevant to what the slave replicates.
> What _is_ relevant is whether any of the files on disk that
> comprise the index (i.e. the segment files) have been
> changed. Really, if any of them have been closed/merged
> whatever since the last sync. Imagine it like this (this isn't
> quite what happens, but it's a useful model). The slave
> says "here's a list of my segments, is it the same as the
> list of closed segments on the master?" If the answer
> is no, a replication is performed. Actually, this is done
> much more efficiently, but that's the idea.
>
> You seem to be really asking about the whole issue of whether
> searches on the various nodes (master + slaves) is
> consistent. This is one of the problems with M/S setups, they
> can be different by whatever has happened in the polling interval.
>
> The state of the master's searchers just doesn't enter the picture.
>
> Glad the problem is solved no matter what.
>
> Erick
>
> On Sat, Mar 1, 2014 at 10:26 PM, Arun Rangarajan
>  wrote:
> >> The slave is polling the master after the interval specified in
> > solrconfig.xml. The slave essentially asks "has anything changed?" If
> so, the
> > changes are brought down to the slave.
> > Yes, I understand this, but if master does not open a new searcher after
> > auto commits (which would indicate that the new index is not quite ready
> > yet) and if master is still using the old index to serve search
> requests, I
> > would expect the slave to do the same as well. Or the slave should at
> least
> > not replicate or not open a new searcher, until the master opened a new
> > searcher. But that is just the way I see it and it may be wrong.
> >
> >> What's your polling interval on the slave anyway? Sounds like it's quite
> > frequent if you notice this immediately after the DIH starts.
> > No, polling interval is set to 1 hour, but the full import was set to run
> > at 1 AM. I believe a delete followed by few docs got replicated after the
> > first few auto commits when the slave probably polled around 1:10 AM and
> > slave index had few docs for an hour before the next polling happened,
> > which is why the date query was returning empty results for exactly that
> > one hour. (The full index takes about 1.5 hours to finish.)
> >
> > Anyway the problem is now solved by specifying "clean=false" in the DIH
> > full import command.
> >
> >
> > On Sat, Mar 1, 2014 at 9:12 AM, Erick Erickson  >wrote:
> >
> >> bq: the slave anyway replicates the index after auto commits! (Is this
> >> desired behavior?)
> >>
> >> Absolutely it's desired behavior. The slave is polling the master
> >> after the interval
> >> specified in solrconfig.xml. The slave essentially asks "has anything
> >> changed?" If so,
> >> the changes are brought down to the slave. And by definition, commits
> >> change the index,
> >> especially if all docs have been deleted
> >>
> >> What's your polling interval on the slave anyway? Sounds like it's
> >> quite frequent if you
> >> notice this immediately after the DIH starts.
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Feb 28, 2014 at 9:04 PM, Arun Rangarajan
> >>  wrote:
> >> > I believe I figured out what the issue is. Even though we do not open
> a
> >> new
> >> > searcher on master during full import, the slave anyway replicates the
> >> > index after auto commits! (Is this desired behavior?) Since
> "clean=true"
> >> > this meant all the docs were deleted on slave and a partial index got
> >> > replicated! The reason only the date query did not return any results
> is
> >> > because recently created docs have higher doc IDs and we index by
> >> ascending
> >> > order of IDs!
> >> >
> >> > I believe I have two options:
> >> > - as Chris suggested I have to use "clean=false" so the existing docs
> are
> >> > not deleted first on the slave. Since we have primary keys, newly
> added
> >> > docs will overwrite old docs as they get added.
> >> > - disable replication after commits. Replicate only after optimize.
> >> >
> >> > Thx all for your help.
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Fri, Feb 28, 2014 at 8:06 PM, Arun Rangarajan
> >> > wrote:
> >> >
> >> >> Thx, Erick and Chris.
> >> >>
> >> >> This is indeed very strange. Other queries which do not restrict by
> the
> >> >> date field are returning results, so the index is definitely not
> empty.
> >> Has
> >> >> it got something to do with the date query part, with NOW/DAY or
> >> something
> >> >> in here?
> >> >> first_publish_date:[NOW/DAY-33DAYS TO NOW/DAY-3DAYS]
> >> >>
> >> >> For now, I have set up a script to just log the number of docs on the
> >> >> slave every minute. Will monitor and report the findings.
> >> >>
> >> >>
> >> >

Re: Elevation and core create

2014-03-02 Thread Erick Erickson
Hmmm, you _ought_ to be able to specify a relative path in the replication
handler's confFiles, e.g.
<str name="confFiles">solrconfig_slave.xml:solrconfig.xml,x.xml,y.xml</str>

But there's certainly the chance that this is hard-coded in
the query elevation component so I can't say that this'll work
with assurance.

Best,
Erick

On Sun, Mar 2, 2014 at 6:14 AM, David Stuart  wrote:
> Hi, sorry for the cross-post, but I got no response in the dev group, so I 
> assumed I posted in the wrong place.
>
>
>
> I am using Solr 3.6 and am trying to automate the deployment of cores with a 
> custom elevate file. It is proving to be difficult: most of the files (schema, 
> stop words, etc.) support an absolute path, but elevate seems to need to be in 
> either a conf directory as a sibling to data or in the data directory itself. 
> I am able to achieve my goal by having a secondary process that places the 
> file, but thought I would ask the group just in case I have missed the 
> obvious. If I move to Solr 4, is it fixed there? I could also go down the 
> route of extending the SolrCore create function to accept additional params 
> and move the file into the defined data directory.
>
> Ideas?
>
> Thanks for your help
> David Stuart
> M  +44(0) 778 854 2157
> T   +44(0) 845 519 5465
> www.axistwelve.com
> Axis12 Ltd | The Ivories | 6/18 Northampton Street, London | N1 2HY | UK
>
> AXIS12 - Enterprise Web Solutions
>
> Reg Company No. 7215135
> VAT No. 997 4801 60
>
> This e-mail is strictly confidential and intended solely for the ordinary 
> user of the e-mail account to which it is addressed. If you have received 
> this e-mail in error please inform Axis12 immediately by return e-mail or 
> telephone. We advise that in keeping with good computing practice the 
> recipient of this e-mail should ensure that it is virus free. We do not 
> accept any responsibility for any loss or damage that may arise from the use 
> of this email or its contents.
>
>
>


Re: Date query not returning results only some time

2014-03-02 Thread Erick Erickson
Well, in M/S setups the master shouldn't be searching at all,
but that's a nit.

That aside, whether the master has opened a new
searcher or not is irrelevant to what the slave replicates.
What _is_ relevant is whether any of the files on disk that
comprise the index (i.e. the segment files) have been
changed. Really, if any of them have been closed/merged
whatever since the last sync. Imagine it like this (this isn't
quite what happens, but it's a useful model). The slave
says "here's a list of my segments, is it the same as the
list of closed segments on the master?" If the answer
is no, a replication is performed. Actually, this is done
much more efficiently, but that's the idea.
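
You can watch this handshake by hand with the replication handler's raw
commands (host and core names are placeholders):

http://master:8983/solr/core1/replication?command=indexversion
http://master:8983/solr/core1/replication?command=filelist&generation=<generation>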

You seem to be really asking about the whole issue of whether
searches on the various nodes (master + slaves) are
consistent. This is one of the problems with M/S setups: they
can be different by whatever has happened in the polling interval.

The state of the master's searchers just doesn't enter the picture.

Glad the problem is solved no matter what.

Erick

On Sat, Mar 1, 2014 at 10:26 PM, Arun Rangarajan
 wrote:
>> The slave is polling the master after the interval specified in
> solrconfig.xml. The slave essentially asks "has anything changed?" If so, the
> changes are brought down to the slave.
> Yes, I understand this, but if master does not open a new searcher after
> auto commits (which would indicate that the new index is not quite ready
> yet) and if master is still using the old index to serve search requests, I
> would expect the slave to do the same as well. Or the slave should at least
> not replicate or not open a new searcher, until the master opened a new
> searcher. But that is just the way I see it and it may be wrong.
>
>> What's your polling interval on the slave anyway? Sounds like it's quite
> frequent if you notice this immediately after the DIH starts.
> No, polling interval is set to 1 hour, but the full import was set to run
> at 1 AM. I believe a delete followed by few docs got replicated after the
> first few auto commits when the slave probably polled around 1:10 AM and
> slave index had few docs for an hour before the next polling happened,
> which is why the date query was returning empty results for exactly that
> one hour. (The full index takes about 1.5 hours to finish.)
>
> Anyway the problem is now solved by specifying "clean=false" in the DIH
> full import command.
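>
> (That is, something like
> http://master:8983/solr/core1/dataimport?command=full-import&clean=false
> with host and core as placeholders.)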
>
>
> On Sat, Mar 1, 2014 at 9:12 AM, Erick Erickson wrote:
>
>> bq: the slave anyway replicates the index after auto commits! (Is this
>> desired behavior?)
>>
>> Absolutely it's desired behavior. The slave is polling the master
>> after the interval
>> specified in solrconfig.xml. The slave essentially asks "has anything
>> changed?" If so,
>> the changes are brought down to the slave. And by definition, commits
>> change the index,
>> especially if all docs have been deleted
>>
>> What's your polling interval on the slave anyway? Sounds like it's
>> quite frequent if you
>> notice this immediately after the DIH starts.
>>
>> Best,
>> Erick
>>
>> On Fri, Feb 28, 2014 at 9:04 PM, Arun Rangarajan
>>  wrote:
>> > I believe I figured out what the issue is. Even though we do not open a
>> new
>> > searcher on master during full import, the slave anyway replicates the
>> > index after auto commits! (Is this desired behavior?) Since "clean=true"
>> > this meant all the docs were deleted on slave and a partial index got
>> > replicated! The reason only the date query did not return any results is
>> > because recently created docs have higher doc IDs and we index by
>> ascending
>> > order of IDs!
>> >
>> > I believe I have two options:
>> > - as Chris suggested I have to use "clean=false" so the existing docs are
>> > not deleted first on the slave. Since we have primary keys, newly added
>> > docs will overwrite old docs as they get added.
>> > - disable replication after commits. Replicate only after optimize.
>> >
>> > Thx all for your help.
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Feb 28, 2014 at 8:06 PM, Arun Rangarajan
>> > wrote:
>> >
>> >> Thx, Erick and Chris.
>> >>
>> >> This is indeed very strange. Other queries which do not restrict by the
>> >> date field are returning results, so the index is definitely not empty.
>> Has
>> >> it got something to do with the date query part, with NOW/DAY or
>> something
>> >> in here?
>> >> first_publish_date:[NOW/DAY-33DAYS TO NOW/DAY-3DAYS]
>> >>
>> >> For now, I have set up a script to just log the number of docs on the
>> >> slave every minute. Will monitor and report the findings.
>> >>
>> >>
>> >> On Fri, Feb 28, 2014 at 6:49 PM, Chris Hostetter <
>> hossman_luc...@fucit.org
>> >> > wrote:
>> >>
>> >>>
>> >>> : This is odd. The full import, I think, deletes the
>> >>> : docs in the index when it starts.
>> >>>
>> >>> Yeah, if you are doing a full-import every day, and you don't want it to
>> >>> delete all docs when it starts, you need to specify "clean=false"
>> >>>
>> >>>
>

Re: SolrCloud plugin

2014-03-02 Thread Shalin Shekhar Mangar
Perhaps you just need StatsComponent?

https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
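
For example (collection and field names are placeholders), a distributed
request like

http://host:8983/solr/collection1/select?q=*:*&rows=0&stats=true&stats.field=price

returns min, max, sum, count, mean and stddev for the field, aggregated
across all shards.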

On Sun, Mar 2, 2014 at 6:32 AM, Soumitra Kumar  wrote:
> In general, yes.
>
> I don't know how SolrCloud serves a distributed query. What does it do on the
> shards, and what on the server serving the query?
> On Mar 1, 2014 2:58 PM, "Furkan KAMACI"  wrote:
>
>> Hi;
>>
>> Ok, I see that your aim is different. Do you want to implement something
>> similar to the Map/Reduce paradigm?
>>
>> Thanks;
>> Furkan KAMACI
>>
>>
>> 2014-03-02 0:09 GMT+02:00 Soumitra Kumar :
>>
>> > I want to add a command to calculate the average of some numeric field.
>> > How do I efficiently do this when data is split across multiple shards? I
>> > would like to do the computation on each shard, and then aggregate the
>> > result.
>> >
>> >
>> > On Sat, Mar 1, 2014 at 1:51 PM, Furkan KAMACI > > >wrote:
>> >
>> > > Hi;
>> > >
>> > > I've written a dashboard for that kind of purpose and I will make it
>> > > open source soon. You can get information about SolrCloud via SolrJ or
>> > > interact with ZooKeeper. Could you explain more about what you want to
>> > > do? Which kind of results do you want to aggregate for your SolrCloud
>> > > installation?
>> > >
>> > > Thanks;
>> > > Furkan KAMACI
>> > >
>> > >
>> > > 2014-03-01 23:39 GMT+02:00 Soumitra Kumar :
>> > >
>> > > > Hello,
>> > > >
>> > > > I want to write a plugin for a SolrCloud installation.
>> > > >
>> > > > I could not find where and how to aggregate the results from all
>> > shards,
>> > > > please give some pointers.
>> > > >
>> > > > Thanks,
>> > > > -Soumitra.
>> > > >
>> > >
>> >
>>



-- 
Regards,
Shalin Shekhar Mangar.


Elevation and core create

2014-03-02 Thread David Stuart
Hi, sorry for the cross-post, but I got no response in the dev group, so I 
assumed I posted in the wrong place.



I am using Solr 3.6 and am trying to automate the deployment of cores with a 
custom elevate file. It is proving to be difficult: most of the files (schema, 
stop words, etc.) support an absolute path, but elevate seems to need to be in 
either a conf directory as a sibling to data or in the data directory itself. 
I am able to achieve my goal by having a secondary process that places the 
file, but thought I would ask the group just in case I have missed the obvious. 
If I move to Solr 4, is it fixed there? I could also go down the route of 
extending the SolrCore create function to accept additional params and move 
the file into the defined data directory.

Ideas?

Thanks for your help
David Stuart
M  +44(0) 778 854 2157
T   +44(0) 845 519 5465
www.axistwelve.com
Axis12 Ltd | The Ivories | 6/18 Northampton Street, London | N1 2HY | UK

AXIS12 - Enterprise Web Solutions

Reg Company No. 7215135
VAT No. 997 4801 60

This e-mail is strictly confidential and intended solely for the ordinary user 
of the e-mail account to which it is addressed. If you have received this 
e-mail in error please inform Axis12 immediately by return e-mail or telephone. 
We advise that in keeping with good computing practice the recipient of this 
e-mail should ensure that it is virus free. We do not accept any responsibility 
for any loss or damage that may arise from the use of this email or its 
contents.