Re: Searchquery on field that contains space

2014-01-09 Thread Ahmet Arslan
Hi Peter,

Here are two different ways to do it.

1) Use a phrase query, q=yourField:"new y", with the following type.

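[the type definition did not survive the archive; a minimal sketch consistent
with the KeywordTokenizer advice elsewhere in this thread, with an
illustrative type name:]

<fieldType name="text_keyword_ci" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>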

2) Use a prefix query, q={!prefix f=yourField}new y, with a similar keyword-tokenized type. See:

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-PrefixQueryParser


By the way, I don't post on StackOverflow.

Ahmet



On Thursday, January 9, 2014 7:51 PM, PeterKerk  wrote:
Hi Ahmet,

Thanks, also for that link, although it's too advanced for my use case.

I see that by using KeywordTokenizerFactory it almost works now, but when I
search on "new y", no results are found, whereas when I search on "new", I do
get "New York".

So the space in the search query is still causing problems; what could the
cause be?

Thanks again!

PS: are you guys (you, Erick, Maurice, etc.) also active on
StackOverflow? At least there you'd get the credit for good support :)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searchquery-on-field-that-contains-space-tp4110166p4110515.html

Sent from the Solr - User mailing list archive at Nabble.com.



Copying Index

2014-01-09 Thread anand chandak

Hi,


I am testing the replication feature of Solr 4.x with a large index. 
Unfortunately, the index we had was in the 3.x format, so I copied 
the index files and ran the upgrade-index utility to convert them to the 
4.x format. The utility did what it is supposed to do, and I now have a 4.x 
index (verified with CheckIndex). However, when I replicate, those files 
are not being transferred. What could be the issue here? Any suggestions?
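(For reference, the upgrade and verification were invoked roughly like this;
the jar version and paths are illustrative:

java -cp lucene-core-4.6.0.jar org.apache.lucene.index.IndexUpgrader /path/to/index
java -cp lucene-core-4.6.0.jar org.apache.lucene.index.CheckIndex /path/to/index
)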


Thanks,

Anand



Re: Searchquery on field that contains space

2014-01-09 Thread Alexandre Rafalovitch
On Thu, Jan 9, 2014 at 11:34 PM, PeterKerk  wrote:

> Basically a user starts typing the first letters of a city, and I want to
> return city names that start with those letters, case-insensitively and
> without splitting the city name into separate words (whether the separator
> is whitespace or a "-").
> But although the user's search is case-insensitive, I want to return the
> values including casing: a search on "new york" would return "New York",
> where the latter is how it's stored in my MS-SQL DB.
>

Did you have a look at the Analyzing Suggester? It might be a better match for
your needs:
http://blog.mikemccandless.com/2012/09/lucenes-new-analyzing-suggester.html

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


Re: Index size - to determine storage

2014-01-09 Thread Alexandre Rafalovitch
Try running the PDF through standalone Tika and see what comes back. That's
the size of the input. It will usually be quite a small proportion of the PDF
size, possibly down to metadata only and no text if your PDF does not include
a text layer.
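For example, something along these lines (the tika-app jar version is
illustrative):

java -jar tika-app-1.4.jar --text yourfile.pdf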

Then, it depends on your storing and indexing options, your tokenizers,
whether you are using ngrams, synonyms or anything else that multiplies the
content. And so on.

And remember that you need roughly two to three times more disk space than a
single index occupies, for when Solr does segment merges.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Jan 10, 2014 at 1:55 AM, Amit Jha  wrote:

> Hi,
>
> I would like to know: if I index a file, i.e. a PDF of 100KB, what would be
> the size of the index? What factors should be considered to determine the
> disk size?
>
> Rgds
> AJ


Re: Solr Cloud Query Scaling

2014-01-09 Thread Joel Bernstein
You do need to load balance the initial query request across the SolrCloud
nodes. SolrJ's CloudSolrServer and LBHttpSolrServer can perform the load
balancing for you in the client, or you can use a hardware load balancer.

Joel Bernstein
Search Engineer at Heliosearch


On Thu, Jan 9, 2014 at 5:58 PM, Shawn Heisey  wrote:

> On 1/9/2014 4:09 PM, Garth Grimm wrote:
>
>> As a follow-up question on this
>>
>> One would want to use some kind of load balancing 'above' the SolrCloud
>> installation for search queries, correct?  To ensure that the initial
>> requests would get distributed evenly to all nodes?
>>
>> If you don't have that, and send all requests to M2S2 (IRT OP), it would
>> be the only node that would ever act as controller, and it could become a
>> bottleneck that further replicas won't be able to alleviate.  Correct?
>>
>> Or is there something in SolrCloud itself that evenly distributes the
>> controller role, regardless of which node the query initially arrives at?
>>
>
> Queries are automatically load balanced across the cloud, even if they all
> hit the same host.  This *probably* includes the controller role, but I am
> not sure about that.
>
> Unless you are using a zookeeper aware client, a load balancer is a good
> idea just from a redundancy perspective -- if the host you're hitting goes
> down, you'll want to automatically switch to another one.  The only
> zookeeper aware client that I know of is CloudSolrServer, which is part of
> SolrJ and allows you to write Java programs that access Solr.
>
> Thanks,
> Shawn
>
>


Re: need help on OpenNLP with Solr

2014-01-09 Thread Lance Norskog
There is no way to do these things with LUCENE-2899.


On Mon, Jan 6, 2014 at 8:07 AM, rashi gandhi wrote:

> Hi,
>
>
>
> I have applied the OpenNLP patch (LUCENE-2899.patch) to Solr 4.5.1 for NLP
> searching, and it is working fine.
>
> Also I have designed an analyzer for this:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.OpenNLPTokenizerFactory"
>        sentenceModel="opennlp/en-test-sent.bin"
>        tokenizerModel="opennlp/en-test-tokenizer.bin"/>
>     <filter class="solr.StopFilterFactory"
>        ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.OpenNLPFilterFactory" posTaggerModel="opennlp/en-pos-maxent.bin"/>
>     <filter class="solr.OpenNLPFilterFactory" nerTaggerModels="opennlp/en-ner-person.bin"/>
>     <filter class="solr.OpenNLPFilterFactory" nerTaggerModels="opennlp/en-ner-location.bin"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.OpenNLPTokenizerFactory"
>        sentenceModel="opennlp/en-test-sent.bin"
>        tokenizerModel="opennlp/en-test-tokenizer.bin"/>
>     <filter class="solr.StopFilterFactory"
>        ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.OpenNLPFilterFactory" posTaggerModel="opennlp/en-pos-maxent.bin"/>
>     <filter class="solr.OpenNLPFilterFactory" nerTaggerModels="opennlp/en-ner-person.bin"/>
>     <filter class="solr.OpenNLPFilterFactory" nerTaggerModels="opennlp/en-ner-location.bin"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
> I can see that the posTaggerModel is tagging the phrases and adding the
> payloads (but I am not able to analyze it).
>
> My Question is:
> Can I search a phrase giving a higher boost to NOUNs than to VERBs?
> For example, if I am searching "sitting on blanket", I want to give a high
> boost to the NOUN terms first, then the VERBs, as tagged by OpenNLP.
> How can I use payloads for boosting?
> What are the changes required in schema.xml?
>
> Please provide me with some pointers to move ahead.
>
> Thanks in advance
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr Cloud Query Scaling

2014-01-09 Thread Shawn Heisey

On 1/9/2014 4:09 PM, Garth Grimm wrote:

As a follow-up question on this

One would want to use some kind of load balancing 'above' the SolrCloud 
installation for search queries, correct?  To ensure that the initial requests 
would get distributed evenly to all nodes?

If you don't have that, and send all requests to M2S2 (IRT OP), it would be the 
only node that would ever act as controller, and it could become a bottleneck 
that further replicas won't be able to alleviate.  Correct?

Or is there something in SolrCloud itself that evenly distributes the 
controller role, regardless of which node the query initially arrives at?


Queries are automatically load balanced across the cloud, even if they 
all hit the same host.  This *probably* includes the controller role, 
but I am not sure about that.


Unless you are using a zookeeper aware client, a load balancer is a good 
idea just from a redundancy perspective -- if the host you're hitting 
goes down, you'll want to automatically switch to another one.  The only 
zookeeper aware client that I know of is CloudSolrServer, which is part 
of SolrJ and allows you to write Java programs that access Solr.


Thanks,
Shawn



RE: Solr Cloud Query Scaling

2014-01-09 Thread Garth Grimm
As a follow-up question on this

One would want to use some kind of load balancing 'above' the SolrCloud 
installation for search queries, correct?  To ensure that the initial requests 
would get distributed evenly to all nodes?

If you don't have that, and send all requests to M2S2 (IRT OP), it would be the 
only node that would ever act as controller, and it could become a bottleneck 
that further replicas won't be able to alleviate.  Correct?

Or is there something in SolrCloud itself that evenly distributes the 
controller role, regardless of which node the query initially arrives at?

-Original Message-
From: Tim Potter [mailto:tim.pot...@lucidworks.com] 
Sent: Thursday, January 09, 2014 12:28 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Cloud Query Scaling

Absolutely adding replicas helps you scale query load. Queries do not need to 
be routed to leaders; they can be handled by any replica in a shard. Leaders 
are only needed for handling update requests.

In general, a distributed query has two phases, driven by a controller node 
(what you called collator below). The controller is the Solr that received the 
query request from the client. In Phase 1, the controller distributes the query 
to one of the replicas for all shards and receives back the list of matching 
document IDs from each replica (only a page worth btw). 

The controller merges the results and sorts them to generate a final page of 
results to be returned to the client. In Phase 2, the controller collects all 
the fields from the documents to generate the final result set by querying the 
replicas involved in Phase 1.

The controller uses SolrJ's LBSolrServer to query the shards in Phase 1 so you 
get some basic load-balancing amongst replicas for a shard. I've not done any 
research to see how balanced that selection process is in production but I 
suspect if you have 3 replicas in a shard, then roughly 1/3 of the queries go 
to each.

Timothy Potter
Sr. Software Engineer, LucidWorks
www.lucidworks.com


From: Sir Gilligan 
Sent: Thursday, January 09, 2014 11:02 AM
To: solr-user@lucene.apache.org
Subject: Solr Cloud Query Scaling

Question: Does adding replicas help with query load?

Scenario: 3 Physical Machines. 3 Shards
Query any machine, get results. Standard Solr Cloud stuff.

Update Scenario: 6 Physical Machines. 3 Shards.
M = Machine, S = Shard, -L = Leader
M1S1-L
M2S2
M3S3
M4S1
M5S2-L
M6S3-L

Incoming Query to M2S2. How will Solr Cloud (4.6.0) distribute the query?
Will M2S2 handle the query for shard 2? Or, will it send it to the leader of S2 
which is M5S2?
When the query is distributed, will it send it to the other leaders? OR, will 
it send it to any shard?
Specifically:
Query sent to M2S2. Solr Cloud distributes the query. Could it possibly send 
the query on to M3S3 and M4S1? Some kind of query load balance functionality 
(maybe like a round robin to the shard members).
OR will M2S2 just be the collator, and send the query to the leaders?
OR something different that I have not described?

If queries do not have to be processed by leaders then we could add three more 
physical machines (now total 9 machines) and handle more query load.

Thank you.


Re: Invalid version (expected 2, but 60) or the data in not in 'javabin' format exception while deleting 30k records

2014-01-09 Thread gpssolr2020
Thanks. We will try with more heap.

We also noticed that ZooKeeper (OpenJDK) and Solr (Sun JDK) are using
different JVMs. Would this really cause the OOM issue?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Invalid-version-expected-2-but-60-or-the-data-in-not-in-javabin-format-exception-while-deleting-30k-s-tp4109259p4110538.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Range queries with Grouping is slow?

2014-01-09 Thread Kranti Parisa
Thank you, will take a look at it.

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Thu, Jan 9, 2014 at 10:25 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello,
>
> Here is a workaround for caching separate clauses in OR filters:
> http://blog.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html
> No coding is required; just experiment with the request parameters.
>
>
> On Wed, Jan 8, 2014 at 9:11 PM, Erick Erickson  >wrote:
>
> > Well, actually you can use fqs, it's just that re-using them becomes a
> bit
> > more tricky. Specifically,
> > fq=field1:blah OR field2:blort
> > is perfectly reasonable. However, it doesn't break things down into
> > sub-clauses, so
> > fq=field1:blah
> > will create a new entry in the filtercache. And
> > fq=field2:blort OR field1:blah
> > will not match the first one.
> >
> > It kind of depends on the query pattern whether the filtercache will be
> > re-used, you have to take care to construct the fq clauses with re-use in
> > mind if you want ORs.
> >
> > Best,
> > Erick
> >
> >
> > On Wed, Jan 8, 2014 at 11:56 AM, Kranti Parisa  > >wrote:
> >
> > > I was trying with the  [* TO *] as an example, the real use case is OR
> > > query between 2/more range queries of timestamp fields (saved in
> > > milliseconds). So I can't use FQs as they are ANDed by definition.
> > >
> > > Am I missing something here?
> > >
> > >
> > >
> > >
> > > Thanks,
> > > Kranti K. Parisa
> > > http://www.linkedin.com/in/krantiparisa
> > >
> > >
> > >
> > > On Wed, Jan 8, 2014 at 8:15 AM, Joel Bernstein 
> > wrote:
> > >
> > > > Kranti,
> > > >
> > > > The range query also looks like a good candidate to be moved to a
> > filter
> > > > query so it can be cached.
> > > >
> > > > Joel Bernstein
> > > > Search Engineer at Heliosearch
> > > >
> > > >
> > > > On Tue, Jan 7, 2014 at 11:34 PM, Smiley, David W.  >
> > > > wrote:
> > > >
> > > > > Kranti,
> > > > >
> > > > > I can't speak to the specific slow-down while grouping, but if you
> > > expect
> > > > > to run [* TO *] queries with any frequency then you should index a
> > > > boolean
> > > > > flag and query for that instead.  You might also reduce the
> > > precisionStep
> > > > > value for the field you are using to 6 or even 4.  But wow that's a
> > big
> > > > > difference you noted; it wouldn't hurt to double-check with the
> > > debugger
> > > > > that the [* TO *] is treated as a numeric range query instead of a
> > > > generic
> > > > > term range.
> > > > >
> > > > > ~ David
> > > > > 
> > > > > From: Kranti Parisa [kranti.par...@gmail.com]
> > > > > Sent: Tuesday, January 07, 2014 10:26 PM
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Range queries with Grouping is slow?
> > > > >
> > > > > Is there any known issue with Range queries + grouping?
> > > > >
> > > > > Case1:
> > > > > q=id:123&group=true&sort=price
> > > > > asc&group.field=entityId&group.limit=2&group.ngroups=true
> > > > >
> > > > > Case2:
> > > > > q=id:123 AND price:[* TO *]&group=true&sort=price
> > > > > asc&group.field=entityId&group.limit=2&group.ngroups=true
> > > > >
> > > > > Index Size:10M/~5GB
> > > > > After running both queries at least once, I was expecting to hit
> the
> > > > query
> > > > > caches and response should be quick enough, but
> > > > > Case1: 15-20ms (looks fine)
> > > > > Case2: 400+ms (this seems constantly >400ms even after the first
> > query)
> > > > >
> > > > > any thought? if it's a known issue, please point me to the jira
> link
> > > > > otherwise I can open an issue if this needs some analysis?
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Kranti K. Parisa
> > > > > http://www.linkedin.com/in/krantiparisa
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
>  
>


Re: Index size - to determine storage

2014-01-09 Thread Michael Della Bitta
Hi Amit,

It really boils down to how much of that 100KB is actually text, and how
you analyze and store the text. Meaning, it's really hard for us to say.
You're probably going to need to experiment to figure out what the storage
needs for your use case are.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions
w: appinions.com 


On Thu, Jan 9, 2014 at 1:55 PM, Amit Jha  wrote:

> Hi,
>
> I would like to know: if I index a file, i.e. a PDF of 100KB, what would be
> the size of the index? What factors should be considered to determine the
> disk size?
>
> Rgds
> AJ


Index size - to determine storage

2014-01-09 Thread Amit Jha
Hi,

I would like to know: if I index a file, i.e. a PDF of 100KB, what would be
the size of the index? What factors should be considered to determine the
disk size?

Rgds
AJ

Return only distinct combinations of 2 field values

2014-01-09 Thread PeterKerk
I'm searching on cities and returning city and province; some cities exist in
different provinces, which is OK.
However, I have some duplicates, meaning the same city occurs twice in the
same province. In that case I only want to return 1 result.
I therefore need a distinct, unique city+province combination.

How can I make sure that only unique city+province combinations are returned
by my query?

http://localhost:8983/solr/tt-cities/select/?indent=off&facet=false&fl=id,title,provincetitle_nl&q=*:*&defType=lucene&start=0&rows=15

The respective fields are title and provincetitle_nl. Below is my schema.xml:

[the schema.xml field definitions did not survive the archive]

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Return-only-distinct-combinations-of-2-field-values-tp4110521.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Solr Cloud Query Scaling

2014-01-09 Thread Tim Potter
Absolutely adding replicas helps you scale query load. Queries do not need to 
be routed to leaders; they can be handled by any replica in a shard. Leaders 
are only needed for handling update requests.

In general, a distributed query has two phases, driven by a controller node 
(what you called collator below). The controller is the Solr that received the 
query request from the client. In Phase 1, the controller distributes the query 
to one of the replicas for all shards and receives back the list of matching 
document IDs from each replica (only a page worth btw). 

The controller merges the results and sorts them to generate a final page of 
results to be returned to the client. In Phase 2, the controller collects all 
the fields from the documents to generate the final result set by querying the 
replicas involved in Phase 1.

The controller uses SolrJ's LBSolrServer to query the shards in Phase 1 so you 
get some basic load-balancing amongst replicas for a shard. I've not done any 
research to see how balanced that selection process is in production but I 
suspect if you have 3 replicas in a shard, then roughly 1/3 of the queries go 
to each.
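(A quick way to observe this from the client side is to add shards.info=true
to a query; the response then reports which replica served each shard, with
timing info. The host and collection below are illustrative:

http://host:8983/solr/collection1/select?q=*:*&shards.info=true
)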

Timothy Potter
Sr. Software Engineer, LucidWorks
www.lucidworks.com


From: Sir Gilligan 
Sent: Thursday, January 09, 2014 11:02 AM
To: solr-user@lucene.apache.org
Subject: Solr Cloud Query Scaling

Question: Does adding replicas help with query load?

Scenario: 3 Physical Machines. 3 Shards
Query any machine, get results. Standard Solr Cloud stuff.

Update Scenario: 6 Physical Machines. 3 Shards.
M = Machine, S = Shard, -L = Leader
M1S1-L
M2S2
M3S3
M4S1
M5S2-L
M6S3-L

Incoming Query to M2S2. How will Solr Cloud (4.6.0) distribute the query?
Will M2S2 handle the query for shard 2? Or, will it send it to the
leader of S2 which is M5S2?
When the query is distributed, will it send it to the other leaders? OR,
will it send it to any shard?
Specifically:
Query sent to M2S2. Solr Cloud distributes the query. Could it possibly
send the query on to M3S3 and M4S1? Some kind of query load balance
functionality (maybe like a round robin to the shard members).
OR will M2S2 just be the collator, and send the query to the leaders?
OR something different that I have not described?

If queries do not have to be processed by leaders then we could add
three more physical machines (now total 9 machines) and handle more
query load.

Thank you.


Solr Cloud Query Scaling

2014-01-09 Thread Sir Gilligan

Question: Does adding replicas help with query load?

Scenario: 3 Physical Machines. 3 Shards
Query any machine, get results. Standard Solr Cloud stuff.

Update Scenario: 6 Physical Machines. 3 Shards.
M = Machine, S = Shard, -L = Leader
M1S1-L
M2S2
M3S3
M4S1
M5S2-L
M6S3-L

Incoming Query to M2S2. How will Solr Cloud (4.6.0) distribute the query?
Will M2S2 handle the query for shard 2? Or, will it send it to the 
leader of S2 which is M5S2?
When the query is distributed, will it send it to the other leaders? OR, 
will it send it to any shard?

Specifically:
Query sent to M2S2. Solr Cloud distributes the query. Could it possibly 
send the query on to M3S3 and M4S1? Some kind of query load balance 
functionality (maybe like a round robin to the shard members).

OR will M2S2 just be the collator, and send the query to the leaders?
OR something different that I have not described?

If queries do not have to be processed by leaders then we could add 
three more physical machines (now total 9 machines) and handle more 
query load.


Thank you.


Re: Searchquery on field that contains space

2014-01-09 Thread PeterKerk
Hi Ahmet,

Thanks, also for that link, although it's too advanced for my use case.

I see that by using KeywordTokenizerFactory it almost works now, but when I
search on "new y", no results are found, whereas when I search on "new", I do
get "New York".

So the space in the search query is still causing problems; what could the
cause be?

Thanks again!

PS: are you guys (you, Erick, Maurice, etc.) also active on
StackOverflow? At least there you'd get the credit for good support :)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searchquery-on-field-that-contains-space-tp4110166p4110515.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr increase number of digits that tint fields can store

2014-01-09 Thread Hakim Benoudjit
Thanks that's the response I was searching for. And, I have confirmed that
I need to reindex my data because tlong isnt compatible with tint.


2014/1/9 Chris Hostetter 

>
> A TrieIntField field can never contain a value greater than Java's
> Integer.MAX_VALUE -- it doesn't matter what settings you use.
>
> If you want to store larger values, you need to use a TrieLongField and
> re-index.
>
>
> https://lucene.apache.org/solr/4_6_0/solr-core/org/apache/solr/schema/TrieIntField.html
>
> https://lucene.apache.org/solr/4_6_0/solr-core/org/apache/solr/schema/TrieLongField.html
>
> : Do I have to increase precisionStep=8? An error occurred when
> trying
>
> precisionStep has nothing to do with the max/min values that can be
> indexed.  precisionStep controls the amount of precision used when
> encoding additional terms to speed up range queries...
>
>
> https://lucene.apache.org/solr/4_6_0/solr-core/org/apache/solr/schema/TrieField.html
>
> https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/NumericRangeQuery.html?is-external=true
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Searchquery on field that contains space

2014-01-09 Thread Ahmet Arslan
Hi Peter,

Use KeywordTokenizerFactory instead of the whitespace tokenizer.

Also, you might be interested in this:
http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/

Ahmet



On Thursday, January 9, 2014 6:35 PM, PeterKerk  wrote:
Basically a user starts typing the first letters of a city, and I want to
return city names that start with those letters, case-insensitively and
without splitting the city name into separate words (whether the separator is
whitespace or a "-").
But although the user's search is case-insensitive, I want to return the
values including casing: a search on "new york" would return "New York",
where the latter is how it's stored in my MS-SQL DB.

I've been testing my code via the admin/analysis page.

I believe I don't want the WhitespaceTokenizerFactory on my field definition,
since that splits the city names. I want the following behavior:

query on:

"new*" returns "New york" or "newbee", but does not return values like
"greater new hampshire"
"york*" does NOT return "new york"

"nij*" returns "Nijmegen", but not "Halle-Nijman"

Here's what I have come up so far:

[the text_lower_exact fieldType definition did not survive the archive]

But when I leave out the WhitespaceTokenizerFactory I get: Plugin init
failure for [schema.xml] fieldType "text_lower_exact": analyzer without
class or tokenizer,trace=org.apache.solr.common.SolrException: SolrCore
'tt-cities' is not available due to init failure



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searchquery-on-field-that-contains-space-tp4110166p4110495.html

Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr 4.6.0: DocValues (distributed search)

2014-01-09 Thread ku3ia
Today I set up a simple SolrCloud with two shards. It seems the same: when I'm
debugging a distributed search I can't catch a breakpoint in the Lucene codec
file, but when I'm using faceted search everything looks fine and the debugger
stops.

Can anyone help me with my question? Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-6-0-DocValues-distributed-search-tp4110289p4110511.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Zookeeper as Service

2014-01-09 Thread Peter Keegan
There's also: http://www.tanukisoftware.com/


On Thu, Jan 9, 2014 at 11:18 AM, Nazik Huq  wrote:

>
>
> From your email I gather your main concern is starting ZooKeeper on server
> startup.
>
> You may want to look at these non-native, service-oriented options too:
> 1. Create a script (cmd or bat) to start ZK on server bootup. This method
> may not restart ZK if ZK crashes (rather than the server).
> 2. Create a C# command-line program that starts on server bootup (see
> above), uses the .NET System.Diagnostics.Process.Start method to start ZK
> on server start, and monitors the ZK process via a loop, restarting it when
> the ZK process crashes or "hangs". I prefer this method. There might be a
> Java equivalent of this. There are many examples available on the web.
> Cheers,
> @nazik_huq
>
>
>
> On Thursday, January 9, 2014 10:07 AM, Charlie Hull 
> wrote:
>
> On 09/01/2014 09:44, Karthikeyan.Kannappan wrote:
>
> > I am hosting on Windows.
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Zookeeper-as-Service-tp4110396p4110413.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
> There are various ways to 'servicify' (yes, that may not be an actual
> word) executable applications on Windows. The venerable SrvAny is one
> such option, as is the newer nssm.exe (Non-Sucking Service Manager).
>
> Bear in mind that a Windows Service doesn't operate quite the same way
> with regard to stdout and stderr, which may mean any error messages end
> up in a black hole, with you simply getting an unhelpful 'service
> failed to start' error message from Windows itself if something goes
> wrong. The 'working directory' is another thing that needs careful
> setting up.
>
> Cheers
>
> Charlie
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>


Re: solr increase number of digits that tint fields can store

2014-01-09 Thread Chris Hostetter

A TrieIntField field can never contain a value greater than Java's 
Integer.MAX_VALUE -- it doesn't matter what settings you use.

If you want to store larger values, you need to use a TrieLongField and 
re-index.
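For example, the stock definition from the example schema (the type name is
illustrative):

<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0"/>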

https://lucene.apache.org/solr/4_6_0/solr-core/org/apache/solr/schema/TrieIntField.html
https://lucene.apache.org/solr/4_6_0/solr-core/org/apache/solr/schema/TrieLongField.html

: Do I have to increase precisionStep=8? An error occurred when trying

precisionStep has nothing to do with the max/min values that can be 
indexed.  precisionStep controls the amount of precision used when 
encoding additional terms to speed up range queries...

https://lucene.apache.org/solr/4_6_0/solr-core/org/apache/solr/schema/TrieField.html
https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/NumericRangeQuery.html?is-external=true


-Hoss
http://www.lucidworks.com/


Re: Searchquery on field that contains space

2014-01-09 Thread PeterKerk
Basically a user starts typing the first letters of a city, and I want to
return city names that start with those letters, case-insensitively and
without splitting the city name into separate words (whether the separator is
whitespace or a "-").
But although the user's search is case-insensitive, I want to return the
values including casing: a search on "new york" would return "New York",
where the latter is how it's stored in my MS-SQL DB.

I've been testing my code via the admin/analysis page.

I believe I don't want the WhitespaceTokenizerFactory on my field definition,
since that splits the city names. I want the following behavior:

query on:

"new*" returns "New york" or "newbee", but does not return values like
"greater new hampshire"
"york*" does NOT return "new york"

"nij*" returns "Nijmegen", but not "Halle-Nijman"

Here's what I have come up so far:

[the text_lower_exact fieldType definition did not survive the archive]
But when I leave out the WhitespaceTokenizerFactory I get:  Plugin init
failure for [schema.xml] fieldType "text_lower_exact": analyzer without
class or tokenizer,trace=org.apache.solr.common.SolrException: SolrCore
'tt-cities' is not available due to init failure



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searchquery-on-field-that-contains-space-tp4110166p4110495.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Checking for similar text (duplicates)

2014-01-09 Thread Mikhail Khludnev
On Thu, Jan 9, 2014 at 5:39 PM, Cristian Bichis  wrote:

> Hi Mikhail,
>
> I have seen the deduplication part as well, but I have some concerns:
>
> 1. Is deduplication also supposed to work in a check-only request (one that
> does not actually add the new record to the index)? That is, if I just want
> to check whether there could be duplicates of some text?
>
That wiki mentions a special signature field which is added to documents; try
searching for it.


> 2. As far as I have seen, deduplication has some bottlenecks when comparing
> extremely similar items (e.g. just one character difference). I can't find
> the pages mentioning this now, but I am concerned it might not be reliable.
>
I suppose that MD5Signature is sensitive to a single-character difference,
while TextProfileSignature ignores small diffs. Try experimenting with them.

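A minimal chain along the lines of the wiki example, pointed at a description
field (an untested sketch; the chain and field names are illustrative):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">description</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>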

>
> Cristian
>
>  Hello Cristian,
>>
>> Have you seen http://wiki.apache.org/solr/Deduplication ?
>>
>>
>> On Thu, Jan 9, 2014 at 5:01 PM, Cristian Bichis  wrote:
>>
>>  Hi,
>>>
>>> I have one app where the search part is currently based on something other
>>> than Solr. However, as the scale/demand and complexity grow, I am looking
>>> at Solr as a potentially better fit, including for some features currently
>>> implemented in the scripting layer (so not in search currently). I am not
>>> very familiar with Solr at this point; I am at an early evaluation stage.
>>>
>>> One of the current app features is to detect /whether there are/ similar
>>> records in the index compared with a potential new record, and /which
>>> records those are/. In other words, to check for duplicates (which are not
>>> necessarily identical, but would be very close to the original). The
>>> comparison is made on a description field, which could contain a couple
>>> hundred words (and the words are NOT in English) for each record. Of
>>> course the comparison could be made more complex in the future, to compare
>>> 2-3 fields (a title, the description, additional keywords, etc.).
>>>
>>> Currently this feature is implemented directly in PHP using similar_text,
>>> which for us has an advantage over levenshtein because it gives a straight
>>> % match score, and we can decide if a record is a duplicate based on the %
>>> score returned by similar_text (e.g. if over an 80% match, then it is a
>>> duplicate). Having a score (a filtering limit) for each compared record
>>> helps me decide/tweak the limit I consider the dividing line between
>>> duplicates and non-duplicates (I may decide the comparison is too strict
>>> and lower the threshold to 75%).
>>>
>>> Using levenshtein (in PHP) would require additional processing, so the
>>> performance benefit would be lost to this overhead. Also, in the longer
>>> term, any PHP implementation of this feature would be a performance
>>> bottleneck, so this is not quite a solution.
>>>
>>> I am looking to move this "slow" operation into a more efficient
>>> environment; that's why I considered moving this feature into the search
>>> layer.
>>>
>>> I want to know if anyone has an efficient (working) solution based on Solr
>>> for this case. I am not sure if fuzzy search would be enough; I haven't
>>> made a test case for this (yet).
>>>
>>> Thank you,
>>> Cristian
>>>
>>>
>>
>>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: Zookeeper as Service

2014-01-09 Thread Nazik Huq


From your email I gather your main concern is starting ZooKeeper on server
startup.

You may want to look at these non-native, service-oriented options too:
1. Create a script (cmd or bat) to start ZK on server bootup. This method may
not restart ZK if ZK crashes (rather than the server).
2. Create a C# command-line program that starts on server bootup (see above),
uses the .NET System.Diagnostics.Process.Start method to start ZK on server
start, and monitors the ZK process via a loop, restarting it when the ZK
process crashes or "hangs". I prefer this method. There might be a Java
equivalent of this. There are many examples available on the web.
Cheers,
@nazik_huq



On Thursday, January 9, 2014 10:07 AM, Charlie Hull  wrote:
  
On 09/01/2014 09:44, Karthikeyan.Kannappan wrote:

> I am hosting on Windows.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Zookeeper-as-Service-tp4110396p4110413.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

There are various ways to 'servicify' (yes that may not be an actual 
word) executable applications on Windows. The venerable SrvAny is one 
such option as is the newer
 nssm.exe (Non-Sucking Service Manager).

Bear in mind that a Windows Service doesn't operate quite the same way 
with regard to stdout and stderr which may mean any error messages end 
up in a black hole, with you simply
 getting something unhelpful 'service 
failed to start' error messages from Windows itself if something goes 
wrong. The 'working directory' is another thing that needs careful 
setting up.

Cheers

Charlie

-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk

Re: How to boost documents ?

2014-01-09 Thread Anca Kopetz

Hi,

I tested the BoostQueryParser and it works on the simplified example.
But we need to keep the edismax Query parser, so I tried the following query 
and it seems to work (I defined a local bf='' for qq).

&q=beautiful Christmas tree
&mm=2
&qf=title^12 description^2
&defType=edismax
&bf=map(query($qq),0,0,0,100.0)
&qq={!edismax bf='' mm=100%}beautiful Christmas tree

Best regards,
Anca

On 01/07/2014 07:42 PM, Ahmet Arslan wrote:

Hi Hoss,

Thanks for the explanation. Very complicated stuff. I have never understood
the NestedQParserPlugin.

We want all the documents with all terms (mm=100%) to get an extra 1000
points, but change nothing else. How would you restructure the following
query?

q=beautiful Christmas tree&mm=2&qf=title^12 
description^2&defType=dismax&bf=map(query($qq),0,0,0,100.0)&qq={!dismax qf='title 
description' mm=100%}beautiful Christmas tree

Ahmet


On Tuesday, January 7, 2014 8:12 PM, Chris Hostetter 
 wrote:


: http://localhost:8983/solr/collection1/select?q=ipod
: 
belkin&wt=xml&debugQuery=true&q.op=AND&defType=edismax&bf=map(query($qq),0,0,0,100.0)&qq={!edismax}power
:
: The error is :
: org.apache.solr.search.SyntaxError: Infinite Recursion detected parsing query
: 'power'
:
: And the stacktrace :
:
: ERROR - 2014-01-06 18:27:02.275; org.apache.solr.common.SolrException;
: org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError:
: Infinite Recursion detected parsing query 'power'

your "qq" param uses the edismax parser which goes looking for a "bf" -
since there is no local bf param, it finds the global one -- which
recursively refers to the "qq" param again.  Hence the infinite recursion.

You either need to override the bf param locally in your qq param, or
restructure your query slightly so the bf is not global

Perhaps something like this...

qq=ipod belkin
q={!query defType=edismax bf=$boostFunc v=$qq}
boostFunc=map(query($boostQuery),0,0,0,100.0)
boostQuery={!edismax}power

having said that however: instead of using "bf" you should probably
consider using the "boost" parser -- it's a multiplicative boost instead
of an additive boost, and (in my opinion) makes the flow/params easier to
understand...

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BoostQueryParser

qq=ipod belkin
q={!boost defType=edismax b=$boostFunc v=$qq}
boostFunc=map(query($boostQuery),0,0,0,100.0)
boostQuery={!edismax}power



-Hoss
http://www.lucidworks.com/





Kelkoo SAS
A simplified joint-stock company (Société par Actions Simplifiée)
with a share capital of €4,168,964.30
Registered office: 8, rue du Sentier, 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively
for their addressees. If you are not the intended recipient of this message,
please delete it and notify the sender.


Re: Searchquery on field that contains space

2014-01-09 Thread PeterKerk
@Ahmet:

Thanks, but I also need to be able to search via wildcards, and I just found
that a "-" might be producing unwanted results, e.g. when using this
query:

http://localhost:8983/solr/tt-cities/select/?indent=off&facet=false&fl=id,title,provincetitle_nl&q=title_search:nij*&defType=lucene&start=0&rows=15

I also get a result for "Halle-Nijman", so it seems the wildcard is not
working, as "Halle-Nijman" does not start with "nij" (or "Nij").
I also tried:
q=title_search:(nij*)
q=title_search:(nij)*

How can I fix this?


@Erick:

When I'm on the analysis page I get the error:

"This Functionality requires the /analysis/field Handler to be registered
and active!"

So I added this line to my solr config (based on this post:
http://stackoverflow.com/questions/12627734/configure-field-analysis-handler-solr-4)

<requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy" />

But still the same error occurs.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searchquery-on-field-that-contains-space-tp4110166p4110485.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Range queries with Grouping is slow?

2014-01-09 Thread Mikhail Khludnev
Hello,

Here is a workaround for caching separate clauses in OR filters:
http://blog.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html
No coding is required; just experiment with the request parameters.


On Wed, Jan 8, 2014 at 9:11 PM, Erick Erickson wrote:

> Well, actually you can use fqs, it's just that re-using them becomes a bit
> more tricky. Specifically,
> fq=field1:blah OR field2:blort
> is perfectly reasonable. However, it doesn't break things down into
> sub-clauses, so
> fq=field1:blah
> will create a new entry in the filtercache. And
> fq=field2:blort OR field1:blah
> will not match the first one.
>
> It kind of depends on the query pattern whether the filtercache will be
> re-used, you have to take care to construct the fq clauses with re-use in
> mind if you want ORs.
>
> Best,
> Erick
>
>
> On Wed, Jan 8, 2014 at 11:56 AM, Kranti Parisa  >wrote:
>
> > I was trying with the  [* TO *] as an example, the real use case is OR
> > query between 2/more range queries of timestamp fields (saved in
> > milliseconds). So I can't use FQs as they are ANDed by definition.
> >
> > Am I missing something here?
> >
> >
> >
> >
> > Thanks,
> > Kranti K. Parisa
> > http://www.linkedin.com/in/krantiparisa
> >
> >
> >
> > On Wed, Jan 8, 2014 at 8:15 AM, Joel Bernstein 
> wrote:
> >
> > > Kranti,
> > >
> > > The range query also looks like a good candidate to be moved to a
> filter
> > > query so it can be cached.
> > >
> > > Joel Bernstein
> > > Search Engineer at Heliosearch
> > >
> > >
> > > On Tue, Jan 7, 2014 at 11:34 PM, Smiley, David W. 
> > > wrote:
> > >
> > > > Kranti,
> > > >
> > > > I can't speak to the specific slow-down while grouping, but if you
> > expect
> > > > to run [* TO *] queries with any frequency then you should index a
> > > boolean
> > > > flag and query for that instead.  You might also reduce the
> > precisionStep
> > > > value for the field you are using to 6 or even 4.  But wow that's a
> big
> > > > difference you noted; it wouldn't hurt to double-check with the
> > debugger
> > > > that the [* TO *] is treated as a numeric range query instead of a
> > > generic
> > > > term range.
> > > >
> > > > ~ David
> > > > 
> > > > From: Kranti Parisa [kranti.par...@gmail.com]
> > > > Sent: Tuesday, January 07, 2014 10:26 PM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Range queries with Grouping is slow?
> > > >
> > > > Is there any known issue with Range queries + grouping?
> > > >
> > > > Case1:
> > > > q=id:123&group=true&sort=price
> > > > asc&group.field=entityId&group.limit=2&group.ngroups=true
> > > >
> > > > Case2:
> > > > q=id:123 AND price:[* TO *]&group=true&sort=price
> > > > asc&group.field=entityId&group.limit=2&group.ngroups=true
> > > >
> > > > Index Size:10M/~5GB
> > > > After running both queries at least once, I was expecting to hit the
> > > query
> > > > caches and response should be quick enough, but
> > > > Case1: 15-20ms (looks fine)
> > > > Case2: 400+ms (this seems constantly >400ms even after the first
> query)
> > > >
> > > > any thought? if it's a known issue, please point me to the jira link
> > > > otherwise I can open an issue if this needs some analysis?
> > > >
> > > >
> > > > Thanks,
> > > > Kranti K. Parisa
> > > > http://www.linkedin.com/in/krantiparisa
> > > >
> > >
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: Range queries with Grouping is slow?

2014-01-09 Thread Smiley, David W.
It won't hit the filter cache if you set the {!cache=false} local-param.
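For example: fq={!cache=false}price:[* TO *]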

On 1/8/14, 12:18 PM, "Kranti Parisa"  wrote:

>yes thats the key, these time ranges change frequently and hitting
>filtercache then is a problem. I will try few more samples and probably
>debug thru it. thanks.
>
>
>Thanks,
>Kranti K. Parisa
>http://www.linkedin.com/in/krantiparisa
>
>
>
>On Wed, Jan 8, 2014 at 12:11 PM, Erick Erickson
>wrote:
>
>> Well, actually you can use fqs, it's just that re-using them becomes a
>>bit
>> more tricky. Specifically,
>> fq=field1:blah OR field2:blort
>> is perfectly reasonable. However, it doesn't break things down into
>> sub-clauses, so
>> fq=field1:blah
>> will create a new entry in the filtercache. And
>> fq=field2:blort OR field1:blah
>> will not match the first one.
>>
>> It kind of depends on the query pattern whether the filtercache will be
>> re-used, you have to take care to construct the fq clauses with re-use
>>in
>> mind if you want ORs.
>>
>> Best,
>> Erick
>>
>>
>> On Wed, Jan 8, 2014 at 11:56 AM, Kranti Parisa > >wrote:
>>
>> > I was trying with the  [* TO *] as an example, the real use case is OR
>> > query between 2/more range queries of timestamp fields (saved in
>> > milliseconds). So I can't use FQs as they are ANDed by definition.
>> >
>> > Am I missing something here?
>> >
>> >
>> >
>> >
>> > Thanks,
>> > Kranti K. Parisa
>> > http://www.linkedin.com/in/krantiparisa
>> >
>> >
>> >
>> > On Wed, Jan 8, 2014 at 8:15 AM, Joel Bernstein 
>> wrote:
>> >
>> > > Kranti,
>> > >
>> > > The range query also looks like a good candidate to be moved to a
>> filter
>> > > query so it can be cached.
>> > >
>> > > Joel Bernstein
>> > > Search Engineer at Heliosearch
>> > >
>> > >
>> > > On Tue, Jan 7, 2014 at 11:34 PM, Smiley, David W.
>>
>> > > wrote:
>> > >
>> > > > Kranti,
>> > > >
>> > > > I can't speak to the specific slow-down while grouping, but if you
>> > expect
>> > > > to run [* TO *] queries with any frequency then you should index a
>> > > boolean
>> > > > flag and query for that instead.  You might also reduce the
>> > precisionStep
>> > > > value for the field you are using to 6 or even 4.  But wow that's
>>a
>> big
>> > > > difference you noted; it wouldn't hurt to double-check with the
>> > debugger
>> > > > that the [* TO *] is treated as a numeric range query instead of a
>> > > generic
>> > > > term range.
>> > > >
>> > > > ~ David
>> > > > 
>> > > > From: Kranti Parisa [kranti.par...@gmail.com]
>> > > > Sent: Tuesday, January 07, 2014 10:26 PM
>> > > > To: solr-user@lucene.apache.org
>> > > > Subject: Range queries with Grouping is slow?
>> > > >
>> > > > Is there any known issue with Range queries + grouping?
>> > > >
>> > > > Case1:
>> > > > q=id:123&group=true&sort=price
>> > > > asc&group.field=entityId&group.limit=2&group.ngroups=true
>> > > >
>> > > > Case2:
>> > > > q=id:123 AND price:[* TO *]&group=true&sort=price
>> > > > asc&group.field=entityId&group.limit=2&group.ngroups=true
>> > > >
>> > > > Index Size:10M/~5GB
>> > > > After running both queries at least once, I was expecting to hit
>>the
>> > > query
>> > > > caches and response should be quick enough, but
>> > > > Case1: 15-20ms (looks fine)
>> > > > Case2: 400+ms (this seems constantly >400ms even after the first
>> query)
>> > > >
>> > > > any thought? if it's a known issue, please point me to the jira
>>link
>> > > > otherwise I can open an issue if this needs some analysis?
>> > > >
>> > > >
>> > > > Thanks,
>> > > > Kranti K. Parisa
>> > > > http://www.linkedin.com/in/krantiparisa
>> > > >
>> > >
>> >
>>



Re: Zookeeper as Service

2014-01-09 Thread Charlie Hull

On 09/01/2014 09:44, Karthikeyan.Kannappan wrote:

I am hosting on Windows.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Zookeeper-as-Service-tp4110396p4110413.html
Sent from the Solr - User mailing list archive at Nabble.com.



There are various ways to 'servicify' (yes, that may not be an actual 
word) executable applications on Windows. The venerable SrvAny is one 
such option, as is the newer nssm.exe (Non-Sucking Service Manager).
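(Installing ZooKeeper under nssm looks roughly like this; the service name and
path are illustrative:

nssm install zookeeper C:\zookeeper\bin\zkServer.cmd
)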


Bear in mind that a Windows Service doesn't operate quite the same way 
with regard to stdout and stderr, which may mean any error messages end 
up in a black hole, with you simply getting an unhelpful 'service 
failed to start' error message from Windows itself if something goes 
wrong. The 'working directory' is another thing that needs careful 
setting up.


Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


solr increase number of digits that tint fields can store

2014-01-09 Thread Hakim Benoudjit
Hi,

I have a price field of type tint, from which I will generate a range facet.
I now have some items in my index that exceed the tint type's limit (max
integer).
How do I increase tint's max integer value?

Here is the tint definition in schema.xml:

<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0"/>

Do I have to increase precisionStep=8? An error occurred when trying
to insert a number with 10 digits.


Re: Checking for similar text (duplicates)

2014-01-09 Thread Cristian Bichis

Hi Mikhail,

I have seen the deduplication part as well, but I have some concerns:

1. Is deduplication also supposed to work in a check-only request (one that
does not actually add the new record to the index)? That is, if I just want
to check whether there could be duplicates of some text?


2. As far as I have seen, deduplication has some bottlenecks when comparing
extremely similar items (e.g. just one character difference). I can't find
the pages mentioning this now, but I am concerned it might not be reliable.


Cristian

Hello Cristian,

Have you seen http://wiki.apache.org/solr/Deduplication ?


On Thu, Jan 9, 2014 at 5:01 PM, Cristian Bichis  wrote:


Hi,

I have one app where the search part is currently based on something other
than Solr. However, as the scale/demand and complexity grow, I am looking
at Solr as a potentially better fit, including for some features currently
implemented in the scripting layer (so not in search currently). I am not
very familiar with Solr at this point; I am at an early evaluation stage.

One of the current app features is to detect /whether there are/ similar
records in the index compared with a potential new record, and /which
records those are/. In other words, to check for duplicates (which are not
necessarily identical, but would be very close to the original). The
comparison is made on a description field, which could contain a couple
hundred words (and the words are NOT in English) for each record. Of course
the comparison could be made more complex in the future, to compare 2-3
fields (a title, the description, additional keywords, etc.).

Currently this feature is implemented directly in PHP using similar_text,
which for us has an advantage over levenshtein because it gives a straight
% match score, and we can decide if a record is a duplicate based on the %
score returned by similar_text (e.g. if over an 80% match, then it is a
duplicate). Having a score (a filtering limit) for each compared record
helps me decide/tweak the limit I consider the dividing line between
duplicates and non-duplicates (I may decide the comparison is too strict
and lower the threshold to 75%).

Using levenshtein (in PHP) would require additional processing, so the
performance benefit would be lost to this overhead. Also, in the longer
term, any PHP implementation of this feature would be a performance
bottleneck, so this is not quite a solution.

I am looking to move this "slow" operation into a more efficient
environment; that's why I considered moving this feature into the search
layer.

I want to know if anyone has an efficient (working) solution based on Solr
for this case. I am not sure if fuzzy search would be enough; I haven't
made a test case for this (yet).

Thank you,
Cristian








Re: solr text analysis showing a red bar error

2014-01-09 Thread Aruna Kumar Pamulapati
See if this helps:

https://groups.google.com/forum/#!topic/lily-discuss/IaQLpNVJRi8


On Thu, Jan 9, 2014 at 8:33 AM, Umapathy S  wrote:

> I checked that before.  I am using solr-4.6.0.  maxFieldLength is not
> applicable.
>
>
> On 9 January 2014 13:23, Aruna Kumar Pamulapati  >wrote:
>
> > If you are using a Solr version before 4.0, you should look into this in
> >
> > solrconfig.xml:
> >
> >   <maxFieldLength>...</maxFieldLength>
> >
> >
> > What is your solr version?
> >
> >
> >
> > On Thu, Jan 9, 2014 at 8:16 AM, Aruna Kumar Pamulapati <
> > apamulap...@gmail.com> wrote:
> >
> > > Thanks, can you paste the text that you were trying to analyze?
> > >
> > >
> > > On Thu, Jan 9, 2014 at 8:10 AM, Umapathy S 
> wrote:
> > >
> > >> Thanks.
> > >>
> > >> Actually there is no error thrown.  Just a red bar appears on top.
> > >> I have pasted it on http://snag.gy/U9IiJ.jpg
> > >>
> > >>
> > >> On 9 January 2014 12:56, Aruna Kumar Pamulapati <
> apamulap...@gmail.com
> > >> >wrote:
> > >>
> > >> > Can you copy paste the error, for some reason I can not see the
> image
> > of
> > >> > the screenshot you posted.
> > >> >
> > >> >
> > >> > On Thu, Jan 9, 2014 at 7:52 AM, Umapathy S 
> > wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I am new to Solr/Lucene.
> > >> > > I am trying to do a text analysis on my index.  The error below
> > >> > > (screenshot) is shown when I increase the field value length.  I have
> > >> > > tried searching in vain for any length-specific restrictions in
> > >> > > solr.TextField.  There is no error text/exception thrown.
> > >> > >
> > >> > > [image: Inline images 1]
> > >> > >
> > >> > > The field is below
> > >> > > <field name="..." type="..." indexed="true"
> > >> > >  />
> > >> > >
> > >> > > fieldtype is
> > >> > >
> > >> > > <fieldType name="..." class="solr.TextField"
> > >> > > positionIncrementGap="100">
> > >> > >   <analyzer type="index">
> > >> > >     <tokenizer class="..."/>
> > >> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >> > > words="stopwords.txt" enablePositionIncrements="true" />
> > >> > >     <filter class="..."/>
> > >> > >   </analyzer>
> > >> > >   <analyzer type="query">
> > >> > >     <tokenizer class="..."/>
> > >> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >> > > words="stopwords.txt" enablePositionIncrements="true" />
> > >> > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > >> > > ignoreCase="true" expand="true"/>
> > >> > >     <filter class="..."/>
> > >> > >   </analyzer>
> > >> > > </fieldType>
> > >> > >
> > >> > >
> > >> > > Any help much appreciated.
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > > Umapathy
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>


Re: solr text analysis showing a red bar error

2014-01-09 Thread Umapathy S
I checked that before.  I am using solr-4.6.0.  maxFieldLength is not
applicable.


On 9 January 2014 13:23, Aruna Kumar Pamulapati wrote:

> If you are using a Solr version before 4.0, you should look into this in
>
> solrconfig.xml:
>
>   <maxFieldLength>...</maxFieldLength>
>
>
> What is your solr version?
>
>
>
> On Thu, Jan 9, 2014 at 8:16 AM, Aruna Kumar Pamulapati <
> apamulap...@gmail.com> wrote:
>
> > Thanks, can you paste the text that you were trying to analyze?
> >
> >
> > On Thu, Jan 9, 2014 at 8:10 AM, Umapathy S  wrote:
> >
> >> Thanks.
> >>
> >> Actually there is no error thrown.  Just a red bar appears on top.
> >> I have pasted it on http://snag.gy/U9IiJ.jpg
> >>
> >>
> >> On 9 January 2014 12:56, Aruna Kumar Pamulapati  >> >wrote:
> >>
> >> > Can you copy/paste the error? For some reason I can not see the image
> >> > of the screenshot you posted.
> >> >
> >> >
> >> > On Thu, Jan 9, 2014 at 7:52 AM, Umapathy S 
> wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > I am new to Solr/Lucene.
> >> > > I am trying to do a text analysis on my index.  The error below
> >> > > (screenshot) is shown when I increase the field value length.  I have
> >> > > tried searching in vain for any length-specific restrictions in
> >> > > solr.TextField.  There is no error text/exception thrown.
> >> > >
> >> > > [image: Inline images 1]
> >> > >
> >> > > The field is below
> >> > > <field name="..." type="..." indexed="true"
> >> > >  />
> >> > >
> >> > > fieldtype is
> >> > >
> >> > > <fieldType name="..." class="solr.TextField"
> >> > > positionIncrementGap="100">
> >> > >   <analyzer type="index">
> >> > >     <tokenizer class="..."/>
> >> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> > > words="stopwords.txt" enablePositionIncrements="true" />
> >> > >     <filter class="..."/>
> >> > >   </analyzer>
> >> > >   <analyzer type="query">
> >> > >     <tokenizer class="..."/>
> >> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> > > words="stopwords.txt" enablePositionIncrements="true" />
> >> > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >> > > ignoreCase="true" expand="true"/>
> >> > >     <filter class="..."/>
> >> > >   </analyzer>
> >> > > </fieldType>
> >> > >
> >> > >
> >> > > Any help much appreciated.
> >> > >
> >> > > Thanks
> >> > >
> >> > > Umapathy
> >> > >
> >> >
> >>
> >
> >
>


Re: Checking for similar text (duplicates)

2014-01-09 Thread Mikhail Khludnev
Hello Cristian,

Have you seen http://wiki.apache.org/solr/Deduplication ?


On Thu, Jan 9, 2014 at 5:01 PM, Cristian Bichis  wrote:

> Hi,
>
> I have one app where the search part is currently based on something other
> than Solr. However, as the scale/demand and complexity grow, I am looking
> at Solr as a potentially better fit, including for some features currently
> implemented in the scripting layer (so not in search currently). I am not
> very familiar with Solr at this point; I am at an early evaluation stage.
>
> One of the current app features is to detect /whether there are/ similar
> records in the index compared with a potential new record, and /which
> records those are/. In other words, to check for duplicates (which are not
> necessarily identical, but would be very close to the original). The
> comparison is made on a description field, which could contain a couple
> hundred words (and the words are NOT in English) for each record. Of course
> the comparison could be made more complex in the future, to compare 2-3
> fields (a title, the description, additional keywords, etc.).
>
> Currently this feature is implemented directly in PHP using similar_text,
> which for us has an advantage over levenshtein because it gives a straight
> % match score, and we can decide if a record is a duplicate based on the %
> score returned by similar_text (e.g. if over an 80% match, then it is a
> duplicate). Having a score (a filtering limit) for each compared record
> helps me decide/tweak the limit I consider the dividing line between
> duplicates and non-duplicates (I may decide the comparison is too strict
> and lower the threshold to 75%).
>
> Using levenshtein (in PHP) would require additional processing, so the
> performance benefit would be lost to this overhead. Also, in the longer
> term, any PHP implementation of this feature would be a performance
> bottleneck, so this is not quite a solution.
>
> I am looking to move this "slow" operation into a more efficient
> environment; that's why I considered moving this feature into the search
> layer.
>
> I want to know if anyone has an efficient (working) solution based on Solr
> for this case. I am not sure if fuzzy search would be enough; I haven't
> made a test case for this (yet).
>
> Thank you,
> Cristian
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: solr text analysis showing a red bar error

2014-01-09 Thread Aruna Kumar Pamulapati
If you are using a Solr version before 4.0, you should look into:

solrconfig.xml:

  <maxFieldLength>1</maxFieldLength>


What is your solr version?



On Thu, Jan 9, 2014 at 8:16 AM, Aruna Kumar Pamulapati <
apamulap...@gmail.com> wrote:

> Thanks, can you paste the text that you were trying to analyze?
>
>
> On Thu, Jan 9, 2014 at 8:10 AM, Umapathy S  wrote:
>
>> Thanks.
>>
>> Actually there is no error thrown.  Just a red bar appears on top.
>> I have pasted it on http://snag.gy/U9IiJ.jpg
>>
>>
>> On 9 January 2014 12:56, Aruna Kumar Pamulapati wrote:
>>
>> > Can you copy and paste the error? For some reason I cannot see the
>> > image of the screenshot you posted.
>> >
>> >
>> > On Thu, Jan 9, 2014 at 7:52 AM, Umapathy S  wrote:
>> >
>> > > Hi,
>> > >
>> > > I am new to Solr/Lucene.
>> > > I am trying to do a text analysis on my index.  The error below
>> > > (screenshot) is shown when I increase the field value length.  I have
>> > > searched in vain for any length-specific restrictions in
>> > > solr.TextField.  No error text or exception is thrown.
>> > >
>> > > [image: Inline images 1]
>> > >
>> > > The field is below
>> > > <field name="..." type="..." indexed="true" />
>> > >
>> > > fieldtype is
>> > >
>> > > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>> > >   <analyzer type="index">
>> > >     <tokenizer class="..."/>
>> > >     <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true" />
>> > >     <filter class="..."/>
>> > >   </analyzer>
>> > >   <analyzer type="query">
>> > >     <tokenizer class="..."/>
>> > >     <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true" />
>> > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>> > >     <filter class="..."/>
>> > >   </analyzer>
>> > > </fieldType>
>> > >
>> > >
>> > > Any help much appreciated.
>> > >
>> > > Thanks
>> > >
>> > > Umapathy
>> > >
>> >
>>
>
>


Re: solr text analysis showing a red bar error

2014-01-09 Thread Aruna Kumar Pamulapati
Thanks, can you paste the text that you were trying to analyze?


On Thu, Jan 9, 2014 at 8:10 AM, Umapathy S  wrote:

> Thanks.
>
> Actually there is no error thrown.  Just a red bar appears on top.
> I have pasted it on http://snag.gy/U9IiJ.jpg
>
>
> On 9 January 2014 12:56, Aruna Kumar Pamulapati wrote:
>
> > Can you copy and paste the error? For some reason I cannot see the
> > image of the screenshot you posted.
> >
> >
> > On Thu, Jan 9, 2014 at 7:52 AM, Umapathy S  wrote:
> >
> > > Hi,
> > >
> > > I am new to Solr/Lucene.
> > > I am trying to do a text analysis on my index.  The error below
> > > (screenshot) is shown when I increase the field value length.  I have
> > > searched in vain for any length-specific restrictions in
> > > solr.TextField.  No error text or exception is thrown.
> > >
> > > [image: Inline images 1]
> > >
> > > The field is below
> > > <field name="..." type="..." indexed="true" />
> > >
> > > fieldtype is
> > >
> > > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="..."/>
> > >     <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true" />
> > >     <filter class="..."/>
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="..."/>
> > >     <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true" />
> > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > >     <filter class="..."/>
> > >   </analyzer>
> > > </fieldType>
> > >
> > >
> > > Any help much appreciated.
> > >
> > > Thanks
> > >
> > > Umapathy
> > >
> >
>


Re: solr text analysis showing a red bar error

2014-01-09 Thread Umapathy S
Thanks.

Actually there is no error thrown.  Just a red bar appears on top.
I have pasted it on http://snag.gy/U9IiJ.jpg


On 9 January 2014 12:56, Aruna Kumar Pamulapati wrote:

> Can you copy and paste the error? For some reason I cannot see the image
> of the screenshot you posted.
>
>
> On Thu, Jan 9, 2014 at 7:52 AM, Umapathy S  wrote:
>
> > Hi,
> >
> > I am new to Solr/Lucene.
> > I am trying to do a text analysis on my index.  The error below
> > (screenshot) is shown when I increase the field value length.  I have
> > searched in vain for any length-specific restrictions in solr.TextField.
> > No error text or exception is thrown.
> >
> > [image: Inline images 1]
> >
> > The field is below
> > <field name="..." type="..." indexed="true" />
> >
> > fieldtype is
> >
> > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="..."/>
> >     <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true" />
> >     <filter class="..."/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="..."/>
> >     <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true" />
> >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >     <filter class="..."/>
> >   </analyzer>
> > </fieldType>
> >
> >
> > Any help much appreciated.
> >
> > Thanks
> >
> > Umapathy
> >
>


Checking for similar text (duplicates)

2014-01-09 Thread Cristian Bichis

Hi,

I have one app where the search part is currently based on something other
than Solr. However, as scale, demand, and complexity grow, I am looking at
Solr as a potentially better fit, including for some features currently
implemented in the scripting layer (so not in search at the moment). I am
not very familiar with Solr at this point; I am at an early evaluation
stage.


One of the current app features is to detect /whether/ the index contains
records similar to a potential new record, and /which/ records those are.
In other words, to check for duplicates (which are not necessarily
identical but would be very close to the original). The comparison is made
on a description field, which can contain a couple hundred words (and the
words are NOT in English) for each record. Of course, the comparison could
be made more complex in the future, to compare 2-3 fields (a title, the
description, additional keywords, etc).


Currently this feature is implemented directly in PHP using similar_text,
which for us has an advantage over levenshtein because it gives a straight
% match score, and we can decide whether a record is a duplicate based on
the % score returned by similar_text (e.g. if the match is over 80%, it is
a duplicate). Having a score (filtering limit) for each compared record
helps me decide and tweak the threshold I consider the dividing line
between duplicates and non-duplicates (I may decide the comparison is too
strict and lower the threshold to 75%).


Using levenshtein (in PHP) would require additional processing, so the
performance benefit would be lost to this overhead. As well, in the longer
term any PHP implementation of this feature would be a performance
bottleneck, so this is not quite a solution.


I am looking to move this "slow" operation into a more efficient
environment, which is why I considered moving this feature into the search
layer.


I want to know if anyone has an efficient (working) solution based on Solr
for this case. I am not sure fuzzy search would be enough; I haven't made a
test case for this (yet).


Thank you,
Cristian
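
As a first experiment with that last idea, Solr 4.x supports a per-term
edit-distance suffix in the standard query syntax; a hypothetical query
against an assumed "description" field (the field name and terms are
illustrative, and ~2 allows up to two edits per term):

  q=description:(word1~2 word2~2)

Note that this matches term by term; it does not by itself give a
document-level % similarity score like similar_text.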


Re: solr text analysis showing a red bar error

2014-01-09 Thread Aruna Kumar Pamulapati
Can you copy and paste the error? For some reason I cannot see the image of
the screenshot you posted.


On Thu, Jan 9, 2014 at 7:52 AM, Umapathy S  wrote:

> Hi,
>
> I am new to Solr/Lucene.
> I am trying to do a text analysis on my index.  The error below
> (screenshot) is shown when I increase the field value length.  I have
> searched in vain for any length-specific restrictions in solr.TextField.
> No error text or exception is thrown.
>
> [image: Inline images 1]
>
> The field is below
> <field name="..." type="..." indexed="true" />
>
> fieldtype is
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="..."/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true" />
>     <filter class="..."/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="..."/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="..."/>
>   </analyzer>
> </fieldType>
>
>
> Any help much appreciated.
>
> Thanks
>
> Umapathy
>


solr text analysis showing a red bar error

2014-01-09 Thread Umapathy S
Hi,

I am new to Solr/Lucene.
I am trying to do a text analysis on my index.  The error below
(screenshot) is shown when I increase the field value length.  I have
searched in vain for any length-specific restrictions in solr.TextField.
No error text or exception is thrown.

[image: Inline images 1]

The field is below
<field name="..." type="..." indexed="true" />

fieldtype is

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="..."/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="..."/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="..."/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="..."/>
  </analyzer>
</fieldType>


Any help much appreciated.

Thanks

Umapathy


Re: PeerSync Recovery fails, starting Replication Recovery

2014-01-09 Thread Anca Kopetz

Hi,

We tried to understand why we get a "Connection reset" exception on the leader 
when it tries to forward the documents to one of its replicas. We analyzed the GC 
logs and we did not see any long GC pauses around the time the exception was thrown.

Over 24 hours of GC logs, the maximum full GC pause was 3 seconds, and it did not 
coincide with the "Connection reset" exception.

My question again:

Why does the leader not retry to forward the documents to the replica when it gets an 
IOException in SolrCmdDistributor? Instead, it sends a "recovery" request to 
the replica.

Solr version is 4.5.1 (we tried 4.6.0, but we had some problems with this version, 
detailed in the mail "SolrCloud 4.6.0 - leader election issue").

Regarding the "Connection reset" exception, are there any other tests that we can 
do in order to find its cause?

Thank you,
Anca
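
One knob that is often suggested for the "too many updates received since
start" PeerSync failure is the size of the transaction log window that
PeerSync draws from; a sketch for solrconfig.xml, where 500 is an
illustrative value and you should verify that your Solr release exposes the
setting:

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numRecordsToKeep">500</int>
</updateLog>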

On 12/20/2013 11:07 AM, Anca Kopetz wrote:

Hi,

We used to have many "Client session timeout" messages in solr logs.

INFO org.apache.zookeeper.ClientCnxn:run:1083 - Client session timed out, have 
not heard from server in 18461ms for sessionid 0x242047fc6d77804, closing 
socket connection and attempting reconnect

Then we set zkClientTimeout to 30 seconds, and there are no more messages of 
this type in the logs.
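
For reference, in old-style 4.x solr.xml this timeout is an attribute on the
<cores> element and also honors a system property; a sketch, with the other
attributes elided:

  <cores adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:30000}" ...>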

But we get some other messages :

Leader logs (solr-39):

2013-12-18 09:45:26,052 [http-8083-74] ERROR 
org.apache.solr.update.SolrCmdDistributor:log:119  - shard update error 
StdNode: 
http://solr-40/searchsolrnodees/es_blue/:org.apache.solr.client.solrj.SolrServerException:
 IOException occured when talking to server at: 
http://solr-40/searchsolrnodees/es_blue
...
Caused by: java.net.SocketException: Connection reset

2013-12-18 09:45:26,060 [http-8083-74] INFO  
org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  - Creating new http 
client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2013-12-18 09:45:26,140 [http-8083-49] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleWaitForStateAction:819  - 
Going to wait for coreNodeName: solr-40_searchsolrnodees_es_blue, state: 
recovering, checkLive: true, onlyIfLeader: true

Replica logs (solr-40) :

2013-12-18 09:45:26,083 [http-8083-65] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705  
- It has been requested that we recover
2013-12-18 09:45:26,091 [http-8083-65] INFO  
org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] webapp=null 
path=/admin/cores 
params={action=REQUESTRECOVERY&core=es_blue&wt=javabin&version=2} status=0 
QTime=8
...
2013-12-18 09:45:29,190 [RecoveryThread] WARN  
org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=es_blue 
url=http://dc2-s6-prod-solr-40.prod.dc2.kelkoo.net:8083/searchsolrnodees too 
many updates received since start - startingUpdates no longer overlaps with our 
currentUpdates
2013-12-18 09:45:29,191 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery was 
not successful - trying replication. core=es_blue
2013-12-18 09:45:29,191 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting Replication 
Recovery. core=es_blue

Therefore, if I understand it right, the leader does not manage to forward the documents 
to the replica due to a "Connection reset" problem, and it asks the replica to 
recover. The replica tries to recover, fails, and starts a replication recovery.

Why does the leader not retry to forward the documents to the replica when it 
gets an IOException in SolrCmdDistributor ?

Solr version is 4.5.1

For the Garbage Collector, we use the settings defined here 
http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
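
In the spirit of that page, these are concurrent-collector JVM flags along
the following lines (illustrative values, not a sizing recommendation):

  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly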

Thank you,
Anca

On 12/19/2013 04:50 PM, Mark Miller wrote:

Sounds like you need to raise your ZooKeeper connection timeout.

Also, as a side note, make sure you are using a concurrent garbage collector; 
stop-the-world pauses should be avoided. Just good advice :)

- Mark

On Dec 18, 2013, at 5:48 AM, Anca Kopetz wrote:



Hi,

In our SolrCloud cluster (2 shards, 8 replicas), the replicas go into recovering 
state from time to time, and it takes them more than 10 minutes to finish 
recovering.

In logs, we see that "PeerSync Recovery" fails with the message :

PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates 
received since start - startingUpdates no longer overlaps with our 
currentUpdates

Then "Replication Recovery" starts.

Is there something we can do to avoid the failure of "Peer Recovery" so that 
the recovery process is faster (less than 10 minutes)?

The full trace log is here:

2013-12-05 13:51:53,740 [http-8080-46] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705  
- It has been requested that we recover
2013-12-05 13:51:53,740 [http-8080-112] INFO  

Re: Shard splitting error: cannot uncache file="_1.nvm"

2014-01-09 Thread rafal janik
Greg Preston wrote
>  [qtp243983770-60] ERROR org.apache.solr.core.SolrCore  –
> java.io.IOException: cannot uncache file="_1.nvm": it was separately
> also created in the delegate directory
> at
> org.apache.lucene.store.NRTCachingDirectory.unCache(NRTCachingDirectory.java:297)
> at
> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:216)

Hi Greg, have you figured it out? I have the same problem...

rafal  



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Shard-splitting-error-cannot-uncache-file-1-nvm-tp4086863p4110414.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Zookeeper as Service

2014-01-09 Thread Karthikeyan.Kannappan
I am hosting on Windows OS.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Zookeeper-as-Service-tp4110396p4110413.html
Sent from the Solr - User mailing list archive at Nabble.com.
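
One common approach on Windows is to wrap ZooKeeper's zkServer.cmd with a
service manager such as NSSM; a sketch, where the install path is a
hypothetical example:

  nssm install zookeeper C:\zookeeper\bin\zkServer.cmd
  nssm start zookeeper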


Re: solr OOM Crash

2014-01-09 Thread Sébastien Michel
Hi Sandra,

Excuse me for the late reply.
We use the lotsofcores (http://wiki.apache.org/solr/LotsOfCores) Solr feature,
with around 100 simultaneously loaded cores, but the issue is reproducible
with fewer cores.
We also have a high rate of indexing, and also of reindexing (atomic updates).

We are indexing media file metadata, but also the metadata and contents of
PDFs; the content is stored in a "text" field (stored="true").
Up to release 4.3, Solr uses a growing buffer to uncompress stored fields
(I assume one buffer per Solr core or per shard).
The issue arises when Solr reads some big documents: the buffer of
CompressingStoredFieldsReader grows but never shrinks.  The more such big
documents are read in different threads, the more the heap usage grows,
until the heap has no more free memory available and GC runs continuously.

Analyzing the dump: the byte[] class takes 3 GB out of the 4 GB allocated to
the JVM, mainly referenced by CompressingStoredFieldsReader.

I hope it can help you.

Sébastien


2013/12/30 Sandra Scott 

> Hello Sébastien,
>
> Can you give some information about your environment so I can make sure we
> are having the same problem you had?
> Also, did you find out what caused the GC to go crazy or what caused the
> increased commit rate?
>
> Thanks,
> Sandra
>
>
> On Thu, Dec 19, 2013 at 12:34 PM, Sébastien Michel <
> sebastien.mic...@atos.net> wrote:
>
> > Hi Sandra,
> >
> > I'm not sure if your problem is the same as ours, but we encountered the
> > same issue on our Solr 4.2: the major memory usage was due to
> > CompressingStoredFieldsReader, and GC went crazy.
> > In our context, we have some stored fields and for some documents the
> > content of the text field could be huge.
> >
> > We resolved our issue with the backport of this fix:
> > https://issues.apache.org/jira/browse/LUCENE-4995
> >
> > You should also upgrade to Solr 4.4 or later
> >
> > Regards,
> > Sébastien
> >
> >
> > 2013/12/12 Sandra Scott 
> >
> > > Hello,
> > >
> > > We are experiencing unexplained OOM crashes. We have already seen it a
> > > few times, across our different Solr instances. The crash happens on
> > > only a single shard of the collection.
> > >
> > > Environment details:
> > > 1. Solr 4.3, running on tomcat.
> > > 2. 24 Shards.
> > > 3. Indexing rate of ~800 docs per minute.
> > >
> > > Solrconfig.xml (see the commit sketch at the end of this thread):
> > > 1. Merge factor 4
> > > 2. Soft commit every 10 min
> > > 3. Hard commit every 30 min
> > >
> > > Main findings:
> > > 1. Solr logs: No query failures prior to the OOM, but DOUBLE the amount
> > > of soft and hard commits in comparison to other shards.
> > > 2. Analyzing the dump (VisualVM): The byte[] class takes 4 GB out of the
> > > 5 GB allocated to the JVM, mainly referenced by a
> > > CompressingStoredFieldsReader GC root (which, by looking at the code, we
> > > suspect was created by CompressingStoredFieldsWriter.merge).
> > >
> > > Sub findings:
> > > 1. GC logs: Showed 108 GC failures prior to the crash.
> > > 2. CPU: Overall usage seems fine, but the % of CPU time spent in GC
> > > stays high for 6 min before the OOM.
> > > 3. Memory: Half an hour before the OOM the usage slowly rises, until it
> > > gets to 5.4 GB.
> > >
> > > Has anyone encountered a higher-than-normal commit rate that seems to
> > > increase the merge rate and cause what I described?
> > >
> >
>
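
The commit cadence described in the original report (soft commit every 10
minutes, hard commit every 30 minutes) maps to a solrconfig.xml block along
these lines; a sketch, not the poster's actual config:

<autoCommit>
  <maxTime>1800000</maxTime>      <!-- hard commit every 30 minutes -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>600000</maxTime>       <!-- soft commit every 10 minutes -->
</autoSoftCommit>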