Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Prasanna Ranganathan

On 10/5/09 8:59 PM, "Christian Zambrano"  wrote:

> 
> Wouldn't it be better to use built-in token filters at both index and
> query time that will convert 'it!' to just 'it'? I believe the
> WordDelimiterFilterFactory will do that for you.
> 

 We do have a field that uses WordDelimiterFilter but it also uses a Stemmer
and Stopword filter. That field is used for a stemmed match with a nominal
boost. However, the field I am talking about is for an exact match (only
lowercase and synonym filter) with a higher boost than the field with the
WordDelimiterFilter.
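
(For illustration, such an exact-match field type usually looks something like the
following in schema.xml; the name, tokenizer and synonym file below are only an
example, not the actual schema:)

  <fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>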

Prasanna.



Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Christian Zambrano

Prasanna,

Wouldn't it be better to use built-in token filters at both index and
query time that will convert 'it!' to just 'it'? I believe the
WordDelimiterFilterFactory will do that for you.


Christian

On Oct 5, 2009, at 7:31 PM, Prasanna Ranganathan wrote:

> On 10/5/09 2:46 AM, "Shalin Shekhar Mangar" wrote:
>
>>> Alternatively, is there a filter available which takes in a pattern and
>>> produces additional forms of the token depending on the pattern? The use
>>> case I am looking at here is using such a filter to automate synonym
>>> generation. In our application, quite a few of the synonym file entries
>>> match a specific pattern and having such a filter would make it easier I
>>> believe. Pl. do correct me in case I am missing some unwanted side-effect
>>> with this approach.
>>>
>> I do not understand this. TokenFilters are used for things like stemming,
>> replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts
>> additional tokens (synonyms) from a file for each token.
>>
>> What exactly are you trying to do with synonyms? I guess you could do
>> stemming etc with synonyms but why do you want to do that?
>
> I'll try to explain with an example. Given the term 'it!' in the title, it
> should match both 'it' and 'it!' in the query as an exact match. Currently,
> this is done by using a synonym entry (and index-time SynonymFilter) as
> follows:
>
> it! => it, it!
>
> Now, the above holds true for all cases where you have a title token of the
> form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
> manually for each case, which is not easy to manage and does not scale.
>
> I am hoping to do the same by using an index-time filter that takes in a
> pattern like the PatternReplace filter and adds the newly created token
> instead of replacing the original one. Does this make sense? Am I missing
> something that would break this approach?
>
>> Note that a change in synonym file needs a re-index of the affected
>> documents. Also, the synonym map is kept in memory.
>
> What is the overhead incurred in having an additional filter applied during
> indexing? Is it strictly CPU only?
>
> Thanks a lot for your valuable input.
>
> Regards,
>
> Prasanna.



DataImportHandler problem: Feeding the XPathEntityProcessor with the FieldReaderDataSource

2009-10-05 Thread Lance Norskog
I've added a unit test for the problem down below. It feeds document
field data into the XPathEntityProcessor via the
FieldReaderDataSource, and the XPath EP does not emit unpacked fields.

Running this under the debugger, I can see the supplied StringReader,
with the XML string, being piped into the XPath EP. But somehow the
XPath EP does not pick it apart the right way.

Here is the DIH configuration file separately.
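
(The XML of the actual configuration did not survive the list archive. For
orientation, a config of roughly this shape, with entity and field names taken
from the test below and attribute spelling per the DataImportHandler wiki, is
what feeding the XPathEntityProcessor from a FieldReaderDataSource usually looks
like; MockDataSource stands in for the JDBC source the test wires up
programmatically:)

  <dataConfig>
    <dataSource name="sql" type="MockDataSource"/>
    <dataSource name="fr"  type="FieldReaderDataSource"/>
    <document>
      <entity name="x" dataSource="sql" query="select * from x">
        <field column="dbid"/>
        <field column="blob"/>
        <entity name="y" dataSource="fr" processor="XPathEntityProcessor"
                dataField="x.blob" forEach="/root">
          <field column="name" xpath="/root/name"/>
        </entity>
      </entity>
    </document>
  </dataConfig>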


  
  
  
  




  

  
  


Any ideas?

---

package org.apache.solr.handler.dataimport;

import static 
org.apache.solr.handler.dataimport.AbstractDataImportHandlerTest.createMap;
import junit.framework.TestCase;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.SolrInputField;
import org.apache.solr.handler.dataimport.TestDocBuilder.SolrWriterImpl;
import org.junit.Test;

/*
 * Demonstrate problem feeding XPathEntity from a FieldReaderDatasource
 */

public class TestFieldReaderXPath extends TestCase {
static final String KISSINGER = "Henry";

static final String[][][] DBDOCS = {
{{"dbid", "1"}, {"blob", KISSINGER}},
};

/*
 * Receive a row from SQL and fetch a row from Solr - no value matching
 * stolen from TestDocBuilder
 * */

@Test
public void testSolrEmbedded() throws Exception {
try {
DataImporter di = new DataImporter();
di.loadDataConfig(dih_config_FR_into_XP);
DataImporter.RequestParams rp = new 
DataImporter.RequestParams();
rp.command = "full-import";
rp.requestParams = new HashMap();

DataConfig cfg = di.getConfig();
DataConfig.Entity entity = cfg.document.entities.get(0);
List<Map<String, Object>> l = new ArrayList<Map<String, Object>>();
addDBDocuments(l);
MockDataSource.setIterator("select * from x", 
l.iterator());
entity.dataSrc = new MockDataSource();
entity.isDocRoot = true;
SolrWriterImpl swi = new SolrWriterImpl();
di.runCmd(rp, swi);

assertEquals(1, swi.docs.size());
SolrInputDocument doc = swi.docs.get(0);
SolrInputField field;
field = doc.getField("dbid");
assertEquals(field.getValue().toString(), "1");
field = doc.getField("blob");
assertEquals(field.getValue().toString(), KISSINGER);
field = doc.getField("name");
assertNotNull(field);
assertEquals(field.getValue().toString(), "Henry");
} finally {
MockDataSource.clearCache();
}
}


private void addDBDocuments(List<Map<String, Object>> l) {
for(String[][] dbdoc: DBDOCS) {
l.add(createMap(dbdoc[0][0], dbdoc[0][1], dbdoc[1][0], 
dbdoc[1][1]));
}
}

 String dih_config_FR_into_XP = ""; // the XML content of this string did not
     // survive the list archive; see the configuration sketch earlier in this message


}


Help with denormalizing issues

2009-10-05 Thread Eric Reeves
Hi there,

I'm evaluating Solr as a replacement for our current search server, and am 
trying to determine what the best strategy would be to implement our business 
needs.  Our problem is that we have a catalog schema with products and skus, 
one to many.  The most relevant content being indexed is at the product level, 
in the name and description fields.  However we are interested in filtering by 
sku attributes, and in particular making multiple filters apply to a single 
sku.  For example, find a product that contains a sku that is both blue and on 
sale.  No approach I've tried at collapsing the sku data into the product 
document works for this.  If we put the data in separate fields, there's no way 
to apply multiple filters to the same sku, and if we concatenate all of the 
relevant sku data into a single multivalued field then as I understand it, this 
is just indexed as one large field with extra whitespace between the individual 
entries, so there's still no way to enforce that an AND filter query applies to 
the same sku.
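
(To make the cross-sku problem concrete, with made-up field names: a product
document that flattens its skus into parallel multivalued fields

  sku_color:   [blue, red]
  sku_on_sale: [false, true]

will match fq=sku_color:blue&fq=sku_on_sale:true even though only the red sku is
on sale; each filter is satisfied by a different sku, and nothing in the flattened
document ties the two values back to the same sku.)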

One approach I was considering was to create separate indexes for products and 
skus, and store the product IDs in the sku documents.  Then we could apply our 
own filters to the initially generated list, based on unique query parameters.  
I thought creating a component between query and facet would be a good place to 
add such a filter, but further research seems to indicate that this would break 
paging and sorting.  The only other thing I can think of would be to subclass 
QueryComponent itself, which looks rather daunting: the process() method has no 
hooks for this sort of thing, it seems I would have to copy the entire existing 
implementation and add them myself, which looks to be a fair chunk of work and 
brittle to changes in the trunk code.  Ideally it would be nice to be able to 
handle certain fq parameters in a completely different way, perhaps using a 
custom query parser, but I haven't wrapped my head around how those work.  Does 
any of this sound remotely doable?  Any advice?

The other suggestion we are looking at was given to us by our current search 
provider, which is to index the skus themselves.  It looks as if we may be able 
to make this work using the field collapsing patch from SOLR-236.  I have some 
concerns about this approach though: 1) It will make for a much larger index 
and longer indexing times (products can have 10 or more skus in our catalog).  
2) Because the indexing will be copying the description and name from the 
product it will be indexing the same content more than once, and the number of 
times per product will vary based on the number of skus.  I'm concerned that 
this may skew the scoring algorithm, in particular the inverse frequency part.  
3) I'm not sure about the performance of the field collapsing patch, I've read 
contradictory reports on the web.

I apologize if this is a bit rambling.  If anyone has any advice for our 
situation it would be very helpful.

Thanks,
Eric


Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Prasanna Ranganathan



On 10/5/09 2:46 AM, "Shalin Shekhar Mangar"  wrote:

>> Alternatively, is there a filter available which takes in a pattern and
>> produces additional forms of the token depending on the pattern? The use
>> case I am looking at here is using such a filter to automate synonym
>> generation. In our application, quite a few of the synonym file entries
>> match a specific pattern and having such a filter would make it easier I
>> believe. Pl. do correct me in case I am missing some unwanted side-effect
>> with this approach.
>> 
>> 
> I do not understand this. TokenFilters are used for things like stemming,
> replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts
> additional tokens (synonyms) from a file for each token.
> 
> What exactly are you trying to do with synonyms? I guess you could do
> stemming etc with synonyms but why do you want to do that?
 
 I'll try to explain with an example. Given the term 'it!' in the title, it
should match both 'it' and 'it!' in the query as an exact match. Currently,
this is done by using a synonym entry (and index-time SynonymFilter) as
follows:

 it! => it, it!

 Now, the above holds true for all cases where you have a title token of the
form [a-zA-Z]*!. Handling all of those cases requires adding synonyms
manually for each case, which is not easy to manage and does not scale.

 I am hoping to do the same by using an index-time filter that takes in a
pattern like the PatternReplace filter and adds the newly created token
instead of replacing the original one. Does this make sense? Am I missing
something that would break this approach?

> 
> Note that a change in synonym file needs a re-index of the affected
> documents. Also, the synonym map is kept in memory.

 What is the overhead incurred in having an additional filter applied during
indexing? Is it strictly CPU only?

 Thanks a lot for your valuable input.

Regards,

Prasanna.
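
(As far as I know nothing in Solr does exactly this out of the box, but the
behaviour described above, keeping the original token and injecting a
pattern-derived variant at the same position the way SynonymFilter does, is a
fairly small custom TokenFilter. A rough sketch against the Lucene 2.9
TokenStream attribute API follows; the class name, constructor and the
"group 1 is the variant" convention are illustrative choices, and a factory
extending BaseTokenFilterFactory would still be needed to plug it into
schema.xml:)

  import java.io.IOException;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  public final class PatternVariantFilter extends TokenFilter {
      private final Pattern pattern; // e.g. Pattern.compile("(\\w+)!"); group 1 is the variant to add
      private final TermAttribute termAtt;
      private final PositionIncrementAttribute posIncrAtt;
      private String pendingVariant; // extra token waiting to be emitted
      private State pendingState;    // captured attributes of the original token

      public PatternVariantFilter(TokenStream input, Pattern pattern) {
          super(input);
          this.pattern = pattern;
          this.termAtt = (TermAttribute) addAttribute(TermAttribute.class);
          this.posIncrAtt = (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
      }

      public boolean incrementToken() throws IOException {
          if (pendingVariant != null) {
              // emit the queued variant at the same position as the original token
              restoreState(pendingState);
              termAtt.setTermBuffer(pendingVariant);
              posIncrAtt.setPositionIncrement(0);
              pendingVariant = null;
              return true;
          }
          if (!input.incrementToken()) {
              return false;
          }
          Matcher m = pattern.matcher(termAtt.term());
          if (m.matches()) {
              pendingVariant = m.group(1); // queue e.g. "it" for the original "it!"
              pendingState = captureState();
          }
          return true; // the original token always passes through unchanged
      }
  }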



Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Prasanna Ranganathan

I just saw the reply from Shalin after sending this email. Kindly excuse.


On 10/5/09 5:17 PM, "Prasanna Ranganathan"  wrote:

> 
>  Can someone please give me some pointers to the questions in my earlier
> email? Any and every help is much appreciated.
> 
> Regards,
> 
> Prasanna.
> 
> 
> On 10/2/09 11:01 AM, "Prasanna Ranganathan"  wrote:
> 
>> 
>>  Does the PatternReplaceFilter have an option where you can keep the original
>> token in addition to the modified token? From what I looked at it does not
>> seem to but I want to confirm the same.
>> 
>> Alternatively, is there a filter available which takes in a pattern and
>> produces additional forms of the token depending on the pattern? The use case
>> I am looking at here is using such a filter to automate synonym generation.
>> In our application, quite a few of the synonym file entries match a specific
>> pattern and having such a filter would make it easier I believe. Pl. do
>> correct me in case I am missing some unwanted side-effect with this approach.
>> 
>> Continuing on that line, what is the performance hit in having additional
>> index-time filters as opposed to using a synonym file with more entries? How
>> does the overhead of using a bigger synonym file as opposed to additional
>> filters compare?
>> 
>> Thanks in advance for the help.
>> 
>> Regards,
>> 
>> Prasanna.



Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Prasanna Ranganathan

 Can someone please give me some pointers to the questions in my earlier
email? Any and every help is much appreciated.

Regards,

Prasanna.


On 10/2/09 11:01 AM, "Prasanna Ranganathan" 
wrote:

> 
>  Does the PatternReplaceFilter have an option where you can keep the original
> token in addition to the modified token? From what I looked at it does not
> seem to but I want to confirm the same.
> 
> Alternatively, is there a filter available which takes in a pattern and
> produces additional forms of the token depending on the pattern? The use case
> I am looking at here is using such a filter to automate synonym generation. In
> our application, quite a few of the synonym file entries match a specific
> pattern and having such a filter would make it easier I believe. Pl. do
> correct me in case I am missing some unwanted side-effect with this approach.
> 
> Continuing on that line, what is the performance hit in having additional
> index-time filters as opposed to using a synonym file with more entries? How
> does the overhead of using a bigger synonym file as opposed to additional
> filters compare?
> 
> Thanks in advance for the help.
> 
> Regards,
> 
> Prasanna.



RE: About SolrJ for XML

2009-10-05 Thread Chaitali Gupta
Hi, 

Thanks a lot. It worked!! 

I was wondering if there is a way in SolrJ to print out the size of the index 
being generated? Or else how do I determine the total size of the generated 
index ?
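
(As far as I know SolrJ itself does not report the on-disk size of an index; if
the client can see the Solr data directory, one rough way is simply to add up the
file sizes in the index directory. A small sketch, with an example path:)

  import java.io.File;

  public class IndexSizeReporter {
      public static void main(String[] args) {
          // point this at the core's index directory (example path)
          File indexDir = new File("/usr/local/solr/data/index");
          long bytes = 0;
          for (File f : indexDir.listFiles()) {
              if (f.isFile()) {
                  bytes += f.length(); // sums the Lucene segment files
              }
          }
          System.out.println("index size: " + bytes + " bytes ("
                  + (bytes / (1024 * 1024)) + " MB)");
      }
  }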

Thanks, 
Chaitali 


--- On Mon, 10/5/09, Feak, Todd  wrote:

From: Feak, Todd 
Subject: RE: About SolrJ for XML
To: "solr-user@lucene.apache.org" 
Date: Monday, October 5, 2009, 5:17 PM

It looks like you have some confusion about queries vs. facets. You may want to 
look at the Solr wiki regarding facets a bit. In the meantime, if you just 
want to query for that field containing "21"...

I would suggest that you don't set the query type, don't set any facet fields, 
and only set the query. Set the query to "field:21" where "field" should be 
replaced with the fieldname that has a "21" in it.

For example, if the field name is foo, try this instead:

SolrQuery query = new SolrQuery();
query.setQuery("foo:21");  
QueryResponse qr = server.query(query);
SolrDocumentList sdl = qr.getResults();


To delve into more detail, what your original code did was query for a "21" in 
the default field (check your solrconfig.xml to see what is default). It then 
faceted the query results by the "id" field and "weight" fields. Because there 
were no search results at all, the faceting request didn't do anything. I'm not 
sure why you switched the query type to DisMax, as you didn't issue a query 
that would leverage it.

-Todd

-Original Message-
From: Chaitali Gupta [mailto:chaitaligupt...@yahoo.com] 
Sent: Monday, October 05, 2009 2:05 PM
To: solr-user@lucene.apache.org
Subject: About SolrJ for XML 

Hi, 

I am new to Solr. I am using Solr version 1.3.

I would like to index XML files using the SolrJ API. I have gone through the solr 
mailing list's emails and have been able to index XML files. But when I try to 
query those files using SolrJ, I get no output. In particular, I do not find 
correct results for the numeric fields that I have specified in the schema.xml file 
in the config directory for my XML files. I have made those fields "indexed" 
and "stored" by using "indexed=true" and "stored=true". I am using the 
following code in order to search for data (in the following code, I am trying 
to find documents with a weight of 21) - 

 SolrQuery query = new SolrQuery();
 query.setQueryType("dismax");
 query.setFacet(true);
 query.addFacetField("id");
 query.addFacetField("weight");
 query.setQuery("21");  
 QueryResponse qr = server.query(query);
 SolrDocumentList sdl = qr.getResults();

Am I doing anything wrong? Why do I get zero results even when there is an XML 
file with a weight of 21? What are the other ways of doing numeric queries 
in SolrJ? 

Also, I would like to know how to get the exact size of the index being 
generated by Solr. I am using a single machine to generate and query the index. 
When I look at the index directory, I see that the size of the files in the 
index directory is much less than the size reported by the "total" column of 
the "ls -lh" command. Does anyone have any idea why this is the case? 

Thanks in advance. Waiting for your reply soon. 

Regards
Chaitali 



      




  

RE: cleanup old index directories on slaves

2009-10-05 Thread Francis Yakin
I use it in our env (prod) and it has been working fine for years now. Note that it 
only cleans up the snapshots, not the index.

I added it to cron to run once a day to clean up.

-francis
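
(For concreteness, the crontab entry amounts to something like the line below;
the paths are examples, and -D is the script's "remove snapshots older than N
days" option, so it is worth double-checking against snapcleaner's usage message
for your version:)

  # clean up snapshots older than 2 days, every night at 3am
  0 3 * * * /home/solr/solr/bin/snapcleaner -D 2 -d /home/solr/solr/data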

-Original Message-
From: Feak, Todd [mailto:todd.f...@smss.sony.com] 
Sent: Monday, October 05, 2009 2:34 PM
To: solr-user@lucene.apache.org
Subject: RE: cleanup old index directories on slaves

We use the snapcleaner script.

http://wiki.apache.org/solr/SolrCollectionDistributionScripts#snapcleaner

Will that do the job?

-Todd

-Original Message-
From: solr jay [mailto:solr...@gmail.com] 
Sent: Monday, October 05, 2009 1:58 PM
To: solr-user@lucene.apache.org
Subject: cleanup old index directories on slaves

Is there a reliable way to safely clean up index directories? This is needed
mainly on slave side as in several situations, an old index directory is
replaced with a new one, and I'd like to remove those that are no longer in
use.

Thanks,

-- 
J



RE: cleanup old index directories on slaves

2009-10-05 Thread Feak, Todd
We use the snapcleaner script.

http://wiki.apache.org/solr/SolrCollectionDistributionScripts#snapcleaner

Will that do the job?

-Todd

-Original Message-
From: solr jay [mailto:solr...@gmail.com] 
Sent: Monday, October 05, 2009 1:58 PM
To: solr-user@lucene.apache.org
Subject: cleanup old index directories on slaves

Is there a reliable way to safely clean up index directories? This is needed
mainly on slave side as in several situations, an old index directory is
replaced with a new one, and I'd like to remove those that are no longer in
use.

Thanks,

-- 
J



Re: cleanup old index directories on slaves

2009-10-05 Thread Bill Au
Have you looked at snapcleaner?

http://wiki.apache.org/solr/SolrCollectionDistributionScripts#snapcleaner
http://wiki.apache.org/solr/CollectionDistribution#snapcleaner

Bill

On Mon, Oct 5, 2009 at 4:58 PM, solr jay  wrote:

> Is there a reliable way to safely clean up index directories? This is
> needed
> mainly on slave side as in several situations, an old index directory is
> replaced with a new one, and I'd like to remove those that are no longer in
> use.
>
> Thanks,
>
> --
> J
>


RE: About SolrJ for XML

2009-10-05 Thread Feak, Todd
It looks like you have some confusion about queries vs. facets. You may want to 
look at the Solr wiki regarding facets a bit. In the meantime, if you just 
want to query for that field containing "21"...

I would suggest that you don't set the query type, don't set any facet fields, 
and only set the query. Set the query to "field:21" where "field" should be 
replaced with the fieldname that has a "21" in it.

For example, if the field name is foo, try this instead:

SolrQuery query = new SolrQuery();
query.setQuery("foo:21");  
QueryResponse qr = server.query(query);
SolrDocumentList sdl = qr.getResults();


To delve into more detail, what your original code did was query for a "21" in 
the default field (check your solrconfig.xml to see what is default). It then 
faceted the query results by the "id" field and "weight" fields. Because there 
were no search results at all, the faceting request didn't do anything. I'm not 
sure why you switched the query type to DisMax, as you didn't issue a query 
that would leverage it.

-Todd

-Original Message-
From: Chaitali Gupta [mailto:chaitaligupt...@yahoo.com] 
Sent: Monday, October 05, 2009 2:05 PM
To: solr-user@lucene.apache.org
Subject: About SolrJ for XML 

Hi, 

I am new to Solr. I am using Solr version 1.3.

I would like to index XML files using the SolrJ API. I have gone through the solr 
mailing list's emails and have been able to index XML files. But when I try to 
query those files using SolrJ, I get no output. In particular, I do not find 
correct results for the numeric fields that I have specified in the schema.xml file 
in the config directory for my XML files. I have made those fields "indexed" 
and "stored" by using "indexed=true" and "stored=true". I am using the 
following code in order to search for data (in the following code, I am trying 
to find documents with a weight of 21) - 

 SolrQuery query = new SolrQuery();
 query.setQueryType("dismax");
 query.setFacet(true);
 query.addFacetField("id");
 query.addFacetField("weight");
 query.setQuery("21");  
 QueryResponse qr = server.query(query);
 SolrDocumentList sdl = qr.getResults();

Am I doing anything wrong? Why do I get zero results even when there is an XML 
file with a weight of 21? What are the other ways of doing numeric queries 
in SolrJ? 

Also, I would like to know how to get the exact size of the index being 
generated by Solr. I am using a single machine to generate and query the index. 
When I look at the index directory, I see that the size of the files in the 
index directory is much less than the size reported by the "total" column of 
the "ls -lh" command. Does anyone have any idea why this is the case? 

Thanks in advance. Waiting for your reply soon. 

Regards
Chaitali 



  



About SolrJ for XML

2009-10-05 Thread Chaitali Gupta
Hi, 

I am new to Solr. I am using Solr version 1.3.

I would like to index XML files using the SolrJ API. I have gone through the solr 
mailing list's emails and have been able to index XML files. But when I try to 
query those files using SolrJ, I get no output. In particular, I do not find 
correct results for the numeric fields that I have specified in the schema.xml file 
in the config directory for my XML files. I have made those fields "indexed" 
and "stored" by using "indexed=true" and "stored=true". I am using the 
following code in order to search for data (in the following code, I am trying 
to find documents with a weight of 21) - 

 SolrQuery query = new SolrQuery();
 query.setQueryType("dismax");
 query.setFacet(true);
 query.addFacetField("id");
 query.addFacetField("weight");
 query.setQuery("21");  
 QueryResponse qr = server.query(query);
 SolrDocumentList sdl = qr.getResults();

Am I doing anything wrong? Why do I get zero results even when there is an XML 
file with a weight of 21? What are the other ways of doing numeric queries 
in SolrJ? 

Also, I would like to know how to get the exact size of the index being 
generated by Solr. I am using a single machine to generate and query the index. 
When I look at the index directory, I see that the size of the files in the 
index directory is much less than the size reported by the "total" column of 
the "ls -lh" command. Does anyone have any idea why this is the case? 

Thanks in advance. Waiting for your reply soon. 

Regards
Chaitali 



  

Re: Solr Trunk Heap Space Issues

2009-10-05 Thread Yonik Seeley
On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn  wrote:
> Ok we have done some more testing on this issue.  When I only have the 1
> core the reindex completes fine.  However, when I added a second core with
> no documents it runs out of heap again.  This time the heap was 322Mb of
> LRUCache.  The 1 query that warms returns exactly 2 documents so I have no
> idea where the LRUCache is getting its information or what is even in there.

I guess the obvious thing to check would be the custom search component.
Does it access documents?  I don't see how else the document cache
could self populate with so many entries (assuming it is the document
cache again).

-Yonik
http://www.lucidimagination.com




>
> --
> Jeff Newburn
> Software Engineer, Zappos.com
> jnewb...@zappos.com - 702-943-7562
>
>
>> From: Yonik Seeley 
>> Reply-To: 
>> Date: Mon, 5 Oct 2009 13:32:32 -0400
>> To: 
>> Subject: Re: Solr Trunk Heap Space Issues
>>
>> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn  wrote:
>>> Ok I have eliminated all queries for warming and am still getting the heap
>>> space dump.  Any ideas at this point what could be wrong?  This seems like a
>>> huge increase in memory to go from indexing without issues to not being able
>>> to even with warming off.
>>
>> Do you have any custom Analyzers, Tokenizers, TokenFilters?
>> Another change is that token streams are reused by caching in a
>> thread-local, so every thread in your server could potentially have a
>> copy of an analysis chain (token stream) per field that you have used.
>>  This normally shouldn't be an issue since these will be small.  Also,
>> how many unique fields do you have?
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>>
>>> Jeff Newburn
>>> Software Engineer, Zappos.com
>>> jnewb...@zappos.com - 702-943-7562
>>>
>>>
 From: Jeff Newburn 
 Reply-To: 
 Date: Thu, 01 Oct 2009 08:41:18 -0700
 To: "solr-user@lucene.apache.org" 
 Subject: Solr Trunk Heap Space Issues

 I am trying to update to the newest version of solr from trunk as of May
 5th.  I updated and compiled from trunk as of yesterday (09/30/2009).  When
 I try to do a full import I am receiving a GC heap error after changing
 nothing in the configuration files.  Why would this happen in the most
 recent versions but not in the version from a few months ago.  The stack
 trace is below.

 Oct 1, 2009 8:34:32 AM org.apache.solr.update.processor.LogUpdateProcessor
 finish
 INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, 167353,
 ...(83 more)]} 0 35991
 Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log
 SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
     at java.util.Arrays.copyOfRange(Arrays.java:3209)
     at java.lang.String.<init>(String.java:215)
     at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384)
     at 
 com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280)
     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
     at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt
 reamHandlerBase.java:54)
     at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
 java:131)
     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
     at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
 38)
     at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
 241)
     at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
 FilterChain.java:235)
     at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
 ain.java:206)
     at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
 va:233)
     at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
 va:175)
     at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128
 )
     at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102
 )
     at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java
 :109)
     at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
     at
 org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:
 879)
     at
 org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(H
 ttp11NioProtocol.java:719)
     at
 org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:
 2080)
     at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Thr

cleanup old index directories on slaves

2009-10-05 Thread solr jay
Is there a reliable way to safely clean up index directories? This is needed
mainly on slave side as in several situations, an old index directory is
replaced with a new one, and I'd like to remove those that are no longer in
use.

Thanks,

-- 
J


Re: Solr Trunk Heap Space Issues

2009-10-05 Thread Jeff Newburn
Ok we have done some more testing on this issue.  When I only have the 1
core the reindex completes fine.  However, when I added a second core with
no documents it runs out of heap again.  This time the heap was 322Mb of
LRUCache.  The 1 query that warms returns exactly 2 documents so I have no
idea where the LRUCache is getting its information or what is even in there.


-- 
Jeff Newburn
Software Engineer, Zappos.com
jnewb...@zappos.com - 702-943-7562


> From: Yonik Seeley 
> Reply-To: 
> Date: Mon, 5 Oct 2009 13:32:32 -0400
> To: 
> Subject: Re: Solr Trunk Heap Space Issues
> 
> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn  wrote:
>> Ok I have eliminated all queries for warming and am still getting the heap
>> space dump.  Any ideas at this point what could be wrong?  This seems like a
>> huge increase in memory to go from indexing without issues to not being able
>> to even with warming off.
> 
> Do you have any custom Analyzers, Tokenizers, TokenFilters?
> Another change is that token streams are reused by caching in a
> thread-local, so every thread in your server could potentially have a
> copy of an analysis chain (token stream) per field that you have used.
>  This normally shouldn't be an issue since these will be small.  Also,
> how many unique fields do you have?
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 
> 
>> Jeff Newburn
>> Software Engineer, Zappos.com
>> jnewb...@zappos.com - 702-943-7562
>> 
>> 
>>> From: Jeff Newburn 
>>> Reply-To: 
>>> Date: Thu, 01 Oct 2009 08:41:18 -0700
>>> To: "solr-user@lucene.apache.org" 
>>> Subject: Solr Trunk Heap Space Issues
>>> 
>>> I am trying to update to the newest version of solr from trunk as of May
>>> 5th.  I updated and compiled from trunk as of yesterday (09/30/2009).  When
>>> I try to do a full import I am receiving a GC heap error after changing
>>> nothing in the configuration files.  Why would this happen in the most
>>> recent versions but not in the version from a few months ago.  The stack
>>> trace is below.
>>> 
>>> Oct 1, 2009 8:34:32 AM org.apache.solr.update.processor.LogUpdateProcessor
>>> finish
>>> INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, 167353,
>>> ...(83 more)]} 0 35991
>>> Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log
>>> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>     at java.util.Arrays.copyOfRange(Arrays.java:3209)
>>>     at java.lang.String.<init>(String.java:215)
>>>     at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384)
>>>     at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
>>>     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280)
>>>     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
>>>     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
>>>     at
>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt
>>> reamHandlerBase.java:54)
>>>     at
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
>>> java:131)
>>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>>     at
>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
>>> 38)
>>>     at
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
>>> 241)
>>>     at
>>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
>>> FilterChain.java:235)
>>>     at
>>> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
>>> ain.java:206)
>>>     at
>>> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
>>> va:233)
>>>     at
>>> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
>>> va:175)
>>>     at
>>> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128
>>> )
>>>     at
>>> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102
>>> )
>>>     at
>>> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java
>>> :109)
>>>     at
>>> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>>>     at
>>> org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:
>>> 879)
>>>     at
>>> org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(H
>>> ttp11NioProtocol.java:719)
>>>     at
>>> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:
>>> 2080)
>>>     at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.ja
>>> va:886)
>>>     at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9
>>> 08)
>>>     at java.lang.Thread.run(Thread.java:619)
>>> 
>>> Oct 1, 2009 8:40:06 AM org.apache.solr.core.SolrCore execute
>>> INFO: [zeta-main] webapp=/solr path=/update params={} status=500 QTime=5265
>>> Oct 1, 2009 8:40:12 AM org.apache.solr.common.SolrException log
>>> SEVERE: java.lang.OutOfMemoryError: G

Re: Need "OR" in DisMax Query

2009-10-05 Thread David Giffin
So, I removed the stop word OR from the stopwords file and still get the same
result. Using the standard query handler syntax like this,
"fq=((tags:red)+OR+(tags:green))", I get 421,000 results. Using dismax,
"q=red+OR+green", I get 29,000 results. The debug output from
parsedquery_toString shows this:

+(((tags:red)~0.01 (tags:green)~0.01)~2)

It feels like the dismax handler is not handling the "OR" properly. I
also tried "q=red+|+green" and got the same 29,000 results.
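
(For what it is worth, the trailing ~2 in that parsed query is the
minimum-number-should-match that dismax sets on the BooleanQuery from its mm
parameter; dismax does not interpret OR as an operator at all. If the intent is a
true OR over the terms, lowering mm on the request, or in the handler defaults,
should behave much more like the standard handler's OR, e.g.:

  q=red green&qt=dismax&mm=1

mm is documented on the DisMaxRequestHandler wiki page.)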

Thanks,
David

On Mon, Oct 5, 2009 at 3:02 PM, Christian Zambrano  wrote:
> David,
>
> If your schema includes fields with analyzers that use the StopFilterFactory
> and the dismax QueryHandler is set-up to search within those fields, then
> you are correct.
>
>
> On 10/05/2009 01:36 PM, David Giffin wrote:
>>
>> Hi There,
>>
>> Maybe I'm missing something, but I can't seem to get the dismax
>> request handler to perform an OR query. It appears that OR is removed
>> by the stop words. I'd like to do something like
>> "qt=dismax&q=red+OR+green" and get all green and all red results.
>>
>> Thanks,
>> David
>>
>


Re: A little help with indexing joined words

2009-10-05 Thread Avlesh Singh
Zambrano, I was too quick to respond to your idf explanation. I definitely
did not mean that "idf" and "length-norms" are the same thing.

Andrew, this is how I would have done it -
First, I would create a field called "prefix_text", as underneath, in my
schema.xml:
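
(The fieldType XML did not survive the list archive; the shape of such a
"prefix_text" type is roughly the following. The attribute values are only an
example, and this uses the per-token EdgeNGramFilterFactory variant rather than
the EdgeNGramTokenizerFactory mentioned later in the thread:)

  <fieldType name="prefix_text" class="solr.TextField" omitNorms="true"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>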

















Second, I would declare a field of this type and populate it (using
copyField) while indexing.

Third, while querying I would query on both the fields. I would boost
the matches for the original field to a large extent over the n-grammed field.
In scenarios where "Dragon Fly" is expected to match against "Dragonfly" in the
index, the query on the original field would not give you any matches, thereby
bringing results from the prefix_text field right there on top.

Hope this helps.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 11:10 PM, Christian Zambrano wrote:

> Would you mind explaining how omitNorm has any effect on the IDF problem I
> described earlier?
>
> I agree with your second sentence. I had to use the NGramTokenFilter to
> accommodate partial matches.
>
>
> On 10/05/2009 12:11 PM, Avlesh Singh wrote:
>
>> Using synonyms might be a better solution because the use of
>>> EdgeNGramTokenizerFactory has the potential of creating a large number of
>>> token which will artificially increase the number of tokens in the index
>>> which in turn will affect the IDF score.
>>>
>>>
>>>
>> Well, I don't see a reason as to why someone would need a length based
>> normalization on such matches. I always have done omitNorms while using
>> fields with this filter.
>>
>> Yes, synonyms might an answer when you have limited number of such words
>> (phrases) and their possible combinations.
>>
>> Cheers
>> Avlesh
>>
>> On Mon, Oct 5, 2009 at 10:32 PM, Christian Zambrano> >wrote:
>>
>>
>>
>>> Using synonyms might be a better solution because the use of
>>> EdgeNGramTokenizerFactory has the potential of creating a large number of
>>> token which will artificially increase the number of tokens in the index
>>> which in turn will affect the IDF score.
>>>
>>> A query for "borderland" should have returned results though. It is
>>> difficult to troubleshoot why it didn't without knowing what query you
>>> used,
>>> and what kind of analysis is taking place.
>>>
>>> Have you tried using the analysis page on the admin section to see what
>>> tokens gets generated for 'Borderlands'?
>>>
>>> Christian
>>>
>>>
>>> On 10/05/2009 11:01 AM, Avlesh Singh wrote:
>>>
>>>
>>>
 We have indexed a product database and have come across some search
 terms


> where zero results are returned.  There are products in the index with
> 'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
> 'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
> respectively.
>
>
>
>
>
 "Borderland" should have worked for a regular text field. For all other
 desired matches you can use EdgeNGramTokenizerFactory.

 Cheers
 Avlesh

 On Mon, Oct 5, 2009 at 7:51 PM, Andrew McCombe
 wrote:





> Hi
> I am hoping someone can point me in the right direction with regards to
> indexing words that are concatenated together to make other words or
> product
> names.
>
> We have indexed a product database and have come across some search
> terms
> where zero results are returned.  There are products in the index with
> 'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
> 'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
> respectively.
>
> Where do I look to resolve this?  The product name field is indexed
> using
> a
> text field type.
>
> Thanks in advance
> Andrew
>
>
>
>
>



>>>
>>>
>>
>>
>


Re: Need "OR" in DisMax Query

2009-10-05 Thread Christian Zambrano

David,

If your schema includes fields with analyzers that use the 
StopFilterFactory and the dismax QueryHandler is set-up to search within 
those fields, then you are correct.



On 10/05/2009 01:36 PM, David Giffin wrote:

Hi There,

Maybe I'm missing something, but I can't seem to get the dismax
request handler to perform an OR query. It appears that OR is removed
by the stop words. I'd like to do something like
"qt=dismax&q=red+OR+green" and get all green and all red results.

Thanks,
David
   


Re: wildcard searches

2009-10-05 Thread Christian Zambrano



On 10/05/2009 01:18 PM, Avlesh Singh wrote:
>> First of all, I know of no way of doing wildcard phrase queries.
>
> http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_combine_wildcard_and_phrase_search.2C_e.g._.22foo_ba.2A.22.3F

Thanks for that link

>> When I said not filters, I meant TokenFilters which is what I believe you
>> mean by 'not analyzed'
>
> Analysis is a Lucene way of configuring tokenizers and filters for a field
> (index time and query time). I guess, both of us mean the same thing.

You are correct. I should have said 'Not Analyzed'. Thanks for the
correction.

> Cheers
> Avlesh

On Mon, Oct 5, 2009 at 11:04 PM, Christian Zambrano wrote:

   

Avlesh, I don't understand your answer.

First of all, I know of no way of doing wildcard phrase queries.

When I said not filters, I meant TokenFilters which is what I believe you
mean by 'not analyzed'


On 10/05/2009 12:27 PM, Avlesh Singh wrote:

 

No filters are applied to wildcard/fuzzy searches.
   



 

Ah! Not like that ..
I guess, it is just that the phrase searches using wildcards are not
analyzed.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:42 PM, Christian Zambrano   

wrote:
 



   

No filters are applied to wildcard/fuzzy searches.

I couldn't find a reference to this on either the solr or lucene
documentation but I read it on the Solr book from PACKT


On 10/05/2009 12:09 PM, Angel Ice wrote:



 

Hi everyone,

I have a little question regarding the search engine when a wildcard
character is used in the query.
Let's take the following example :

- I have sent in indexation the word Hésitation (with an accent on the
"e")
- The filters applied to the field that will handle this word, result in
the indexation of "esit" (the mute H is suppressed (home made filter),
the
accent too (IsoLatin1Filter), and the SnowballPorterFilter suppress the
"ation".

When i search for "hesitation", "esitation", "ésitation" etc ... all is
OK, the document is returned.
But as soon as I use a wildcard, like "hésita*", the document is not
returned. In fact, I have to put the wildcard in a manner that match the
indexed term exactly (example "esi*")

Does the search engine applies the filters to the word that prefix the
wildcard ? Or does it use this prefix verbatim ?

Thanks for you help.

Laurent



   


 


   
 
   


Need "OR" in DisMax Query

2009-10-05 Thread David Giffin
Hi There,

Maybe I'm missing something, but I can't seem to get the dismax
request handler to perform an OR query. It appears that OR is removed
by the stop words. I'd like to do something like
"qt=dismax&q=red+OR+green" and get all green and all red results.

Thanks,
David


Re: wildcard searches

2009-10-05 Thread Avlesh Singh
Zambrano is right, Laurent. The analyzers for a field are not invoked for
wildcard queries. Your custom filter is not even getting executed at
query-time.
If you want to enable wildcard queries, preserving the original token (while
processing each token in your filter) might work.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:39 PM, Angel Ice  wrote:

> Hi everyone,
>
> I have a little question regarding the search engine when a wildcard
> character is used in the query.
> Let's take the following example :
>
> - I have sent in indexation the word Hésitation (with an accent on the "e")
> - The filters applied to the field that will handle this word, result in
> the indexation of "esit" (the mute H is suppressed (home made filter), the
> accent too (IsoLatin1Filter), and the SnowballPorterFilter suppress the
> "ation".
>
> When I search for "hesitation", "esitation", "ésitation" etc ... all is OK,
> the document is returned.
> But as soon as I use a wildcard, like "hésita*", the document is not
> returned. In fact, I have to put the wildcard in a manner that match the
> indexed term exactly (example "esi*")
>
> Does the search engine apply the filters to the word that prefixes the
> wildcard? Or does it use this prefix verbatim?
>
> Thanks for your help.
>
> Laurent
>
>
>
>


Re: wildcard searches

2009-10-05 Thread Avlesh Singh
>
> First of all, I know of no way of doing wildcard phrase queries.
>
http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_combine_wildcard_and_phrase_search.2C_e.g._.22foo_ba.2A.22.3F

> When I said not filters, I meant TokenFilters which is what I believe you
> mean by 'not analyzed'
>
Analysis is a Lucene way of configuring tokenizers and filters for a field
(index time and query time). I guess, both of us mean the same thing.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 11:04 PM, Christian Zambrano wrote:

> Avlesh, I don't understand your answer.
>
> First of all, I know of no way of doing wildcard phrase queries.
>
> When I said not filters, I meant TokenFilters which is what I believe you
> mean by 'not analyzed'
>
>
> On 10/05/2009 12:27 PM, Avlesh Singh wrote:
>
>> No filters are applied to wildcard/fuzzy searches.
>>>
>>>
>>>
>> Ah! Not like that ..
>> I guess, it is just that the phrase searches using wildcards are not
>> analyzed.
>>
>> Cheers
>> Avlesh
>>
>> On Mon, Oct 5, 2009 at 10:42 PM, Christian Zambrano> >wrote:
>>
>>
>>
>>> No filters are applied to wildcard/fuzzy searches.
>>>
>>> I couldn't find a reference to this on either the solr or lucene
>>> documentation but I read it on the Solr book from PACKT
>>>
>>>
>>> On 10/05/2009 12:09 PM, Angel Ice wrote:
>>>
>>>
>>>
 Hi everyone,

 I have a little question regarding the search engine when a wildcard
 character is used in the query.
 Let's take the following example :

 - I have sent in indexation the word Hésitation (with an accent on the
 "e")
 - The filters applied to the field that will handle this word, result in
 the indexation of "esit" (the mute H is suppressed (home made filter),
 the
 accent too (IsoLatin1Filter), and the SnowballPorterFilter suppress the
 "ation".

 When i search for "hesitation", "esitation", "ésitation" etc ... all is
 OK, the document is returned.
 But as soon as I use a wildcard, like "hésita*", the document is not
 returned. In fact, I have to put the wildcard in a manner that match the
 indexed term exactly (example "esi*")

 Does the search engine applies the filters to the word that prefix the
 wildcard ? Or does it use this prefix verbatim ?

 Thanks for you help.

 Laurent



>>>
>>>
>>
>>
>


Re: A little help with indexing joined words

2009-10-05 Thread Robert Muir
fyi, if you don't want to turn off norms entirely, try this option in
lucene 2.9 DefaultSimilarity:

public void setDiscountOverlaps(boolean v)

Determines whether overlap tokens (Tokens with 0 position increment)
are ignored when computing norm. By default this is false, meaning
overlap tokens are counted just like non-overlap tokens.
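
(In Solr this can be wired in through the <similarity> element in schema.xml by
pointing it at a small DefaultSimilarity subclass; a minimal sketch, with an
example package and class name:)

  package com.example.solr;

  import org.apache.lucene.search.DefaultSimilarity;

  // Ignores overlap tokens (position increment 0) when computing length norms.
  // Reference it from schema.xml with
  //   <similarity class="com.example.solr.DiscountOverlapsSimilarity"/>
  public class DiscountOverlapsSimilarity extends DefaultSimilarity {
      public DiscountOverlapsSimilarity() {
          setDiscountOverlaps(true);
      }
  }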

> Well, I don't see a reason as to why someone would need a length based
> normalization on such matches. I always have done omitNorms while using
> fields with this filter.

--
Robert Muir
rcm...@gmail.com


RE: Solr Timeouts

2009-10-05 Thread Giovanni Fernandez-Kincade
I just grabbed another stack trace for a thread that has been similarly 
blocking for over an hour. Notice that there is no Commit in this one:

http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05
org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
org.apache.lucene.index.SegmentTermEnum.next()
org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
org.apache.lucene.index.TermInfosReader.get(Term, boolean)
org.apache.lucene.index.TermInfosReader.get(Term)
org.apache.lucene.index.SegmentTermDocs.seek(Term)
org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
org.apache.lucene.index.IndexWriter.applyDeletes()
org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer)
org.apache.lucene.index.IndexWriter.updateDocument(Term, Document)
org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand)
org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler,
 AddUpdateCommand)
org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler)
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
 SolrQueryResponse, ContentStream)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
ServletResponse, FilterChain)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
 ServletResponse)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
ServletResponse)
org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
 Object[])
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, 
Object[])
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
java.lang.Thread.run()


-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Monday, October 05, 2009 1:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Timeouts

OK... next step is to verify that SolrCell doesn't have a bug that
causes it to commit.
I'll try and verify today unless someone else beats me to it.

-Yonik
http://www.lucidimagination.com

On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade
 wrote:
> I'm fairly certain that all of the indexing jobs are calling SOLR with 
> commit=false. They all construct the indexing URLs using a CLR function I 
> wrote, which takes in a Commit parameter, which is always set to false.
>
> Also, I don't see any calls to commit in the Tomcat logs (whereas normally 
> when I make a commit call I do).
>
> This suggests that Solr is doing it automatically, but the extract handler 
> doesn't seem to be the problem:
>   class="org.apache.solr.handler.extraction.ExtractingRequestHandler" 
> startup="lazy">
>    
>      ignored_
>      fileData
>    
>  
>
>
> There is no external config file specified, and I don't see anything about 
> commits here.
>
> I've tried setting up more detailed indexer logging but haven't been able to 
> get it to work:
> true
>
> I tried relative and absolute paths, but no dice so far.
>
> Any other ideas?
>
> -Gio.
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Monday, October 05, 2009 12:52 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Timeouts
>
>> This is what one of my SOLR requests look like:
>>
>> http://titans:8080/solr/update/extract/?literal.versionId=684936&literal.filingDate=1997-12-04T00:00:00Z&literal.formTypeId=95&literal.companyId=3567904&literal.sourceI

Re: Solr Trunk Heap Space Issues

2009-10-05 Thread Jeff Newburn
We only have 1 custom search component, and none of the ones you listed.
Additionally, the last heap dump showed LRUCache and 4 instances of
IndexSchema as all of the memory.  There were 5 cores live but the other 4
are all empty.  I am trying again with all cores offline but the one we are
trying to reindex.
-- 
Jeff Newburn
Software Engineer, Zappos.com
jnewb...@zappos.com - 702-943-7562


> From: Yonik Seeley 
> Reply-To: 
> Date: Mon, 5 Oct 2009 13:32:32 -0400
> To: 
> Subject: Re: Solr Trunk Heap Space Issues
> 
> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn  wrote:
>> Ok I have eliminated all queries for warming and am still getting the heap
>> space dump.  Any ideas at this point what could be wrong?  This seems like a
>> huge increase in memory to go from indexing without issues to not being able
>> to even with warming off.
> 
> Do you have any custom Analyzers, Tokenizers, TokenFilters?
> Another change is that token streams are reused by caching in a
> thread-local, so every thread in your server could potentially have a
> copy of an analysis chain (token stream) per field that you have used.
>  This normally shouldn't be an issue since these will be small.  Also,
> how many unique fields do you have?
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 
> 
>> Jeff Newburn
>> Software Engineer, Zappos.com
>> jnewb...@zappos.com - 702-943-7562
>> 
>> 
>>> From: Jeff Newburn 
>>> Reply-To: 
>>> Date: Thu, 01 Oct 2009 08:41:18 -0700
>>> To: "solr-user@lucene.apache.org" 
>>> Subject: Solr Trunk Heap Space Issues
>>> 
>>> I am trying to update to the newest version of solr from trunk as of May
>>> 5th.  I updated and compiled from trunk as of yesterday (09/30/2009).  When
>>> I try to do a full import I am receiving a GC heap error after changing
>>> nothing in the configuration files.  Why would this happen in the most
>>> recent versions but not in the version from a few months ago.  The stack
>>> trace is below.
>>> 
>>> Oct 1, 2009 8:34:32 AM org.apache.solr.update.processor.LogUpdateProcessor
>>> finish
>>> INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, 167353,
>>> ...(83 more)]} 0 35991
>>> Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log
>>> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>     at java.util.Arrays.copyOfRange(Arrays.java:3209)
>>>     at java.lang.String.<init>(String.java:215)
>>>     at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384)
>>>     at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
>>>     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280)
>>>     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
>>>     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
>>>     at
>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt
>>> reamHandlerBase.java:54)
>>>     at
>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
>>> java:131)
>>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>>     at
>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
>>> 38)
>>>     at
>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
>>> 241)
>>>     at
>>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
>>> FilterChain.java:235)
>>>     at
>>> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
>>> ain.java:206)
>>>     at
>>> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
>>> va:233)
>>>     at
>>> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
>>> va:175)
>>>     at
>>> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128
>>> )
>>>     at
>>> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102
>>> )
>>>     at
>>> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java
>>> :109)
>>>     at
>>> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>>>     at
>>> org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:
>>> 879)
>>>     at
>>> org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(H
>>> ttp11NioProtocol.java:719)
>>>     at
>>> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:
>>> 2080)
>>>     at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.ja
>>> va:886)
>>>     at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9
>>> 08)
>>>     at java.lang.Thread.run(Thread.java:619)
>>> 
>>> Oct 1, 2009 8:40:06 AM org.apache.solr.core.SolrCore execute
>>> INFO: [zeta-main] webapp=/solr path=/update params={} status=500 QTime=5265
>>> Oct 1, 2009 8:40:12 AM org.apache.solr.common.SolrException log
>>> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> 
>>> --
>>> Jeff Newburn
>>> Software E

Re: A little help with indexing joined words

2009-10-05 Thread Christian Zambrano
Would you mind explaining how omitNorm has any effect on the IDF problem 
I described earlier?


I agree with your second sentence. I had to use the NGramTokenFilter to 
accommodate partial matches.


On 10/05/2009 12:11 PM, Avlesh Singh wrote:

Using synonyms might be a better solution because the use of
EdgeNGramTokenizerFactory has the potential of creating a large number of
token which will artificially increase the number of tokens in the index
which in turn will affect the IDF score.

 

Well, I don't see a reason as to why someone would need a length based
normalization on such matches. I always have done omitNorms while using
fields with this filter.

Yes, synonyms might an answer when you have limited number of such words
(phrases) and their possible combinations.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:32 PM, Christian Zambranowrote:

   

Using synonyms might be a better solution because the use of
EdgeNGramTokenizerFactory has the potential of creating a large number of
token which will artificially increase the number of tokens in the index
which in turn will affect the IDF score.

A query for "borderland" should have returned results though. It is
difficult to troubleshoot why it didn't without knowing what query you used,
and what kind of analysis is taking place.

Have you tried using the analysis page on the admin section to see what
tokens gets generated for 'Borderlands'?

Christian


On 10/05/2009 11:01 AM, Avlesh Singh wrote:

 

We have indexed a product database and have come across some search terms
   

where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.



 

"Borderland" should have worked for a regular text field. For all other
desired matches you can use EdgeNGramTokenizerFactory.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 7:51 PM, Andrew McCombe   wrote:



   

Hi
I am hoping someone can point me in the right direction with regards to
indexing words that are concatenated together to make other words or
product
names.

We have indexed a product database and have come across some search terms
where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.

Where do I look to resolve this?  The product name field is indexed using
a
text field type.

Thanks in advance
Andrew



 


   
 
   


Re: wildcard searches

2009-10-05 Thread Christian Zambrano

Avlesh, I don't understand your answer.

First of all, I know of no way of doing wildcard phrase queries.

When I said no filters, I meant TokenFilters, which I believe is what you 
mean by 'not analyzed'.


On 10/05/2009 12:27 PM, Avlesh Singh wrote:

No filters are applied to wildcard/fuzzy searches.

 

Ah! Not like that ..
I guess, it is just that the phrase searches using wildcards are not
analyzed.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:42 PM, Christian Zambranowrote:

   

No filters are applied to wildcard/fuzzy searches.

I couldn't find a reference to this on either the solr or lucene
documentation but I read it on the Solr book from PACKT


On 10/05/2009 12:09 PM, Angel Ice wrote:

 

Hi everyone,

I have a little question regarding the search engine when a wildcard
character is used in the query.
Let's take the following example :

- I have sent in indexation the word Hésitation (with an accent on the
"e")
- The filters applied to the field that will handle this word, result in
the indexation of "esit" (the mute H is suppressed (home made filter), the
accent too (IsoLatin1Filter), and the SnowballPorterFilter suppress the
"ation".

When i search for "hesitation", "esitation", "ésitation" etc ... all is
OK, the document is returned.
But as soon as I use a wildcard, like "hésita*", the document is not
returned. In fact, I have to put the wildcard in a manner that match the
indexed term exactly (example "esi*")

Does the search engine applies the filters to the word that prefix the
wildcard ? Or does it use this prefix verbatim ?

Thanks for you help.

Laurent

   
 
   


Re: Solr Trunk Heap Space Issues

2009-10-05 Thread Yonik Seeley
On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn  wrote:
> Ok I have eliminated all queries for warming and am still getting the heap
> space dump.  Any ideas at this point what could be wrong?  This seems like a
> huge increase in memory to go from indexing without issues to not being able
> to even with warming off.

Do you have any custom Analyzers, Tokenizers, TokenFilters?
Another change is that token streams are reused by caching in a
thread-local, so every thread in your server could potentially have a
copy of an analysis chain (token stream) per field that you have used.
 This normally shouldn't be an issue since these will be small.  Also,
how many unique fields do you have?
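For a rough sense of scale: with, say, 200 request threads and 100 analyzed
fields, at a few KB per cached token stream that is on the order of
200 * 100 * 4 KB, i.e. roughly 80 MB. Noticeable, but not normally enough to
exhaust the heap (thread and field counts here are only illustrative).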

-Yonik
http://www.lucidimagination.com



> Jeff Newburn
> Software Engineer, Zappos.com
> jnewb...@zappos.com - 702-943-7562
>
>
>> From: Jeff Newburn 
>> Reply-To: 
>> Date: Thu, 01 Oct 2009 08:41:18 -0700
>> To: "solr-user@lucene.apache.org" 
>> Subject: Solr Trunk Heap Space Issues
>>
>> I am trying to update to the newest version of solr from trunk as of May
>> 5th.  I updated and compiled from trunk as of yesterday (09/30/2009).  When
>> I try to do a full import I am receiving a GC heap error after changing
>> nothing in the configuration files.  Why would this happen in the most
>> recent versions but not in the version from a few months ago.  The stack
>> trace is below.
>>
>> Oct 1, 2009 8:34:32 AM org.apache.solr.update.processor.LogUpdateProcessor
>> finish
>> INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, 167353,
>> ...(83 more)]} 0 35991
>> Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log
>> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>     at java.util.Arrays.copyOfRange(Arrays.java:3209)
>>     at java.lang.String.(String.java:215)
>>     at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384)
>>     at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
>>     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280)
>>     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
>>     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
>>     at
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt
>> reamHandlerBase.java:54)
>>     at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
>> java:131)
>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>     at
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
>> 38)
>>     at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
>> 241)
>>     at
>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
>> FilterChain.java:235)
>>     at
>> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
>> ain.java:206)
>>     at
>> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
>> va:233)
>>     at
>> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
>> va:175)
>>     at
>> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128
>> )
>>     at
>> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102
>> )
>>     at
>> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java
>> :109)
>>     at
>> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>>     at
>> org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:
>> 879)
>>     at
>> org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(H
>> ttp11NioProtocol.java:719)
>>     at
>> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:
>> 2080)
>>     at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.ja
>> va:886)
>>     at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9
>> 08)
>>     at java.lang.Thread.run(Thread.java:619)
>>
>> Oct 1, 2009 8:40:06 AM org.apache.solr.core.SolrCore execute
>> INFO: [zeta-main] webapp=/solr path=/update params={} status=500 QTime=5265
>> Oct 1, 2009 8:40:12 AM org.apache.solr.common.SolrException log
>> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>> --
>> Jeff Newburn
>> Software Engineer, Zappos.com
>> jnewb...@zappos.com - 702-943-7562
>>
>
>


RE: Solr Timeouts

2009-10-05 Thread Giovanni Fernandez-Kincade
Thanks for your help!

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Monday, October 05, 2009 1:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Timeouts

OK... next step is to verify that SolrCell doesn't have a bug that
causes it to commit.
I'll try and verify today unless someone else beats me to it.

-Yonik
http://www.lucidimagination.com

On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade
 wrote:
> I'm fairly certain that all of the indexing jobs are calling SOLR with 
> commit=false. They all construct the indexing URLs using a CLR function I 
> wrote, which takes in a Commit parameter, which is always set to false.
>
> Also, I don't see any calls to commit in the Tomcat logs (whereas normally 
> when I make a commit call I do).
>
> This suggests that Solr is doing it automatically, but the extract handler 
> doesn't seem to be the problem:
>   class="org.apache.solr.handler.extraction.ExtractingRequestHandler" 
> startup="lazy">
>    
>      ignored_
>      fileData
>    
>  
>
>
> There is no external config file specified, and I don't see anything about 
> commits here.
>
> I've tried setting up more detailed indexer logging but haven't been able to 
> get it to work:
> true
>
> I tried relative and absolute paths, but no dice so far.
>
> Any other ideas?
>
> -Gio.
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Monday, October 05, 2009 12:52 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Timeouts
>
>> This is what one of my SOLR requests look like:
>>
>> http://titans:8080/solr/update/extract/?literal.versionId=684936&literal.filingDate=1997-12-04T00:00:00Z&literal.formTypeId=95&literal.companyId=3567904&literal.sourceId=0&resource.name=684936.txt&commit=false
>
> Have you verified that all of your indexing jobs (you said you had 4
> or 5) have commit=false?
>
> Also make sure that your extract handler doesn't have a default of
> something that could cause a commit - like commitWithin or something.
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> On Mon, Oct 5, 2009 at 12:44 PM, Giovanni Fernandez-Kincade
>  wrote:
>> Is there somewhere other than solrConfig.xml that the autoCommit feature is 
>> enabled? I've looked through that file and found autocommit to be commented 
>> out:
>>
>>
>>
>> 
>>
>>
>>
>
>>
>>
>>
>> -Original Message-
>> From: Feak, Todd [mailto:todd.f...@smss.sony.com]
>> Sent: Monday, October 05, 2009 12:40 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Solr Timeouts
>>
>>
>>
>> Actually, ignore my other response.
>>
>>
>>
>> I believe you are committing, whether you know it or not.
>>
>>
>>
>> This is in your provided stack trace
>>
>> org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
>>  SolrParams, boolean) 
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
>>  SolrQueryResponse)
>>
>>
>>
>> I think Yonik gave you additional information for how to make it faster.
>>
>>
>>
>> -Todd
>>
>>
>>
>> -Original Message-
>>
>> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
>>
>> Sent: Monday, October 05, 2009 9:30 AM
>>
>> To: solr-user@lucene.apache.org
>>
>> Subject: RE: Solr Timeouts
>>
>>
>>
>> I'm not committing at all actually - I'm waiting for all 6 million to be 
>> done.
>>
>>
>>
>> -Original Message-
>>
>> From: Feak, Todd [mailto:todd.f...@smss.sony.com]
>>
>> Sent: Monday, October 05, 2009 12:10 PM
>>
>> To: solr-user@lucene.apache.org
>>
>> Subject: RE: Solr Timeouts
>>
>>
>>
>> How often are you committing?
>>
>>
>>
>> Every time you commit, Solr will close the old index and open the new one. 
>> If you are doing this in parallel from multiple jobs (4-5 you mention) then 
>> eventually the server gets behind and you start to pile up commit requests. 
>> Once this starts to happen, it will cascade out of control if the rate of 
>> commits isn't slowed.
>>
>>
>>
>> -Todd
>>
>>
>>
>> 
>>
>> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
>>
>> Sent: Monday, October 05, 2009 9:04 AM
>>
>> To: solr-user@lucene.apache.org
>>
>> Subject: Solr Timeouts
>>
>>
>>
>> Hi,
>>
>> I'm attempting to index approximately 6 million HTML/Text files using SOLR 
>> 1.4/Tomcat6 on Windows Server 2003 x64. I'm running 64 bit Tomcat and JVM. 
>> I've fired up 4-5 different jobs that are making indexing requests using the 
>> ExtractionRequestHandler, and everything works well for about 30-40 minutes, 
>> after which all indexing requests start timing out. I profiled the server 
>> and found that all of the threads are getting blocked by this call to flush 
>> the Lucene index to disk (see below).
>>
>>
>>
>> This leads me to a few questions:
>>
>>
>>
>> 1.       Is this normal?
>>
>>
>>
>> 2.       Can I reduce the frequency with which this happens some

Re: wildcard searches

2009-10-05 Thread Avlesh Singh
>
> No filters are applied to wildcard/fuzzy searches.
>
Ah! Not like that ..
I guess, it is just that the phrase searches using wildcards are not
analyzed.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:42 PM, Christian Zambrano wrote:

> No filters are applied to wildcard/fuzzy searches.
>
> I couldn't find a reference to this on either the solr or lucene
> documentation but I read it on the Solr book from PACKT
>
>
> On 10/05/2009 12:09 PM, Angel Ice wrote:
>
>> Hi everyone,
>>
>> I have a little question regarding the search engine when a wildcard
>> character is used in the query.
>> Let's take the following example :
>>
>> - I have sent in indexation the word Hésitation (with an accent on the
>> "e")
>> - The filters applied to the field that will handle this word, result in
>> the indexation of "esit" (the mute H is suppressed (home made filter), the
>> accent too (IsoLatin1Filter), and the SnowballPorterFilter suppress the
>> "ation".
>>
>> When i search for "hesitation", "esitation", "ésitation" etc ... all is
>> OK, the document is returned.
>> But as soon as I use a wildcard, like "hésita*", the document is not
>> returned. In fact, I have to put the wildcard in a manner that match the
>> indexed term exactly (example "esi*")
>>
>> Does the search engine applies the filters to the word that prefix the
>> wildcard ? Or does it use this prefix verbatim ?
>>
>> Thanks for you help.
>>
>> Laurent
>>
>


Re: Solr Trunk Heap Space Issues

2009-10-05 Thread Mark Miller
Jeff Newburn wrote:
> Ok I have eliminated all queries for warming and am still getting the heap
> space dump.  Any ideas at this point what could be wrong?  This seems like a
> huge increase in memory to go from indexing without issues to not being able
> to even with warming off.
>   
How about a heap dump without those warming queries? Even if you subtract the
search side of your last dump, it still doesn't make much sense ...

-- 
- Mark

http://www.lucidimagination.com





Re: Solr Timeouts

2009-10-05 Thread Yonik Seeley
OK... next step is to verify that SolrCell doesn't have a bug that
causes it to commit.
I'll try and verify today unless someone else beats me to it.

-Yonik
http://www.lucidimagination.com

On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade
 wrote:
> I'm fairly certain that all of the indexing jobs are calling SOLR with 
> commit=false. They all construct the indexing URLs using a CLR function I 
> wrote, which takes in a Commit parameter, which is always set to false.
>
> Also, I don't see any calls to commit in the Tomcat logs (whereas normally 
> when I make a commit call I do).
>
> This suggests that Solr is doing it automatically, but the extract handler 
> doesn't seem to be the problem:
>   class="org.apache.solr.handler.extraction.ExtractingRequestHandler" 
> startup="lazy">
>    
>      ignored_
>      fileData
>    
>  
>
>
> There is no external config file specified, and I don't see anything about 
> commits here.
>
> I've tried setting up more detailed indexer logging but haven't been able to 
> get it to work:
> true
>
> I tried relative and absolute paths, but no dice so far.
>
> Any other ideas?
>
> -Gio.
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Monday, October 05, 2009 12:52 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Timeouts
>
>> This is what one of my SOLR requests look like:
>>
>> http://titans:8080/solr/update/extract/?literal.versionId=684936&literal.filingDate=1997-12-04T00:00:00Z&literal.formTypeId=95&literal.companyId=3567904&literal.sourceId=0&resource.name=684936.txt&commit=false
>
> Have you verified that all of your indexing jobs (you said you had 4
> or 5) have commit=false?
>
> Also make sure that your extract handler doesn't have a default of
> something that could cause a commit - like commitWithin or something.
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> On Mon, Oct 5, 2009 at 12:44 PM, Giovanni Fernandez-Kincade
>  wrote:
>> Is there somewhere other than solrConfig.xml that the autoCommit feature is 
>> enabled? I've looked through that file and found autocommit to be commented 
>> out:
>>
>>
>>
>> 
>>
>>
>>
>
>>
>>
>>
>> -Original Message-
>> From: Feak, Todd [mailto:todd.f...@smss.sony.com]
>> Sent: Monday, October 05, 2009 12:40 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Solr Timeouts
>>
>>
>>
>> Actually, ignore my other response.
>>
>>
>>
>> I believe you are committing, whether you know it or not.
>>
>>
>>
>> This is in your provided stack trace
>>
>> org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
>>  SolrParams, boolean) 
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
>>  SolrQueryResponse)
>>
>>
>>
>> I think Yonik gave you additional information for how to make it faster.
>>
>>
>>
>> -Todd
>>
>>
>>
>> -Original Message-
>>
>> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
>>
>> Sent: Monday, October 05, 2009 9:30 AM
>>
>> To: solr-user@lucene.apache.org
>>
>> Subject: RE: Solr Timeouts
>>
>>
>>
>> I'm not committing at all actually - I'm waiting for all 6 million to be 
>> done.
>>
>>
>>
>> -Original Message-
>>
>> From: Feak, Todd [mailto:todd.f...@smss.sony.com]
>>
>> Sent: Monday, October 05, 2009 12:10 PM
>>
>> To: solr-user@lucene.apache.org
>>
>> Subject: RE: Solr Timeouts
>>
>>
>>
>> How often are you committing?
>>
>>
>>
>> Every time you commit, Solr will close the old index and open the new one. 
>> If you are doing this in parallel from multiple jobs (4-5 you mention) then 
>> eventually the server gets behind and you start to pile up commit requests. 
>> Once this starts to happen, it will cascade out of control if the rate of 
>> commits isn't slowed.
>>
>>
>>
>> -Todd
>>
>>
>>
>> 
>>
>> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
>>
>> Sent: Monday, October 05, 2009 9:04 AM
>>
>> To: solr-user@lucene.apache.org
>>
>> Subject: Solr Timeouts
>>
>>
>>
>> Hi,
>>
>> I'm attempting to index approximately 6 million HTML/Text files using SOLR 
>> 1.4/Tomcat6 on Windows Server 2003 x64. I'm running 64 bit Tomcat and JVM. 
>> I've fired up 4-5 different jobs that are making indexing requests using the 
>> ExtractionRequestHandler, and everything works well for about 30-40 minutes, 
>> after which all indexing requests start timing out. I profiled the server 
>> and found that all of the threads are getting blocked by this call to flush 
>> the Lucene index to disk (see below).
>>
>>
>>
>> This leads me to a few questions:
>>
>>
>>
>> 1.       Is this normal?
>>
>>
>>
>> 2.       Can I reduce the frequency with which this happens somehow? I've 
>> greatly increased the indexing options in SolrConfig.xml (attached here) to 
>> no avail.
>>
>>
>>
>> 3.       During these flushes, resource utilization (CPU, I/O, Memory 
>> Consumption) is significantly down c

Re: AW: AW: Concept Expansion

2009-10-05 Thread gdeconto

I had a similar question in my post:
http://www.nabble.com/forum/ViewPost.jtp?post=25752898&framed=y

since queries can be quite complex, how would we parse the q string so that
we could identify and expand specific terms (ie is there an existing method)
in a custom QParserPlugin?



polx wrote:
> 
> 
> Le 05-sept.-09 à 23:26, Villemos, Gert a écrit :
> 
>> - As part of the construction the plugin parses the q string and  
>> extracts the parameters, ading them as TermQuery(s) to the parser
>  
> 

-- 
View this message in context: 
http://www.nabble.com/TermsComponent-tp25302503p25754730.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: about modifying querystring parameters submitted to solr

2009-10-05 Thread gdeconto

Note that there is a similar question in 
http://www.nabble.com/TermsComponent-to25302503.html#a25312549


-- 
View this message in context: 
http://www.nabble.com/about-modifying-querystring-parameters-submitted-to-solr-tp25752898p25754727.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: wildcard searches

2009-10-05 Thread Christian Zambrano

No filters are applied to wildcard/fuzzy searches.

I couldn't find a reference to this in either the Solr or Lucene 
documentation, but I read it in the Solr book from Packt.
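
A common workaround, just as a sketch (field and type names below are made
up), is to copy the text into a second field whose index-time analysis only
lowercases and strips accents, with no stemming, and to run wildcard queries
against that field:

  <fieldType name="text_prefix" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="title_prefix" type="text_prefix" indexed="true" stored="false"/>
  <copyField source="title" dest="title_prefix"/>

Because the wildcard term itself is not analyzed, the client still has to
apply the same normalization before sending it, e.g. title_prefix:hesita*
rather than title_prefix:Hésita*.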


On 10/05/2009 12:09 PM, Angel Ice wrote:

Hi everyone,

I have a little question regarding the search engine when a wildcard character 
is used in the query.
Let's take the following example :

- I have sent in indexation the word Hésitation (with an accent on the "e")
- The filters applied to the field that will handle this word, result in the indexation of 
"esit" (the mute H is suppressed (home made filter), the accent too (IsoLatin1Filter), 
and the SnowballPorterFilter suppress the "ation".

When i search for "hesitation", "esitation", "ésitation" etc ... all is OK, the 
document is returned.
But as soon as I use a wildcard, like "hésita*", the document is not returned. In fact, I 
have to put the wildcard in a manner that match the indexed term exactly (example "esi*")

Does the search engine applies the filters to the word that prefix the wildcard 
? Or does it use this prefix verbatim ?

Thanks for you help.

Laurent




   


Re: Question regarding synonym

2009-10-05 Thread Christian Zambrano

You are correct.

I would recommend only using the synonym TokenFilter at index time, 
unless you have a very good reason to apply it at query time.
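
For concreteness, a minimal index-time setup would look something like this
(file name and field type name are illustrative; the mapping line is the one
discussed below):

  # synonyms.txt
  bayrische motoren werke, bmw

  <fieldType name="text" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With expand="true" applied at index time, a document containing just "bmw"
also gets "bayrische", "motoren" and "werke" indexed, so text:motoren will
match it.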


On 10/05/2009 11:46 AM, darniz wrote:

yes that's what we decided to expand these terms while indexing.
if we have
bayrische motoren werke =>  bmw

and i have a document which has bmw in it, searching for text:bayrische does
not give me results. i have to give
text:"bayrische motoren werke" then it actually takes the synonym and gets
me the document.

Now if i change the synonym mapping to
bayrische motoren werke , bmw with expand parameter to true and also use
this file at indexing.

now at the  time i index this document along with "bmw" i also index the
following words "bayrische" "motoren" "werke"

any text query like text:motoren or text:bayrische will give me results now.

Please correct me if my assumption is wrong.

Thanks
darniz









Christian Zambrano wrote:
   



On 10/02/2009 06:02 PM, darniz wrote:
 

Thanks
As i said it even works by giving double quotes too.
like carDescription:"austin martin"

So is that the conclusion that in order to map two word synonym i have to
always enclose in double quotes, so that it doen not split the words




   

Yes, but there are things you need to keep in mind.

  From the solr wiki:

Keep in mind that while the SynonymFilter will happily work with
*synonyms* containing multiple words (ie:
"sea biscuit, sea biscit, seabiscuit") The recommended approach for
dealing with *synonyms* like this, is to expand the synonym when
indexing. This is because there are two potential issues that can arrise
at query time:

1.

   The Lucene QueryParser tokenizes on white space before giving any
   text to the Analyzer, so if a person searches for the words
   sea biscit the analyzer will be given the words "sea" and "biscit"
   seperately, and will not know that they match a synonym.

2.

   Phrase searching (ie: "sea biscit") will cause the QueryParser to
   pass the entire string to the analyzer, but if the SynonymFilter
   is configured to expand the *synonyms*, then when the QueryParser
   gets the resulting list of tokens back from the Analyzer, it will
   construct a MultiPhraseQuery that will not have the desired
   effect. This is because of the limited mechanism available for the
   Analyzer to indicate that two terms occupy the same position:
   there is no way to indicate that a "phrase" occupies the same
   position as a term. For our example the resulting MultiPhraseQuery
   would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would
   not match the simple case of "seabisuit" occuring in a document


 







Christian Zambrano wrote:

   

When you use a field qualifier(fieldName:valueToLookFor) it only applies
to the word right after the semicolon. If you look at the debug
infomation you will notice that for the second word it is using the
default field.

carDescription:austin
*text*:martin

the following should word:

carDescription:(austin martin)


On 10/02/2009 05:46 PM, darniz wrote:

 

This is not working when i search documents i have a document which
contains
text aston martin

when i search carDescription:"austin martin" i get a match but when i
dont
give double quotes

like carDescription:austin martin
there is no match

in the analyser if i give austin martin with out quotes, when it passes
through synonym filter it matches aston martin ,
may be by default analyser treats it as a phrase "austin martin" but
when
i
try to do a query by typing
carDescription:austin martin i get 0 documents. the following is the
debug
node info with debugQuery=on

carDescription:austin martin
carDescription:austin martin
carDescription:austin text:martin
carDescription:austin
text:martin

dont know why it breaks the word, may be its a desired behaviour
when i give carDescription:"austin martin" of course in this its able
to
map
to synonym and i get the desired result

Any opinion

darniz



Ensdorf Ken wrote:


   


 

Hi
i have a question regarding synonymfilter
i have a one way mapping defined
austin martin, astonmartin =>aston martin



   

...


 

Can anybody please explain if my observation is correct. This is a
very
critical aspect for my work.


   

That is correct - the synonym filter can recognize multi-token
synonyms
from consecutive tokens in a stream.





 


   


 


   


 
   


Re: A little help with indexing joined words

2009-10-05 Thread Avlesh Singh
>
> Using synonyms might be a better solution because the use of
> EdgeNGramTokenizerFactory has the potential of creating a large number of
> token which will artificially increase the number of tokens in the index
> which in turn will affect the IDF score.
>
Well, I don't see why anyone would need length-based normalization on such
matches. I have always set omitNorms on fields that use this filter.
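
For illustration, such a field might be declared roughly like this (names
and gram sizes are made up, and this sketch uses the filter variant,
EdgeNGramFilterFactory, behind a whitespace tokenizer):

  <fieldType name="text_edge" class="solr.TextField" omitNorms="true">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With that, "Borderlands" is indexed as bor, bord, ... borderlands, so a query
for "borderland" matches as a prefix.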

Yes, synonyms might be an answer when you have a limited number of such words
(phrases) and their possible combinations.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:32 PM, Christian Zambrano wrote:

> Using synonyms might be a better solution because the use of
> EdgeNGramTokenizerFactory has the potential of creating a large number of
> token which will artificially increase the number of tokens in the index
> which in turn will affect the IDF score.
>
> A query for "borderland" should have returned results though. It is
> difficult to troubleshoot why it didn't without knowing what query you used,
> and what kind of analysis is taking place.
>
> Have you tried using the analysis page on the admin section to see what
> tokens gets generated for 'Borderlands'?
>
> Christian
>
>
> On 10/05/2009 11:01 AM, Avlesh Singh wrote:
>
>> We have indexed a product database and have come across some search terms
>>> where zero results are returned.  There are products in the index with
>>> 'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
>>> 'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
>>> respectively.
>>>
>>>
>>>
>> "Borderland" should have worked for a regular text field. For all other
>> desired matches you can use EdgeNGramTokenizerFactory.
>>
>> Cheers
>> Avlesh
>>
>> On Mon, Oct 5, 2009 at 7:51 PM, Andrew McCombe  wrote:
>>
>>
>>
>>> Hi
>>> I am hoping someone can point me in the right direction with regards to
>>> indexing words that are concatenated together to make other words or
>>> product
>>> names.
>>>
>>> We have indexed a product database and have come across some search terms
>>> where zero results are returned.  There are products in the index with
>>> 'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
>>> 'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
>>> respectively.
>>>
>>> Where do I look to resolve this?  The product name field is indexed using
>>> a
>>> text field type.
>>>
>>> Thanks in advance
>>> Andrew
>>>
>>>
>>>
>>
>>
>


wildcard searches

2009-10-05 Thread Angel Ice
Hi everyone,

I have a little question regarding the search engine when a wildcard character 
is used in the query.
Let's take the following example :

- I have sent the word Hésitation (with an accent on the "e") for indexing.
- The filters applied to the field that handles this word result in the 
indexation of "esit": the mute H is suppressed (home-made filter), the accent 
too (IsoLatin1Filter), and the SnowballPorterFilter suppresses the "ation".
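
For reference, the index-time chain is roughly of this shape (the H-removal
filter is our home-made one, and the class names below are only illustrative):

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.example.MuteHRemovalFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>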

When I search for "hesitation", "esitation", "ésitation" etc., all is OK and 
the document is returned.
But as soon as I use a wildcard, like "hésita*", the document is not returned. 
In fact, I have to put the wildcard in a manner that matches the indexed term 
exactly (for example "esi*").

Does the search engine apply the filters to the word that prefixes the 
wildcard? Or does it use this prefix verbatim?

Thanks for your help.

Laurent



  

RE: Solr Timeouts

2009-10-05 Thread Giovanni Fernandez-Kincade
I'm fairly certain that all of the indexing jobs are calling SOLR with 
commit=false. They all construct the indexing URLs using a CLR function I 
wrote, which takes in a Commit parameter, which is always set to false.

Also, I don't see any calls to commit in the Tomcat logs (whereas normally when 
I make a commit call I do). 

This suggests that Solr is doing it automatically, but the extract handler 
doesn't seem to be the problem:
  

  ignored_
  fileData

  


There is no external config file specified, and I don't see anything about 
commits here. 

I've tried setting up more detailed indexer logging but haven't been able to 
get it to work:
true

I tried relative and absolute paths, but no dice so far. 

Any other ideas?

-Gio.

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Monday, October 05, 2009 12:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Timeouts

> This is what one of my SOLR requests look like:
>
> http://titans:8080/solr/update/extract/?literal.versionId=684936&literal.filingDate=1997-12-04T00:00:00Z&literal.formTypeId=95&literal.companyId=3567904&literal.sourceId=0&resource.name=684936.txt&commit=false

Have you verified that all of your indexing jobs (you said you had 4
or 5) have commit=false?

Also make sure that your extract handler doesn't have a default of
something that could cause a commit - like commitWithin or something.

-Yonik
http://www.lucidimagination.com



On Mon, Oct 5, 2009 at 12:44 PM, Giovanni Fernandez-Kincade
 wrote:
> Is there somewhere other than solrConfig.xml that the autoCommit feature is 
> enabled? I've looked through that file and found autocommit to be commented 
> out:
>
>
>
> 
>
>
>

>
>
>
> -Original Message-
> From: Feak, Todd [mailto:todd.f...@smss.sony.com]
> Sent: Monday, October 05, 2009 12:40 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr Timeouts
>
>
>
> Actually, ignore my other response.
>
>
>
> I believe you are committing, whether you know it or not.
>
>
>
> This is in your provided stack trace
>
> org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
>  SolrParams, boolean) 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
>  SolrQueryResponse)
>
>
>
> I think Yonik gave you additional information for how to make it faster.
>
>
>
> -Todd
>
>
>
> -Original Message-
>
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
>
> Sent: Monday, October 05, 2009 9:30 AM
>
> To: solr-user@lucene.apache.org
>
> Subject: RE: Solr Timeouts
>
>
>
> I'm not committing at all actually - I'm waiting for all 6 million to be done.
>
>
>
> -Original Message-
>
> From: Feak, Todd [mailto:todd.f...@smss.sony.com]
>
> Sent: Monday, October 05, 2009 12:10 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: RE: Solr Timeouts
>
>
>
> How often are you committing?
>
>
>
> Every time you commit, Solr will close the old index and open the new one. If 
> you are doing this in parallel from multiple jobs (4-5 you mention) then 
> eventually the server gets behind and you start to pile up commit requests. 
> Once this starts to happen, it will cascade out of control if the rate of 
> commits isn't slowed.
>
>
>
> -Todd
>
>
>
> 
>
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
>
> Sent: Monday, October 05, 2009 9:04 AM
>
> To: solr-user@lucene.apache.org
>
> Subject: Solr Timeouts
>
>
>
> Hi,
>
> I'm attempting to index approximately 6 million HTML/Text files using SOLR 
> 1.4/Tomcat6 on Windows Server 2003 x64. I'm running 64 bit Tomcat and JVM. 
> I've fired up 4-5 different jobs that are making indexing requests using the 
> ExtractionRequestHandler, and everything works well for about 30-40 minutes, 
> after which all indexing requests start timing out. I profiled the server and 
> found that all of the threads are getting blocked by this call to flush the 
> Lucene index to disk (see below).
>
>
>
> This leads me to a few questions:
>
>
>
> 1.       Is this normal?
>
>
>
> 2.       Can I reduce the frequency with which this happens somehow? I've 
> greatly increased the indexing options in SolrConfig.xml (attached here) to 
> no avail.
>
>
>
> 3.       During these flushes, resource utilization (CPU, I/O, Memory 
> Consumption) is significantly down compared to when requests are being 
> handled. Is there any way to make this index go faster? I have plenty of 
> bandwidth on the machine.
>
>
>
> I appreciate any insight you can provide. We're currently using MS SQL 2005 
> as our full-text solution and are pretty much miserable. So far SOLR has been 
> a great experience.
>
>
>
> Thanks,
>
> Gio.
>
>
>
> http-8080-Processor21 [RUNNABLE] CPU time: 9:51
>
> java.io.RandomAccessFile.seek(long)
>
> org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(byte[],
>  int, int)
>
> org.ap

Re: A little help with indexing joined words

2009-10-05 Thread Christian Zambrano
Using synonyms might be a better solution because the use of 
EdgeNGramTokenizerFactory has the potential of creating a large number 
of tokens, which will artificially increase the number of tokens in the 
index and in turn affect the IDF score.


A query for "borderland" should have returned results though. It is 
difficult to troubleshoot why it didn't without knowing what query you 
used, and what kind of analysis is taking place.


Have you tried using the analysis page in the admin section to see what 
tokens get generated for 'Borderlands'?
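
In a default install that page usually lives at something like
http://localhost:8983/solr/admin/analysis.jsp (host and port will differ);
you can paste the field value and the query there and compare the tokens
produced on each side.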


Christian

On 10/05/2009 11:01 AM, Avlesh Singh wrote:

We have indexed a product database and have come across some search terms
where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.

 

"Borderland" should have worked for a regular text field. For all other
desired matches you can use EdgeNGramTokenizerFactory.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 7:51 PM, Andrew McCombe  wrote:

   

Hi
I am hoping someone can point me in the right direction with regards to
indexing words that are concatenated together to make other words or
product
names.

We have indexed a product database and have come across some search terms
where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.

Where do I look to resolve this?  The product name field is indexed using a
text field type.

Thanks in advance
Andrew

 
   


Re: Solr Trunk Heap Space Issues

2009-10-05 Thread Jeff Newburn
OK, I have eliminated all queries for warming and am still getting the heap
space dump.  Any ideas at this point what could be wrong?  This seems like a
huge increase in memory, to go from indexing without issues to not being able
to index at all, even with warming off.
-- 
Jeff Newburn
Software Engineer, Zappos.com
jnewb...@zappos.com - 702-943-7562


> From: Jeff Newburn 
> Reply-To: 
> Date: Thu, 01 Oct 2009 08:41:18 -0700
> To: "solr-user@lucene.apache.org" 
> Subject: Solr Trunk Heap Space Issues
> 
> I am trying to update to the newest version of solr from trunk as of May
> 5th.  I updated and compiled from trunk as of yesterday (09/30/2009).  When
> I try to do a full import I am receiving a GC heap error after changing
> nothing in the configuration files.  Why would this happen in the most
> recent versions but not in the version from a few months ago.  The stack
> trace is below.
> 
> Oct 1, 2009 8:34:32 AM org.apache.solr.update.processor.LogUpdateProcessor
> finish
> INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, 167353,
> ...(83 more)]} 0 35991
> Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> at java.util.Arrays.copyOfRange(Arrays.java:3209)
> at java.lang.String.(String.java:215)
> at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384)
> at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
> at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280)
> at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
> at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt
> reamHandlerBase.java:54)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
> java:131)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
> 38)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
> 241)
> at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
> FilterChain.java:235)
> at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
> ain.java:206)
> at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
> va:233)
> at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
> va:175)
> at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128
> )
> at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102
> )
> at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java
> :109)
> at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
> at 
> org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:
> 879)
> at 
> org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(H
> ttp11NioProtocol.java:719)
> at 
> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:
> 2080)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.ja
> va:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9
> 08)
> at java.lang.Thread.run(Thread.java:619)
> 
> Oct 1, 2009 8:40:06 AM org.apache.solr.core.SolrCore execute
> INFO: [zeta-main] webapp=/solr path=/update params={} status=500 QTime=5265
> Oct 1, 2009 8:40:12 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> 
> -- 
> Jeff Newburn
> Software Engineer, Zappos.com
> jnewb...@zappos.com - 702-943-7562
> 



RE: Solr Timeouts

2009-10-05 Thread Walter Underwood
How long is your timeout? Maybe it should be longer, since this is normal
Solr behavior. --wunder

-Original Message-
From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] 
Sent: Monday, October 05, 2009 9:45 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Timeouts

Is there somewhere other than solrConfig.xml that the autoCommit feature is
enabled? I've looked through that file and found autocommit to be commented
out:







This is what one of my SOLR requests look like:



http://titans:8080/solr/update/extract/?literal.versionId=684936&literal.fil
ingDate=1997-12-04T00:00:00Z&literal.formTypeId=95&literal.companyId=3567904
&literal.sourceId=0&resource.name=684936.txt&commit=false



-Original Message-
From: Feak, Todd [mailto:todd.f...@smss.sony.com]
Sent: Monday, October 05, 2009 12:40 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Timeouts



Actually, ignore my other response.



I believe you are committing, whether you know it or not.



This is in your provided stack trace

org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProces
sor, SolrParams, boolean)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQuery
Request, SolrQueryResponse)



I think Yonik gave you additional information for how to make it faster.



-Todd



-Original Message-

From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]

Sent: Monday, October 05, 2009 9:30 AM

To: solr-user@lucene.apache.org

Subject: RE: Solr Timeouts



I'm not committing at all actually - I'm waiting for all 6 million to be
done.



-Original Message-

From: Feak, Todd [mailto:todd.f...@smss.sony.com]

Sent: Monday, October 05, 2009 12:10 PM

To: solr-user@lucene.apache.org

Subject: RE: Solr Timeouts



How often are you committing?



Every time you commit, Solr will close the old index and open the new one.
If you are doing this in parallel from multiple jobs (4-5 you mention) then
eventually the server gets behind and you start to pile up commit requests.
Once this starts to happen, it will cascade out of control if the rate of
commits isn't slowed.



-Todd





From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]

Sent: Monday, October 05, 2009 9:04 AM

To: solr-user@lucene.apache.org

Subject: Solr Timeouts



Hi,

I'm attempting to index approximately 6 million HTML/Text files using SOLR
1.4/Tomcat6 on Windows Server 2003 x64. I'm running 64 bit Tomcat and JVM.
I've fired up 4-5 different jobs that are making indexing requests using the
ExtractionRequestHandler, and everything works well for about 30-40 minutes,
after which all indexing requests start timing out. I profiled the server
and found that all of the threads are getting blocked by this call to flush
the Lucene index to disk (see below).



This leads me to a few questions:



1.   Is this normal?



2.   Can I reduce the frequency with which this happens somehow? I've
greatly increased the indexing options in SolrConfig.xml (attached here) to
no avail.



3.   During these flushes, resource utilization (CPU, I/O, Memory
Consumption) is significantly down compared to when requests are being
handled. Is there any way to make this index go faster? I have plenty of
bandwidth on the machine.



I appreciate any insight you can provide. We're currently using MS SQL 2005
as our full-text solution and are pretty much miserable. So far SOLR has
been a great experience.



Thanks,

Gio.



http-8080-Processor21 [RUNNABLE] CPU time: 9:51

java.io.RandomAccessFile.seek(long)

org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(by
te[], int, int)

org.apache.lucene.store.BufferedIndexInput.refill()

org.apache.lucene.store.BufferedIndexInput.readByte()

org.apache.lucene.store.IndexInput.readVInt()

org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)

org.apache.lucene.index.SegmentTermEnum.next()

org.apache.lucene.index.SegmentTermEnum.scanTo(Term)

org.apache.lucene.index.TermInfosReader.get(Term, boolean)

org.apache.lucene.index.TermInfosReader.get(Term)

org.apache.lucene.index.SegmentTermDocs.seek(Term)

org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)

org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)

org.apache.lucene.index.IndexWriter.applyDeletes()

org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)

org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)

org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)

org.apache.lucene.index.IndexWriter.closeInternal(boolean)

org.apache.lucene.index.IndexWriter.close(boolean)

org.apache.lucene.index.IndexWriter.close()

org.apache.solr.update.SolrIndexWriter.close()

org.apache.solr.update.DirectUpdateHandler2.closeWriter()

org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)

org.apache.solr.update.processor.RunUpdateP

Re: Solr Timeouts

2009-10-05 Thread Yonik Seeley
> This is what one of my SOLR requests look like:
>
> http://titans:8080/solr/update/extract/?literal.versionId=684936&literal.filingDate=1997-12-04T00:00:00Z&literal.formTypeId=95&literal.companyId=3567904&literal.sourceId=0&resource.name=684936.txt&commit=false

Have you verified that all of your indexing jobs (you said you had 4
or 5) have commit=false?

Also make sure that your extract handler doesn't have a default of
something that could cause a commit - like commitWithin or something.
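
For reference, the kind of thing to look for in solrconfig.xml is an
uncommented block along these lines (the numbers are only examples):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>60000</maxTime> <!-- milliseconds -->
    </autoCommit>
  </updateHandler>

If nothing like that is enabled and every request really says commit=false,
the commits have to be coming from somewhere else.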

-Yonik
http://www.lucidimagination.com



On Mon, Oct 5, 2009 at 12:44 PM, Giovanni Fernandez-Kincade
 wrote:
> Is there somewhere other than solrConfig.xml that the autoCommit feature is 
> enabled? I've looked through that file and found autocommit to be commented 
> out:
>
>
>
> 
>
>
>

>
>
>
> -Original Message-
> From: Feak, Todd [mailto:todd.f...@smss.sony.com]
> Sent: Monday, October 05, 2009 12:40 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr Timeouts
>
>
>
> Actually, ignore my other response.
>
>
>
> I believe you are committing, whether you know it or not.
>
>
>
> This is in your provided stack trace
>
> org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
>  SolrParams, boolean) 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
>  SolrQueryResponse)
>
>
>
> I think Yonik gave you additional information for how to make it faster.
>
>
>
> -Todd
>
>
>
> -Original Message-
>
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
>
> Sent: Monday, October 05, 2009 9:30 AM
>
> To: solr-user@lucene.apache.org
>
> Subject: RE: Solr Timeouts
>
>
>
> I'm not committing at all actually - I'm waiting for all 6 million to be done.
>
>
>
> -Original Message-
>
> From: Feak, Todd [mailto:todd.f...@smss.sony.com]
>
> Sent: Monday, October 05, 2009 12:10 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: RE: Solr Timeouts
>
>
>
> How often are you committing?
>
>
>
> Every time you commit, Solr will close the old index and open the new one. If 
> you are doing this in parallel from multiple jobs (4-5 you mention) then 
> eventually the server gets behind and you start to pile up commit requests. 
> Once this starts to happen, it will cascade out of control if the rate of 
> commits isn't slowed.
>
>
>
> -Todd
>
>
>
> 
>
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
>
> Sent: Monday, October 05, 2009 9:04 AM
>
> To: solr-user@lucene.apache.org
>
> Subject: Solr Timeouts
>
>
>
> Hi,
>
> I'm attempting to index approximately 6 million HTML/Text files using SOLR 
> 1.4/Tomcat6 on Windows Server 2003 x64. I'm running 64 bit Tomcat and JVM. 
> I've fired up 4-5 different jobs that are making indexing requests using the 
> ExtractionRequestHandler, and everything works well for about 30-40 minutes, 
> after which all indexing requests start timing out. I profiled the server and 
> found that all of the threads are getting blocked by this call to flush the 
> Lucene index to disk (see below).
>
>
>
> This leads me to a few questions:
>
>
>
> 1.       Is this normal?
>
>
>
> 2.       Can I reduce the frequency with which this happens somehow? I've 
> greatly increased the indexing options in SolrConfig.xml (attached here) to 
> no avail.
>
>
>
> 3.       During these flushes, resource utilization (CPU, I/O, Memory 
> Consumption) is significantly down compared to when requests are being 
> handled. Is there any way to make this index go faster? I have plenty of 
> bandwidth on the machine.
>
>
>
> I appreciate any insight you can provide. We're currently using MS SQL 2005 
> as our full-text solution and are pretty much miserable. So far SOLR has been 
> a great experience.
>
>
>
> Thanks,
>
> Gio.
>
>
>
> http-8080-Processor21 [RUNNABLE] CPU time: 9:51
>
> java.io.RandomAccessFile.seek(long)
>
> org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(byte[],
>  int, int)
>
> org.apache.lucene.store.BufferedIndexInput.refill()
>
> org.apache.lucene.store.BufferedIndexInput.readByte()
>
> org.apache.lucene.store.IndexInput.readVInt()
>
> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
>
> org.apache.lucene.index.SegmentTermEnum.next()
>
> org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
>
> org.apache.lucene.index.TermInfosReader.get(Term, boolean)
>
> org.apache.lucene.index.TermInfosReader.get(Term)
>
> org.apache.lucene.index.SegmentTermDocs.seek(Term)
>
> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
>
> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
>
> org.apache.lucene.index.IndexWriter.applyDeletes()
>
> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
>
> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
>
> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
>
> org.apache.lucene.index.IndexWriter.closeIntern

Re: Solr Porting to .Net

2009-10-05 Thread Antonio Calò
Hi Mauricio, thanks for your feedback.

I suppose we will move to a mixed solution: Solr on Tomcat and a .Net client
(maybe SolrNet).

But Solr on IKVM could be interesting. If I have time I'll try it and I'll
let you know if it works out.

Antonio

2009/9/30 Mauricio Scheffer 

> Solr is a server that runs on Java and it exposes a http interface.SolrNet
> is a client library for .Net that connects to a Solr instance via its http
> interface.
> My experiment (let's call it SolrIKVM) is an attempt to run Solr on .Net.
>
> Hope that clear things up.
>
> On Wed, Sep 30, 2009 at 11:50 AM, Antonio Calò 
> wrote:
>
> > I guys, thanks for your prompt feedback.
> >
> >
> > So, you are saying that SolrNet is just a wrapper written in C#, that
> > connnect the Solr (still written in Java that run on the IKVM) ?
> >
> > Is my understanding correct?
> >
> > Regards
> >
> > Antonio
> >
> > 2009/9/30 Mauricio Scheffer 
> >
> > > SolrNet is only a http client to Solr.
> > > I've been experimenting with IKVM but wasn't very successful... There
> > seem
> > > to be some issues with class loading, but unfortunately I don't have
> much
> > > time to continue these experiments right now. In case you're interested
> > in
> > > continuing this, here's the repository:
> > > http://code.google.com/p/mausch/source/browse/trunk/SolrIKVM
> > >
> > > Also recently someone registered a project on google code with the same
> > > intentions, but no commits yet: http://code.google.com/p/solrwin/
> > >
> > > Cheers,
> > > Mauricio
> > >
> > > On Wed, Sep 30, 2009 at 7:09 AM, Pravin Paratey 
> > wrote:
> > >
> > > > You may want to check out - http://code.google.com/p/solrnet/
> > > >
> > > > 2009/9/30 Antonio Calò :
> > > > > Hi All
> > > > >
> > > > > I'm wondering if is already available a Solr version for .Net or if
> > it
> > > is
> > > > > still under development/planning. I've searched on Solr website but
> > > I've
> > > > > found only info on Lucene .Net project.
> > > > >
> > > > > Best Regards
> > > > >
> > > > > Antonio
> > > > >
> > > > > --
> > > > > Antonio Calò
> > > > > --
> > > > > Software Developer Engineer
> > > > > @ Intellisemantic
> > > > > Mail anton.c...@gmail.com
> > > > > Tel. 011-56.90.429
> > > > > --
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Antonio Calò
> > --
> > Software Developer Engineer
> > @ Intellisemantic
> > Mail anton.c...@gmail.com
> > Tel. 011-56.90.429
> > --
> >
>



-- 
Antonio Calò
--
Software Developer Engineer
@ Intellisemantic
Mail anton.c...@gmail.com
Tel. 011-56.90.429
--


Highlighting exact phrases with solr

2009-10-05 Thread Antonio Calò
Hi Guys

I'm going crazy with the highlighting in Solr. The problem is the following:
when I submit an exact phrase query, I get the related results and the
related snippets with highlighting. But I've noticed that the *single terms of
the phrase are highlighted too*. Here is an example:

If I search for "quick brown fox", I obtain the correct result with
the doc which contains the phrase, but the snippets come back like this:


 


The quick brown fox jump over the lazy dog. The fox is a
nice animal.

 
  



Also, with some documents only single terms are highlighted instead of the
exact sentence, even if the exact phrase is contained in the document, i.e.:

 


The fox is a nice animal.

 
  



My understanding of highlighting is that if I search for an exact phrase,
only the exact phrase should be highlighted.
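
From what I can tell, the hl.usePhraseHighlighter parameter (off by default)
may be what governs this; an illustrative request, with host and field names
as placeholders only, would be:

  http://localhost:8983/solr/select?q=%22quick+brown+fox%22&hl=true&hl.fl=content&hl.usePhraseHighlighter=true

but I am not sure whether that is the whole story.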

Here is an extract of my solrconfig.xml & schema.xml:

solrconfig.xml:


   
   
   

 500

   

   
   

  
  700
  
  0.5
  
  [-\w ,/\n\"']{20,200}

  true

  true

   

   
   

 
 

   


schema.xml:



 


  














Maybe I'm missing something, or my understanding of the highlighting feature
is not correct. Any idea?

As always, thanks for your support!

Regards, Antonio


Re: Question regarding synonym

2009-10-05 Thread darniz

Yes, that's what we decided: to expand these terms while indexing.
If we have

bayrische motoren werke => bmw

and I have a document which has bmw in it, searching for text:bayrische does
not give me results. I have to give
text:"bayrische motoren werke"; then it actually picks up the synonym and gets
me the document.

Now if I change the synonym mapping to
bayrische motoren werke, bmw
with the expand parameter set to true, and also use this file at indexing
time, then at the time I index this document, along with "bmw" I also index
the words "bayrische", "motoren" and "werke".

Any text query like text:motoren or text:bayrische will give me results now.

Please correct me if my assumption is wrong.

Thanks
darniz









Christian Zambrano wrote:
> 
> 
> 
> On 10/02/2009 06:02 PM, darniz wrote:
>> Thanks
>> As i said it even works by giving double quotes too.
>> like carDescription:"austin martin"
>>
>> So is that the conclusion that in order to map two word synonym i have to
>> always enclose in double quotes, so that it doen not split the words
>>
>>
>>
>>
> Yes, but there are things you need to keep in mind.
> 
>  From the solr wiki:
> 
> Keep in mind that while the SynonymFilter will happily work with 
> *synonyms* containing multiple words (ie: 
> "sea biscuit, sea biscit, seabiscuit") The recommended approach for 
> dealing with *synonyms* like this, is to expand the synonym when 
> indexing. This is because there are two potential issues that can arrise 
> at query time:
> 
>1.
> 
>   The Lucene QueryParser tokenizes on white space before giving any
>   text to the Analyzer, so if a person searches for the words
>   sea biscit the analyzer will be given the words "sea" and "biscit"
>   seperately, and will not know that they match a synonym.
> 
>2.
> 
>   Phrase searching (ie: "sea biscit") will cause the QueryParser to
>   pass the entire string to the analyzer, but if the SynonymFilter
>   is configured to expand the *synonyms*, then when the QueryParser
>   gets the resulting list of tokens back from the Analyzer, it will
>   construct a MultiPhraseQuery that will not have the desired
>   effect. This is because of the limited mechanism available for the
>   Analyzer to indicate that two terms occupy the same position:
>   there is no way to indicate that a "phrase" occupies the same
>   position as a term. For our example the resulting MultiPhraseQuery
>   would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would
>   not match the simple case of "seabiscuit" occurring in a document
> 
> 
>>
>>
>>
>>
>>
>>
>>
>> Christian Zambrano wrote:
>>
>>> When you use a field qualifier(fieldName:valueToLookFor) it only applies
>>> to the word right after the semicolon. If you look at the debug
>>> infomation you will notice that for the second word it is using the
>>> default field.
>>>
>>> carDescription:austin
>>> *text*:martin
>>>
>>> the following should word:
>>>
>>> carDescription:(austin martin)
>>>
>>>
>>> On 10/02/2009 05:46 PM, darniz wrote:
>>>  
 This is not working when i search documents i have a document which
 contains
 text aston martin

 when i search carDescription:"austin martin" i get a match but when i
 dont
 give double quotes

 like carDescription:austin martin
 there is no match

 in the analyser if i give austin martin with out quotes, when it passes
 through synonym filter it matches aston martin ,
 may be by default analyser treats it as a phrase "austin martin" but
 when
 i
 try to do a query by typing
 carDescription:austin martin i get 0 documents. the following is the
 debug
 node info with debugQuery=on

 carDescription:austin martin
 carDescription:austin martin
 carDescription:austin text:martin
 carDescription:austin
 text:martin

 dont know why it breaks the word, may be its a desired behaviour
 when i give carDescription:"austin martin" of course in this its able
 to
 map
 to synonym and i get the desired result

 Any opinion

 darniz



 Ensdorf Ken wrote:


>
>  
>> Hi
>> i have a question regarding synonymfilter
>> i have a one way mapping defined
>> austin martin, astonmartin =>   aston martin
>>
>>
>>
> ...
>
>  
>> Can anybody please explain if my observation is correct. This is a
>> very
>> critical aspect for my work.
>>
>>
> That is correct - the synonym filter can recognize multi-token
> synonyms
> from consecutive tokens in a stream.
>
>
>
>
>  


>>>
>>>  
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Question-regarding-synonym-tp25720572p25754288.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Solr Timeouts

2009-10-05 Thread Giovanni Fernandez-Kincade
Is there somewhere other than solrconfig.xml where the autoCommit feature can be 
enabled? I've looked through that file and found autoCommit to be commented out:







This is what one of my SOLR requests looks like:



http://titans:8080/solr/update/extract/?literal.versionId=684936&literal.filingDate=1997-12-04T00:00:00Z&literal.formTypeId=95&literal.companyId=3567904&literal.sourceId=0&resource.name=684936.txt&commit=false



-Original Message-
From: Feak, Todd [mailto:todd.f...@smss.sony.com]
Sent: Monday, October 05, 2009 12:40 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Timeouts



Actually, ignore my other response.



I believe you are committing, whether you know it or not.



This is in your provided stack trace

org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
 SolrParams, boolean) 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
 SolrQueryResponse)



I think Yonik gave you additional information for how to make it faster.



-Todd



-Original Message-

From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]

Sent: Monday, October 05, 2009 9:30 AM

To: solr-user@lucene.apache.org

Subject: RE: Solr Timeouts



I'm not committing at all actually - I'm waiting for all 6 million to be done.



-Original Message-

From: Feak, Todd [mailto:todd.f...@smss.sony.com]

Sent: Monday, October 05, 2009 12:10 PM

To: solr-user@lucene.apache.org

Subject: RE: Solr Timeouts



How often are you committing?



Every time you commit, Solr will close the old index and open the new one. If 
you are doing this in parallel from multiple jobs (4-5 you mention) then 
eventually the server gets behind and you start to pile up commit requests. 
Once this starts to happen, it will cascade out of control if the rate of 
commits isn't slowed.



-Todd





From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]

Sent: Monday, October 05, 2009 9:04 AM

To: solr-user@lucene.apache.org

Subject: Solr Timeouts



Hi,

I'm attempting to index approximately 6 million HTML/Text files using SOLR 
1.4/Tomcat6 on Windows Server 2003 x64. I'm running 64 bit Tomcat and JVM. I've 
fired up 4-5 different jobs that are making indexing requests using the 
ExtractionRequestHandler, and everything works well for about 30-40 minutes, 
after which all indexing requests start timing out. I profiled the server and 
found that all of the threads are getting blocked by this call to flush the 
Lucene index to disk (see below).



This leads me to a few questions:



1.   Is this normal?



2.   Can I reduce the frequency with which this happens somehow? I've 
greatly increased the indexing options in SolrConfig.xml (attached here) to no 
avail.



3.   During these flushes, resource utilization (CPU, I/O, Memory 
Consumption) is significantly down compared to when requests are being handled. 
Is there any way to make this index go faster? I have plenty of bandwidth on 
the machine.



I appreciate any insight you can provide. We're currently using MS SQL 2005 as 
our full-text solution and are pretty much miserable. So far SOLR has been a 
great experience.



Thanks,

Gio.



http-8080-Processor21 [RUNNABLE] CPU time: 9:51

java.io.RandomAccessFile.seek(long)

org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(byte[],
 int, int)

org.apache.lucene.store.BufferedIndexInput.refill()

org.apache.lucene.store.BufferedIndexInput.readByte()

org.apache.lucene.store.IndexInput.readVInt()

org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)

org.apache.lucene.index.SegmentTermEnum.next()

org.apache.lucene.index.SegmentTermEnum.scanTo(Term)

org.apache.lucene.index.TermInfosReader.get(Term, boolean)

org.apache.lucene.index.TermInfosReader.get(Term)

org.apache.lucene.index.SegmentTermDocs.seek(Term)

org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)

org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)

org.apache.lucene.index.IndexWriter.applyDeletes()

org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)

org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)

org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)

org.apache.lucene.index.IndexWriter.closeInternal(boolean)

org.apache.lucene.index.IndexWriter.close(boolean)

org.apache.lucene.index.IndexWriter.close()

org.apache.solr.update.SolrIndexWriter.close()

org.apache.solr.update.DirectUpdateHandler2.closeWriter()

org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)

org.apache.solr.update.processor.RunUpdateProcessor.processCommit(CommitUpdateCommand)

org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
 SolrParams, boolean)

org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
 SolrQueryResponse)

org.apache.solr.handler.Reque

RE: Solr Timeouts

2009-10-05 Thread Feak, Todd
Actually, ignore my other response. 

I believe you are committing, whether you know it or not. 

This is in your provided stack trace
org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
 SolrParams, boolean) 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
 SolrQueryResponse)

I think Yonik gave you additional information for how to make it faster.

-Todd

-Original Message-
From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] 
Sent: Monday, October 05, 2009 9:30 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Timeouts

I'm not committing at all actually - I'm waiting for all 6 million to be done. 

-Original Message-
From: Feak, Todd [mailto:todd.f...@smss.sony.com] 
Sent: Monday, October 05, 2009 12:10 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Timeouts

How often are you committing?

Every time you commit, Solr will close the old index and open the new one. If 
you are doing this in parallel from multiple jobs (4-5 you mention) then 
eventually the server gets behind and you start to pile up commit requests. 
Once this starts to happen, it will cascade out of control if the rate of 
commits isn't slowed.

-Todd


From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Monday, October 05, 2009 9:04 AM
To: solr-user@lucene.apache.org
Subject: Solr Timeouts

Hi,
I'm attempting to index approximately 6 million HTML/Text files using SOLR 
1.4/Tomcat6 on Windows Server 2003 x64. I'm running 64 bit Tomcat and JVM. I've 
fired up 4-5 different jobs that are making indexing requests using the 
ExtractionRequestHandler, and everything works well for about 30-40 minutes, 
after which all indexing requests start timing out. I profiled the server and 
found that all of the threads are getting blocked by this call to flush the 
Lucene index to disk (see below).

This leads me to a few questions:

1.   Is this normal?

2.   Can I reduce the frequency with which this happens somehow? I've 
greatly increased the indexing options in SolrConfig.xml (attached here) to no 
avail.

3.   During these flushes, resource utilization (CPU, I/O, Memory 
Consumption) is significantly down compared to when requests are being handled. 
Is there any way to make this index go faster? I have plenty of bandwidth on 
the machine.

I appreciate any insight you can provide. We're currently using MS SQL 2005 as 
our full-text solution and are pretty much miserable. So far SOLR has been a 
great experience.

Thanks,
Gio.

http-8080-Processor21 [RUNNABLE] CPU time: 9:51
java.io.RandomAccessFile.seek(long)
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(byte[],
 int, int)
org.apache.lucene.store.BufferedIndexInput.refill()
org.apache.lucene.store.BufferedIndexInput.readByte()
org.apache.lucene.store.IndexInput.readVInt()
org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
org.apache.lucene.index.SegmentTermEnum.next()
org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
org.apache.lucene.index.TermInfosReader.get(Term, boolean)
org.apache.lucene.index.TermInfosReader.get(Term)
org.apache.lucene.index.SegmentTermDocs.seek(Term)
org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
org.apache.lucene.index.IndexWriter.applyDeletes()
org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
org.apache.lucene.index.IndexWriter.closeInternal(boolean)
org.apache.lucene.index.IndexWriter.close(boolean)
org.apache.lucene.index.IndexWriter.close()
org.apache.solr.update.SolrIndexWriter.close()
org.apache.solr.update.DirectUpdateHandler2.closeWriter()
org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(CommitUpdateCommand)
org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
 SolrParams, boolean)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
ServletResponse, FilterChain)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
 ServletResponse)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
Servl

RE: Solr Timeouts

2009-10-05 Thread Feak, Todd
Ok. Guess that isn't a problem. :)

A second consideration... I could see lock contention being an issue with 
multiple clients indexing at once. Is there any disadvantage to serializing the 
clients to remove lock contention?

-Todd

-Original Message-
From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] 
Sent: Monday, October 05, 2009 9:30 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Timeouts

I'm not committing at all actually - I'm waiting for all 6 million to be done. 

-Original Message-
From: Feak, Todd [mailto:todd.f...@smss.sony.com] 
Sent: Monday, October 05, 2009 12:10 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Timeouts

How often are you committing?

Every time you commit, Solr will close the old index and open the new one. If 
you are doing this in parallel from multiple jobs (4-5 you mention) then 
eventually the server gets behind and you start to pile up commit requests. 
Once this starts to happen, it will cascade out of control if the rate of 
commits isn't slowed.

-Todd


From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Monday, October 05, 2009 9:04 AM
To: solr-user@lucene.apache.org
Subject: Solr Timeouts

Hi,
I'm attempting to index approximately 6 million HTML/Text files using SOLR 
1.4/Tomcat6 on Windows Server 2003 x64. I'm running 64 bit Tomcat and JVM. I've 
fired up 4-5 different jobs that are making indexing requests using the 
ExtractionRequestHandler, and everything works well for about 30-40 minutes, 
after which all indexing requests start timing out. I profiled the server and 
found that all of the threads are getting blocked by this call to flush the 
Lucene index to disk (see below).

This leads me to a few questions:

1.   Is this normal?

2.   Can I reduce the frequency with which this happens somehow? I've 
greatly increased the indexing options in SolrConfig.xml (attached here) to no 
avail.

3.   During these flushes, resource utilization (CPU, I/O, Memory 
Consumption) is significantly down compared to when requests are being handled. 
Is there any way to make this index go faster? I have plenty of bandwidth on 
the machine.

I appreciate any insight you can provide. We're currently using MS SQL 2005 as 
our full-text solution and are pretty much miserable. So far SOLR has been a 
great experience.

Thanks,
Gio.

http-8080-Processor21 [RUNNABLE] CPU time: 9:51
java.io.RandomAccessFile.seek(long)
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(byte[],
 int, int)
org.apache.lucene.store.BufferedIndexInput.refill()
org.apache.lucene.store.BufferedIndexInput.readByte()
org.apache.lucene.store.IndexInput.readVInt()
org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
org.apache.lucene.index.SegmentTermEnum.next()
org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
org.apache.lucene.index.TermInfosReader.get(Term, boolean)
org.apache.lucene.index.TermInfosReader.get(Term)
org.apache.lucene.index.SegmentTermDocs.seek(Term)
org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
org.apache.lucene.index.IndexWriter.applyDeletes()
org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
org.apache.lucene.index.IndexWriter.closeInternal(boolean)
org.apache.lucene.index.IndexWriter.close(boolean)
org.apache.lucene.index.IndexWriter.close()
org.apache.solr.update.SolrIndexWriter.close()
org.apache.solr.update.DirectUpdateHandler2.closeWriter()
org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(CommitUpdateCommand)
org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
 SolrParams, boolean)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
ServletResponse, FilterChain)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
 ServletResponse)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
ServletResponse)
org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
org.apache.catalina.core.StandardHo

Re: Solr Timeouts

2009-10-05 Thread Yonik Seeley
On Mon, Oct 5, 2009 at 12:30 PM, Giovanni Fernandez-Kincade
 wrote:
> I'm not committing at all actually - I'm waiting for all 6 million to be done.

You either have Solr autoCommit set up, or a client is issuing a commit.

-Yonik
http://www.lucidimagination.com



> -Original Message-
> From: Feak, Todd [mailto:todd.f...@smss.sony.com]
> Sent: Monday, October 05, 2009 12:10 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr Timeouts
>
> How often are you committing?
>
> Every time you commit, Solr will close the old index and open the new one. If 
> you are doing this in parallel from multiple jobs (4-5 you mention) then 
> eventually the server gets behind and you start to pile up commit requests. 
> Once this starts to happen, it will cascade out of control if the rate of 
> commits isn't slowed.
>
> -Todd
>
> 
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Monday, October 05, 2009 9:04 AM
> To: solr-user@lucene.apache.org
> Subject: Solr Timeouts
>
> Hi,
> I'm attempting to index approximately 6 million HTML/Text files using SOLR 
> 1.4/Tomcat6 on Windows Server 2003 x64. I'm running 64 bit Tomcat and JVM. 
> I've fired up 4-5 different jobs that are making indexing requests using the 
> ExtractionRequestHandler, and everything works well for about 30-40 minutes, 
> after which all indexing requests start timing out. I profiled the server and 
> found that all of the threads are getting blocked by this call to flush the 
> Lucene index to disk (see below).
>
> This leads me to a few questions:
>
> 1.       Is this normal?
>
> 2.       Can I reduce the frequency with which this happens somehow? I've 
> greatly increased the indexing options in SolrConfig.xml (attached here) to 
> no avail.
>
> 3.       During these flushes, resource utilization (CPU, I/O, Memory 
> Consumption) is significantly down compared to when requests are being 
> handled. Is there any way to make this index go faster? I have plenty of 
> bandwidth on the machine.
>
> I appreciate any insight you can provide. We're currently using MS SQL 2005 
> as our full-text solution and are pretty much miserable. So far SOLR has been 
> a great experience.
>
> Thanks,
> Gio.
>
> http-8080-Processor21 [RUNNABLE] CPU time: 9:51
> java.io.RandomAccessFile.seek(long)
> org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(byte[],
>  int, int)
> org.apache.lucene.store.BufferedIndexInput.refill()
> org.apache.lucene.store.BufferedIndexInput.readByte()
> org.apache.lucene.store.IndexInput.readVInt()
> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
> org.apache.lucene.index.SegmentTermEnum.next()
> org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
> org.apache.lucene.index.TermInfosReader.get(Term, boolean)
> org.apache.lucene.index.TermInfosReader.get(Term)
> org.apache.lucene.index.SegmentTermDocs.seek(Term)
> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
> org.apache.lucene.index.IndexWriter.applyDeletes()
> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
> org.apache.lucene.index.IndexWriter.closeInternal(boolean)
> org.apache.lucene.index.IndexWriter.close(boolean)
> org.apache.lucene.index.IndexWriter.close()
> org.apache.solr.update.SolrIndexWriter.close()
> org.apache.solr.update.DirectUpdateHandler2.closeWriter()
> org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(CommitUpdateCommand)
> org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
>  SolrParams, boolean)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
>  SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
> SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
>  SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
> SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
> ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
>  ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
> ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.Error

RE: Solr Timeouts

2009-10-05 Thread Giovanni Fernandez-Kincade
I'm not committing at all actually - I'm waiting for all 6 million to be done. 

-Original Message-
From: Feak, Todd [mailto:todd.f...@smss.sony.com] 
Sent: Monday, October 05, 2009 12:10 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Timeouts

How often are you committing?

Every time you commit, Solr will close the old index and open the new one. If 
you are doing this in parallel from multiple jobs (4-5 you mention) then 
eventually the server gets behind and you start to pile up commit requests. 
Once this starts to happen, it will cascade out of control if the rate of 
commits isn't slowed.

-Todd


From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Monday, October 05, 2009 9:04 AM
To: solr-user@lucene.apache.org
Subject: Solr Timeouts

Hi,
I'm attempting to index approximately 6 million HTML/Text files using SOLR 
1.4/Tomcat6 on Windows Server 2003 x64. I'm running 64 bit Tomcat and JVM. I've 
fired up 4-5 different jobs that are making indexing requests using the 
ExtractionRequestHandler, and everything works well for about 30-40 minutes, 
after which all indexing requests start timing out. I profiled the server and 
found that all of the threads are getting blocked by this call to flush the 
Lucene index to disk (see below).

This leads me to a few questions:

1.   Is this normal?

2.   Can I reduce the frequency with which this happens somehow? I've 
greatly increased the indexing options in SolrConfig.xml (attached here) to no 
avail.

3.   During these flushes, resource utilization (CPU, I/O, Memory 
Consumption) is significantly down compared to when requests are being handled. 
Is there any way to make this index go faster? I have plenty of bandwidth on 
the machine.

I appreciate any insight you can provide. We're currently using MS SQL 2005 as 
our full-text solution and are pretty much miserable. So far SOLR has been a 
great experience.

Thanks,
Gio.

http-8080-Processor21 [RUNNABLE] CPU time: 9:51
java.io.RandomAccessFile.seek(long)
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(byte[],
 int, int)
org.apache.lucene.store.BufferedIndexInput.refill()
org.apache.lucene.store.BufferedIndexInput.readByte()
org.apache.lucene.store.IndexInput.readVInt()
org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
org.apache.lucene.index.SegmentTermEnum.next()
org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
org.apache.lucene.index.TermInfosReader.get(Term, boolean)
org.apache.lucene.index.TermInfosReader.get(Term)
org.apache.lucene.index.SegmentTermDocs.seek(Term)
org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
org.apache.lucene.index.IndexWriter.applyDeletes()
org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
org.apache.lucene.index.IndexWriter.closeInternal(boolean)
org.apache.lucene.index.IndexWriter.close(boolean)
org.apache.lucene.index.IndexWriter.close()
org.apache.solr.update.SolrIndexWriter.close()
org.apache.solr.update.DirectUpdateHandler2.closeWriter()
org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(CommitUpdateCommand)
org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
 SolrParams, boolean)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
ServletResponse, FilterChain)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
 ServletResponse)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
ServletResponse)
org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
 Object[])
org

Re: debugQuery different score for same query. dismax

2009-10-05 Thread Yonik Seeley
On Mon, Oct 5, 2009 at 4:42 AM, Julian Davchev  wrote:
> Well,
> Any explanation why I get different scores then?

I didn't have enough context to see if anything was wrong... by
"different scores" do you mean that the debugQuery scores don't match
with the scores in the main document list?  That would be a bug.

But I suspect you just mean that different documents score
differently... that's what is supposed to happen.
In your diff, I see a different fieldNorm factor, which probably means
that the lengths of the fields are just different.

Sure enough, that explains the different scores:
3.7137468 * .375 / .4375 = 3.1832115

If you don't want length normalization for this field, turn it off by
setting omitNorms=true

-Yonik
http://www.lucidimagination.com



> Yonik Seeley wrote:
>> On Fri, Oct 2, 2009 at 8:16 AM, Julian Davchev  wrote:
>>
>>> It looks for "pari"   in   ancestorName  field   but first row looks in
>>> 241135 records
>>> and the second row it's just 187821 records.
>>>
>>
>> The "in 241135" is just saying that this match is in document #241135.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>>
>>>  Which in results give
>>> lower score for the second row.
>>>
>>> Question is what is affecting this thingy cause I would expect same
>>> fieldname same value to give same score.
>>>
>>> It's dismax query...I skipped showing scoring of other fields to simplify.
>>>
>>> Cheers
>>>
>>>
>>> -3.7137468 = (MATCH) weight(ancestorName:pari^35.0 in 241135), product of:
>>> +3.1832116 = (MATCH) weight(ancestorName:pari^35.0 in 187821), product of:
>>>  0.8593 = queryWeight(ancestorName:pari^35.0), product of:
>>>     35.0 = boost
>>>  8.488684 = idf(docFreq=148, numDocs=74979)
>>>     0.0033657781 = queryNorm
>>> -    3.713799 = (MATCH) fieldWeight(ancestorName:pari in 241135),
>>> product of:
>>> +    3.1832564 = (MATCH) fieldWeight(ancestorName:pari in 187821),
>>> product of:
>>>     1.0 = tf(termFreq(ancestorName:pari)=1)
>>>     8.488684 = idf(docFreq=148, numDocs=74979)
>>> -0.4375 = fieldNorm(field=ancestorName, doc=241135)
>>> +0.375 = fieldNorm(field=ancestorName, doc=187821)
>>>
>>>
>
>


Re: Solr Timeouts

2009-10-05 Thread Yonik Seeley
On Mon, Oct 5, 2009 at 12:03 PM, Giovanni Fernandez-Kincade
 wrote:
> Hi,
>
> I’m attempting to index approximately 6 million HTML/Text files using SOLR
> 1.4/Tomcat6 on Windows Server 2003 x64. I’m running 64 bit Tomcat and JVM.
> I’ve fired up 4-5 different jobs that are making indexing requests using the
> ExtractionRequestHandler, and everything works well for about 30-40 minutes,
> after which all indexing requests start timing out. I profiled the server
> and found that all of the threads are getting blocked by this call to flush
> the Lucene index to disk (see below).
>
>
>
> This leads me to a few questions:
>
> 1. Is this normal?

Yes... one can't currently add documents when the first part of a
commit is going on (closing the IndexWriter).  The threads will
normally block and then resume after the writer has been successfully
closed.  This is normally fine and you can work around it by
increasing the servlet container timeout.

Due to advances in Lucene, this restriction will probably be lifted in
the next version of Solr (1.5)

> 2. Can I reduce the frequency with which this happens somehow? I’ve
> greatly increased the indexing options in SolrConfig.xml (attached here) to
> no avail.

It looks like Solr is committing because you told it to?

> 3. During these flushes, resource utilization (CPU, I/O, Memory
> Consumption) is significantly down compared to when requests are being
> handled. Is there any way to make this index go faster? I have plenty of
> bandwidth on the machine.

Don't commit until you're done with a big indexing run?
If you're using SolrJ, use the StreamingUpdateSolrServer; it's much faster!

-Yonik
http://www.lucidimagination.com
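
For illustration, here is a minimal SolrJ sketch of the batch-then-commit-once
approach described above. It assumes the Solr 1.4 SolrJ client; the URL, field
names, document count and queue/thread sizes are made up for the example.
StreamingUpdateSolrServer buffers adds on background threads, and a single
explicit commit is issued only after the whole run:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexSketch {
    public static void main(String[] args) throws Exception {
        // queue size 100, 4 background threads (illustrative values)
        SolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8080/solr", 100, 4);

        for (int i = 0; i < 1000; i++) {            // stand-in for the real document set
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);          // hypothetical field names
            doc.addField("text", "extracted body text for document " + i);
            server.add(doc);
        }

        server.commit();    // one explicit commit at the end of the run
    }
}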


> I appreciate any insight you can provide. We’re currently using MS SQL 2005
> as our full-text solution and are pretty much miserable. So far SOLR has
> been a great experience.
>
>
>
> Thanks,
>
> Gio.
>
>
>
> http-8080-Processor21 [RUNNABLE] CPU time: 9:51
>
> java.io.RandomAccessFile.seek(long)
>
> org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(byte[],
> int, int)
>
> org.apache.lucene.store.BufferedIndexInput.refill()
>
> org.apache.lucene.store.BufferedIndexInput.readByte()
>
> org.apache.lucene.store.IndexInput.readVInt()
>
> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
>
> org.apache.lucene.index.SegmentTermEnum.next()
>
> org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
>
> org.apache.lucene.index.TermInfosReader.get(Term, boolean)
>
> org.apache.lucene.index.TermInfosReader.get(Term)
>
> org.apache.lucene.index.SegmentTermDocs.seek(Term)
>
> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
>
> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
>
> org.apache.lucene.index.IndexWriter.applyDeletes()
>
> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
>
> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
>
> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
>
> org.apache.lucene.index.IndexWriter.closeInternal(boolean)
>
> org.apache.lucene.index.IndexWriter.close(boolean)
>
> org.apache.lucene.index.IndexWriter.close()
>
> org.apache.solr.update.SolrIndexWriter.close()
>
> org.apache.solr.update.DirectUpdateHandler2.closeWriter()
>
> org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
>
> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(CommitUpdateCommand)
>
> org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
> SolrParams, boolean)
>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
> SolrQueryResponse)
>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest,
> SolrQueryResponse)
>
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
> SolrQueryResponse)
>
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest,
> SolrQueryResponse)
>
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest,
> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest,
> ServletResponse, FilterChain)
>
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
> ServletResponse)
>
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest,
> ServletResponse)
>
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
>
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
>
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
>
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
>
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
>
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
>
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
>
> org.apache.coyote

RE: Solr Timeouts

2009-10-05 Thread Feak, Todd
How often are you committing?

Every time you commit, Solr will close the old index and open the new one. If 
you are doing this in parallel from multiple jobs (4-5 you mention) then 
eventually the server gets behind and you start to pile up commit requests. 
Once this starts to happen, it will cascade out of control if the rate of 
commits isn't slowed.

-Todd


From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Monday, October 05, 2009 9:04 AM
To: solr-user@lucene.apache.org
Subject: Solr Timeouts

Hi,
I'm attempting to index approximately 6 million HTML/Text files using SOLR 
1.4/Tomcat6 on Windows Server 2003 x64. I'm running 64 bit Tomcat and JVM. I've 
fired up 4-5 different jobs that are making indexing requests using the 
ExtractionRequestHandler, and everything works well for about 30-40 minutes, 
after which all indexing requests start timing out. I profiled the server and 
found that all of the threads are getting blocked by this call to flush the 
Lucene index to disk (see below).

This leads me to a few questions:

1.   Is this normal?

2.   Can I reduce the frequency with which this happens somehow? I've 
greatly increased the indexing options in SolrConfig.xml (attached here) to no 
avail.

3.   During these flushes, resource utilization (CPU, I/O, Memory 
Consumption) is significantly down compared to when requests are being handled. 
Is there any way to make this index go faster? I have plenty of bandwidth on 
the machine.

I appreciate any insight you can provide. We're currently using MS SQL 2005 as 
our full-text solution and are pretty much miserable. So far SOLR has been a 
great experience.

Thanks,
Gio.

http-8080-Processor21 [RUNNABLE] CPU time: 9:51
java.io.RandomAccessFile.seek(long)
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(byte[],
 int, int)
org.apache.lucene.store.BufferedIndexInput.refill()
org.apache.lucene.store.BufferedIndexInput.readByte()
org.apache.lucene.store.IndexInput.readVInt()
org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
org.apache.lucene.index.SegmentTermEnum.next()
org.apache.lucene.index.SegmentTermEnum.scanTo(Term)
org.apache.lucene.index.TermInfosReader.get(Term, boolean)
org.apache.lucene.index.TermInfosReader.get(Term)
org.apache.lucene.index.SegmentTermDocs.seek(Term)
org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int)
org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos)
org.apache.lucene.index.IndexWriter.applyDeletes()
org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean)
org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean)
org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean)
org.apache.lucene.index.IndexWriter.closeInternal(boolean)
org.apache.lucene.index.IndexWriter.close(boolean)
org.apache.lucene.index.IndexWriter.close()
org.apache.solr.update.SolrIndexWriter.close()
org.apache.solr.update.DirectUpdateHandler2.closeWriter()
org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(CommitUpdateCommand)
org.apache.solr.handler.RequestHandlerUtils.handleCommit(UpdateRequestProcessor,
 SolrParams, boolean)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
ServletResponse, FilterChain)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
 ServletResponse)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
ServletResponse)
org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
 Object[])
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, 
Object[])
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
java.lang.Thread.run()



Re: A little help with indexing joined words

2009-10-05 Thread Avlesh Singh
>
> We have indexed a product database and have come across some search terms
> where zero results are returned.  There are products in the index with
> 'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
> 'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
> respectively.
>
"Borderland" should have worked for a regular text field. For all other
desired matches you can use EdgeNGramTokenizerFactory.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 7:51 PM, Andrew McCombe  wrote:

> Hi
> I am hoping someone can point me in the right direction with regards to
> indexing words that are concatenated together to make other words or
> product
> names.
>
> We have indexed a product database and have come across some search terms
> where zero results are returned.  There are products in the index with
> 'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
> 'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
> respectively.
>
> Where do I look to resolve this?  The product name field is indexed using a
> text field type.
>
> Thanks in advance
> Andrew
>
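
Purely as an illustration of what the "front" edge n-grams of a single term
look like (plain Java string slicing here, not the actual Solr/Lucene filter;
the term is made up for the example):

public class EdgeNGramSketch {
    public static void main(String[] args) {
        String term = "dragonfly";
        // front edge n-grams: d, dr, dra, ..., dragonfly
        for (int len = 1; len <= term.length(); len++) {
            System.out.println(term.substring(0, len));
        }
    }
}

Indexing prefixes like these is what would let a shorter query term such as
"borderland" match the longer indexed token "borderlands".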


about modifying querystring parameters submitted to solr

2009-10-05 Thread gdeconto

Sorry if this is a very simple question, but I am stuck (and online searches
for this info haven't been fruitful).

Let's say that, in certain circumstances, I want to change the field names
and/or field query values being passed to SOLR.

For example, let's say my unmodified query is
"http://localhost:8994/solr/select?q=xxx:[* TO 3] AND yyy:[3 TO
*]&defType=myQParser" and (JUST for the sake of argument) let's say I want to
rewrite it as "http://localhost:8994/solr/select?q=aaa:[1 TO 2] AND bbb:[3
TO 10]&defType=myQParser".

I think I can do it by extending QParserPlugin, and overriding the
createParser method (see my code snippet below). The qstr parameter
apparently contains the parts I want to examine and/or modify.

Now to my questions:
1. Is that the correct location to do this sort of manipulation?
2. Is there an existing method for parsing out the fields and their
parameters? i.e. to break a qstr of "xxx:[* TO 3] AND yyy:[3 TO *]" into an
array something like x[0][0] = "xxx", x[0][1] = "* TO 3", x[1][0] = "yyy",
x[1][1] = "3 TO *". Or possibly even finer granularity than that. I could
write it myself but it's much nicer not to have to (especially since the
queries could be very complex).

Thanks in advance for any help.

package com.topproducer.rentals.solr.search;

import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class myQParserPlugin extends QParserPlugin {

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
            SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
            QParser baseParser;

            public Query parse() throws ParseException {
                StringBuilder queryBuilder = new StringBuilder();

                // extract and/or view and/or change qstr content here
                // ..
                // is there an existing function/method to parse qstr into its
                // component parts? i.e. to break "?q=xxx:[1 TO 3] AND yyy:[3 TO *]"
                // into something like:
                // x[0][0] = "xxx", x[0][1] = "1 TO 3"
                // x[1][0] = "yyy", x[1][1] = "3 TO *"

                // after modifying qstr, store it into queryBuilder here
                String newQstr = qstr; // placeholder for the rewritten query string
                queryBuilder.append(newQstr);

                // prepare queryBuilder for any additional Solr handling
                baseParser = subQuery(queryBuilder.toString(), null);
                Query q = baseParser.parse();
                return q;
            }

            public String[] getDefaultHighlightFields() {
                return baseParser.getDefaultHighlightFields();
            }

            public Query getHighlightQuery() throws ParseException {
                return baseParser.getHighlightQuery();
            }

            public void addDebugInfo(NamedList debugInfo) {
                baseParser.addDebugInfo(debugInfo);
            }
        };
    }

    @Override
    public void init(NamedList arg0) {
        // TODO Auto-generated method stub
    }
}
-- 
View this message in context: 
http://www.nabble.com/about-modifying-querystring-parameters-submitted-to-solr-tp25752898p25752898.html
Sent from the Solr - User mailing list archive at Nabble.com.
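
As a very rough illustration of the kind of string-level rewrite the comments
in the code above describe, a plain substitution could swap the field names
before the string is handed to subQuery(). This is only a sketch using the
made-up xxx/yyy/aaa/bbb names from the example; it ignores quoting, escaping
and the range values, so it is not a real query rewriter:

public class RewriteSketch {
    public static void main(String[] args) {
        String qstr = "xxx:[* TO 3] AND yyy:[3 TO *]";
        // swap field names only; the range endpoints would need their own handling
        String rewritten = qstr.replaceAll("\\bxxx:", "aaa:")
                               .replaceAll("\\byyy:", "bbb:");
        System.out.println(rewritten);   // prints: aaa:[* TO 3] AND bbb:[3 TO *]
    }
}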



Re: Solr Trunk Heap Space Issues

2009-10-05 Thread Marc Sturlese

I think it doesn't make sense to enable warming if your Solr instance is just
for indexing purposes (it changes if you use it for search as well). You
could comment out the caches as well in solrconfig.xml.
Setting queryResultWindowSize and queryResultMaxDocsCached to zero might
help... (but if the caches and warming are removed from solrconfig.xml, I
think these two parameters do nothing)

Jeffery Newburn wrote:
> 
> Ah yes we do have some warming queries which would look like a search. 
> Did
> that side change enough to push up the memory limits where we would run
> out
> like this?  Also, would FastLRU cache make a difference?
> -- 
> Jeff Newburn
> Software Engineer, Zappos.com
> jnewb...@zappos.com - 702-943-7562
> 
> 
>> From: Yonik Seeley 
>> Reply-To: 
>> Date: Fri, 2 Oct 2009 00:53:46 -0400
>> To: 
>> Subject: Re: Solr Trunk Heap Space Issues
>> 
>> On Thu, Oct 1, 2009 at 8:45 PM, Jeffery Newburn 
>> wrote:
>>> I loaded the jvm and started indexing. It is a test server so unless
>>> some
>>> errant query came in then no searching. Our instance has only 512mb but
>>> my
>>> concern is the obvious memory requirement leap since it worked before.
>>> What
>>> other data would be helpful with this?
>> 
>> Interesting... not too much should have changed for memory
>> requirements on the indexing side.
>> TokenStreams are now reused (and hence cached) per thread... but that
>> normally wouldn't amount to much.
>> 
>> There was recently another bug where compound file format was being
>> used regardless of the config settings... but I think that was fixed
>> on the 29th.
>> 
>> Maybe you were already close to the limit required?
>> Also, your heap dump did show LRUCache taking up 170MB, and only
>> searches populate that (perhaps you have warming searches configured
>> on this server?)
>> 
>> -Yonik
>> http://www.lucidimagination.com
>> 
>> 
>> 
>> 
>> 
>>> 
>>> 
>>> On Oct 1, 2009, at 5:14 PM, "Mark Miller"  wrote:
>>> 
 Jeff Newburn wrote:
> 
> Ok I was able to get a heap dump from the GC Limit error.
> 
> 1 instance of LRUCache is taking 170mb
> 1 instance of SchemaIndex is taking 56Mb
> 4 instances of SynonymMap is taking 112mb
> 
> There is no searching going on during this index update process.
> 
> Any ideas what on earth is going on?  Like I said my May version did
> this
> without any problems whatsoever.
> 
> 
 Had any searching gone on though? Even if its not occurring during the
 indexing, you will still have the data structure loaded if searches had
 occurred.
 
 What heap size do you have - that doesn't look like much data to me ...
 
 --
 - Mark
 
 http://www.lucidimagination.com
 
 
 
>>> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Solr-Trunk-Heap-Space-Issues-tp25701422p25752521.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: search by some functionality

2009-10-05 Thread Elaine Li
Hi Chantal,

I thought about that - taking care of the comparison at index time. But
the user's input scenarios are countless, so that method will not cover
all the cases. Doing the comparison on the fly is better. I am just
confused about which way to go, since I have not done much customization
of Solr so far.

I see that a customized function query and a customized request handler
are two options. I wish I could get more details about how to do that. I
would appreciate some examples.

Thanks.

Elaine


On Mon, Oct 5, 2009 at 10:35 AM, Chantal Ackermann
 wrote:
> Hi Elaine,
>
> couldn't you do that comparison at index time and store the result in an
> additional field? At query time you could include that other field as
> condition.
>
> Cheers,
> Chantal
>
> Elaine Li schrieb:
>>
>> Hi Shalin,
>>
>> Thanks for your attention.
>>
>> I am implementing a language translation search. The field1 and field2
>> are two language's sentence pair. field3 is a table of indexes of
>> words in field1 and field2. The table was created by some independent
>> algorithm. If string1 and string2 can be found aligned in the table,
>> then it is a hit. Otherwise, it should not return.
>>
>> Hope I clarified what i need to achieve. Your help is greatly appreciated!
>>
>> Thanks!
>>
>> Elaine
>>
>> On Mon, Oct 5, 2009 at 5:57 AM, Shalin Shekhar Mangar
>>  wrote:
>>>
>>> On Sat, Oct 3, 2009 at 1:16 AM, Elaine Li 
>>> wrote:
>>>
 Hi,

 My doc has three fields, say field1, field2, field3.

 My search would be q=field1:string1 && field2:string2. I also need to
 do some computation and comparison of the string1 and string2 with the
 contents in field3 and then determine if it is a hit.

 What can I do to implement this?

>>> What exactly are you trying to achieve? What is the
>>> computation/comparison
>>> that you need to do?
>>>
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>>>
>


Re: search by some functionality

2009-10-05 Thread Elaine Li
Hi Sandeep,

I read that chapter before. It did not mention how to create my own
customized function.
Can you point me to some instructions?

Thanks.

Elaine

On Mon, Oct 5, 2009 at 10:15 AM, Sandeep Tagore
 wrote:
>
> Hi Elaine,
> You can make use of  http://wiki.apache.org/solr/FunctionQuery Function
> Query  to achieve this. You can do the computations in your customized
> function to determine whether it is a hit or not.
>
> Sandeep
>
> -
> Sandeep Tagore
>
> --
> View this message in context: 
> http://www.nabble.com/search-by-some-functionality-tp25721533p25751627.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Always spellcheck (suggest)

2009-10-05 Thread Christian Zambrano

Shalin,


Thanks for the clarification. That explains a lot. I should have looked 
at the Lucene documentation.



On 10/05/2009 05:28 AM, Shalin Shekhar Mangar wrote:

On Mon, Oct 5, 2009 at 10:24 AM, Christian Zambranowrote:

   

I am really surprised that a query for "behaviour" returns "behavior" as a
suggestion only when the parameter "spellcheck.onlyMorePopular=true" is
present. I re-read the documentation and I see nothing that will imply that
the parameter onlyMorePopular will do anything else but filter the
suggestions solr will return.

Maybe somebody else can shed some light on this.


 

Yeah, that is true. All this is actually done in the Lucene SpellChecker.
Solr's component is a wrapper over it with some extra features. I've added a
clarification to the wiki page.

   


Re: search by some functionality

2009-10-05 Thread Chantal Ackermann

Hi Elaine,

couldn't you do that comparison at index time and store the result in an 
additional field? At query time you could include that other field as 
condition.


Cheers,
Chantal

Elaine Li schrieb:

Hi Shalin,

Thanks for your attention.

I am implementing a language translation search. The field1 and field2
are two language's sentence pair. field3 is a table of indexes of
words in field1 and field2. The table was created by some independent
algorithm. If string1 and string2 can be found aligned in the table,
then it is a hit. Otherwise, it should not return.

Hope I clarified what i need to achieve. Your help is greatly appreciated!

Thanks!

Elaine

On Mon, Oct 5, 2009 at 5:57 AM, Shalin Shekhar Mangar
 wrote:

On Sat, Oct 3, 2009 at 1:16 AM, Elaine Li  wrote:


Hi,

My doc has three fields, say field1, field2, field3.

My search would be q=field1:string1 && field2:string2. I also need to
do some computation and comparison of the string1 and string2 with the
contents in field3 and then determine if it is a hit.

What can I do to implement this?


What exactly are you trying to achieve? What is the computation/comparison
that you need to do?

--
Regards,
Shalin Shekhar Mangar.



Re: Limit of a one-server-SOLR-installation

2009-10-05 Thread Mark Miller
How many unique fields are you sorting and faceting on?

Without knowing much, based on what you have said, for a single machine
I would recommend at least 16GB of RAM for your setup. 32GB would be
even better. 17 million docs is definitely doable on a single server, but
if you are faceting/sorting on multiple fields, 8 GB of RAM is definitely
on the low end - especially with a 60-some GB index. You want to get a
lot of that index cached in the filesystem cache, but if you are only
leaving 2 GB of RAM for the OS, there is likely very little available for
the filesystem cache - and even if you can fit into that RAM (which you
likely can't), your performance will suffer.

If you want to shard across machines (if for some odd reason it's easier
to get another machine rather than more RAM), I'd just look into Solr
distributed search. It's scalable to over a billion docs.

Thomas Koch wrote:
> Hi Gasol Wu,
>
> thanks for your reply. I tried to make the config and syslog shorter and more 
> readable.
>
> solrconfig.xml (shortened):
>
> 
>   
> false
> 15
> 1500
> 2147483647
> 1
> 1000
> 1
>   
>
>   
> false
> 10
> 1000
> 2147483647
> 1
>   
>
>   
>
>   
>class="solr.LRUCache"
>   size="512"
>   initialSize="512"
>   autowarmCount="0"/>
>
>class="solr.LRUCache"
>   size="512"
>   initialSize="512"
>   autowarmCount="0"/>
>
>class="solr.LRUCache"
>   size="512"
>   initialSize="512"
>   autowarmCount="0"/>
>
> true
> 10
> 
> 
> false
> 4
>   
>
>   
>  multipartUploadLimitInKB="2048" />
>   
>
>   
>  
>explicit
>  
>   
>
>   
> 
>  explicit
>  0.01
>  
> text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
>  
>  
> text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9
>  
>  
> ord(poplarity)^0.5 recip(rord(price),1,1000,1000)^0.3
>  
>  
> id,name,price,score
>  
>  
> 2<-1 5<-2 6<90%
>  
>  100
>  *:*
> 
>   
>
>   
> 
>  explicit
>  text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0
>  2<-1 5<-2 6<90%
>  incubationdate_dt:[* TO NOW/DAY-1MONTH]^2.2
> 
> 
>   inStock:true
> 
> 
>   cat
>   manu_exact
>   price:[* TO 500]
>   price:[500 TO *]
> 
>   
>   
>   
>  
> inStock:true
>  
>  
> text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
>  
>  
> 2<-1 5<-2 6<90%
>  
>   
>
>class="org.apache.solr.request.XSLTResponseWriter">
> 5
>
> 
>
>
> syslog (shortened and formated):
>
> o.a.coyote.http11.Http11Protocol init
> INFO: Initializing Coyote HTTP/1.1 on http-8080
> o.a.catalina.startup.Catalina load
> INFO: Initialization processed in 416 ms
> o.a.catalina.core.StandardService start
> INFO: Starting service Catalina
> o.a.catalina.core.StandardEngine start
> INFO: Starting Servlet Engine: Apache Tomcat/6.0.20
> o.a.s.servlet.SolrDispatchFilter init
> INFO: SolrDispatchFilter.init()
> o.a.s.core.SolrResourceLoader locateInstanceDir
> INFO: Using JNDI solr.home: /usr/share/solr
> o.a.s.core.CoreContainer$Initializer initialize
> INFO: looking for solr.xml: /usr/share/solr/solr.xml
> o.a.s.core.SolrResourceLoader 
> INFO: Solr home set to '/usr/share/solr/'
> o.a.s.core.SolrResourceLoader createClassLoader
> INFO: Reusing parent classloader
> o.a.s.core.SolrResourceLoader locateInstanceDir
> INFO: Using JNDI solr.home: /usr/share/solr
> o.a.s.core.SolrResourceLoader 
> INFO: Solr home set to '/usr/share/solr/'
> o.a.s.core.SolrResourceLoader createClassLoader
> INFO: Reusing parent classloader
> o.a.s.core.SolrConfig 
> INFO: Loaded SolrConfig: solrconfig.xml
> o.a.s.core.SolrCore 
> INFO: Opening new SolrCore at /usr/share/solr/, 
> dataDir=/var/lib/solr/data/
> o.a.s.schema.IndexSchema readSchema
> INFO: Reading Solr Schema
> o.a.s.schema.IndexSchema readSchema
> INFO: Schema name=memoarticle
> o.a.s.schema.IndexSchema readSchema
> INFO: default search field is catchalltext
> o.a.s.schema.IndexSchema readSchema
> INFO: query parser default operator is AND
> o.a.s.schema.IndexSchema readSchema
> INFO: unique key field: id
> o.a.s.core.SolrCore 
> INFO: JMX monitoring not detected for core: null
> o.a.s.core.SolrCore parseListener
> INFO: Searching for listeners: //listen...@event="firstSearcher"]
> o.a.s.core.SolrCore parseListener
> INFO: Searching for listeners: //listen...@event="newSearcher"]
> o.a.s.request.XSLTResponseWriter init
> INFO: xsltCacheLifetimeSeconds=5
> o.a.s.core.RequestHandlers$1 create
> INFO: adding lazy requestHandler: solr.SpellCheckerRequestHandler
> o.a.s.core.RequestHandlers$1 create
> INFO: adding lazy requestHandler: solr.CSVRequestHandler
> o.a.s.core.SolrCore initDeprecatedSupport
> WARN

A little help with indexing joined words

2009-10-05 Thread Andrew McCombe
Hi
I am hoping someone can point me in the right direction with regards to
indexing words that are concatenated together to make other words or product
names.

We have indexed a product database and have come across some search terms
where zero results are returned.  There are products in the index with
'Borderlands xxx xxx', 'Dragonfly xx xxx' in the title.  Searches for
'Borderland'  or 'Border Land' and 'Dragon Fly' return zero results
respectively.

Where do I look to resolve this?  The product name field is indexed using a
text field type.

Thanks in advance
Andrew


Re: Stopping Solr

2009-10-05 Thread Mark Miller
Sandeep Tagore wrote:
> Hello Everyone,
> I use "java -jar start.jar" command to start Solr. And when ever i want to
> stop it, I kill the process. 
> Is there any command to stop it?
>
> Thanks in advance.
>
> Sandeep
>
> -
> Sandeep Tagore
>
>   
Just look up how jetty works - it's not Solr specific.

One option is start like:

java -DSTOP.PORT={port} -DSTOP.KEY=secret -jar start.jar

Pick a port and a key (I used secret).

Then to stop:

java -DSTOP.PORT={port} -DSTOP.KEY=secret -jar start.jar --stop


-- 
- Mark

http://www.lucidimagination.com





Re: search by some functionality

2009-10-05 Thread Sandeep Tagore

Hi Elaine,
You can make use of Function Query (http://wiki.apache.org/solr/FunctionQuery)
to achieve this. You can do the computations in your customized
function to determine whether it is a hit or not.
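
Just to sketch what that could look like (every name below is hypothetical;
you would have to write and register the function yourself): register a
custom value source parser in solrconfig.xml,

  <valueSourceParser name="aligned" class="com.example.AlignedValueSourceParser" />

have it return 1 when string1/string2 line up in the field3 table and 0
otherwise, and then filter with a function range query:

  q=field1:string1 AND field2:string2&fq={!frange l=1}aligned(field3)

As far as I know both frange and pluggable value source parsers need a
recent 1.4 build, not 1.3.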

Sandeep

-
Sandeep Tagore

-- 
View this message in context: 
http://www.nabble.com/search-by-some-functionality-tp25721533p25751627.html
Sent from the Solr - User mailing list archive at Nabble.com.



Stopping Solr

2009-10-05 Thread Sandeep Tagore

Hello Everyone,
I use "java -jar start.jar" command to start Solr. And when ever i want to
stop it, I kill the process. 
Is there any command to stop it?

Thanks in advance.

Sandeep

-
Sandeep Tagore

-- 
View this message in context: 
http://www.nabble.com/Stopping-Solr-tp25751461p25751461.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Trunk Heap Space Issues

2009-10-05 Thread Yonik Seeley
Looks like you have a huge document cache, and the warming query must
have a really high "rows".
Can you lower the rows to something like 10 on the master?
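
For reference, the warming queries live under the newSearcher/firstSearcher
listeners in solrconfig.xml; a minimal sketch (the query here is only a
placeholder, not your actual warming query) would be:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="start">0</str>
        <str name="rows">10</str>
      </lst>
    </arr>
  </listener>

Keeping rows small stops the warming query from pushing tens of thousands
of documents into the document cache.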

-Yonik
http://www.lucidimagination.com


On Fri, Oct 2, 2009 at 11:28 AM, Jeff Newburn  wrote:
> The warmers return 11 fields:
> 3 Strings
> 2 booleans
> 2 doubles
> 2 longs
> 1 sint (solr.SortableIntField)
>
> Let me know if you need the fields actually be searched on.
>
> name:  fieldCache
> class:  org.apache.solr.search.SolrFieldCacheMBean
> version:  1.0
> description:  Provides introspection of the Lucene FieldCache, this is
> **NOT** a cache that is managed by Solr.
> stats: entries_count :  0
> insanity_count :  0
>
> name:  documentCache
> class:  org.apache.solr.search.LRUCache
> version:  1.0
> description:  LRU Cache(maxSize=10, initialSize=75000)
> stats: lookups :  22620
> hits :  337
> hitratio :  0.01
> inserts :  22282
> evictions :  0
> size :  22282
> warmupTime :  0
> cumulative_lookups :  22620
> cumulative_hits :  337
> cumulative_hitratio :  0.01
> cumulative_inserts :  22282
> cumulative_evictions :  0
>
>
> name:  fieldValueCache
> class:  org.apache.solr.search.FastLRUCache
> version:  1.0
> description:  Concurrent LRU Cache(maxSize=1, initialSize=10,
> minSize=9000, acceptableSize=9500, cleanupThread=false)
> stats: lookups :  0
> hits :  0
> hitratio :  0.00
> inserts :  0
> evictions :  0
> size :  0
> warmupTime :  0
> cumulative_lookups :  0
> cumulative_hits :  0
> cumulative_hitratio :  0.00
> cumulative_inserts :  0
> cumulative_evictions :  0
>
> --
> Jeff Newburn
> Software Engineer, Zappos.com
> jnewb...@zappos.com - 702-943-7562
>
>
>> From: Yonik Seeley 
>> Reply-To: 
>> Date: Fri, 2 Oct 2009 10:04:27 -0400
>> To: 
>> Subject: Re: Solr Trunk Heap Space Issues
>>
>> On Fri, Oct 2, 2009 at 10:02 AM, Mark Miller  wrote:
>>> Jeff Newburn wrote:
 that side change enough to push up the memory limits where we would run out
 like this?

>>> Yes - now give us the FieldCache section from the stats section please :)
>>
>> And the fieldValueCache section too (used for multi-valued faceting).
>>
>> -Yonik
>> http://www.lucidimagination.com
>
>


Re: search by some functionality

2009-10-05 Thread Elaine Li
Hi Shalin,

Thanks for your attention.

I am implementing a language translation search. field1 and field2
hold a sentence pair in two languages. field3 is a table of indexes of
words in field1 and field2. The table was created by some independent
algorithm. If string1 and string2 can be found aligned in the table,
then it is a hit. Otherwise, the document should not be returned.

Hope I clarified what I need to achieve. Your help is greatly appreciated!

Thanks!

Elaine

On Mon, Oct 5, 2009 at 5:57 AM, Shalin Shekhar Mangar
 wrote:
> On Sat, Oct 3, 2009 at 1:16 AM, Elaine Li  wrote:
>
>> Hi,
>>
>> My doc has three fields, say field1, field2, field3.
>>
>> My search would be q=field1:string1 && field2:string2. I also need to
>> do some computation and comparison of the string1 and string2 with the
>> contents in field3 and then determine if it is a hit.
>>
>> What can I do to implement this?
>>
>
> What exactly are you trying to achieve? What is the computation/comparison
> that you need to do?
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Keepwords Schema

2009-10-05 Thread Alexey Serba
Probably you want to use
- multivalued field 'authors'

  <doc>
    <field name="name">login.php</field>
    <field name="authors">alex</field>
    <field name="authors">brian</field>
    ...
  </doc>

- return facets for this field
- you can filter unwanted authors either during the indexing process or by
post-processing the returned search results (a quick sketch follows)
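
A rough sketch (the field definition and URL are just examples):

  <field name="authors" type="string" indexed="true" stored="true" multiValued="true"/>

in schema.xml, and at query time

  http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=authors

which gives one facet count per individual author instead of one per
comma-separated list.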

On Fri, Oct 2, 2009 at 4:35 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> On Thu, Oct 1, 2009 at 7:37 PM, matrix_psj  wrote:
>
> >
> >
> > An example:
> > My schema is about web files. Part of the syntax is a text field of
> authors
> > that have worked on each file, e.g.
> > 
> >login.php
> >   2009-01-01
> >   alex, brian, carl carlington, dave alpha, eddie, dave
> > beta
> > 
> >
> > When I perform a search and get 20 web files back, I would like a facet
> > of
> > the individual authors, but only if their name appears in a
> > public_authors.txt file.
> >
> > So if the public_authors.txt file contained:
> > Anna,
> > Bob,
> > Carl Carlington,
> > Dave Alpha,
> > Elvis,
> > Eddie,
> >
> > The facet returned would be:
> > Carl Carlington
> > Dave Alpha
> > Eddie
> >
> >
> >
> > Not sure if that makes sense? If it does, could someone explain to me the
> > schema fieldtype declarations that would bring back this sort of results.
> >
> >
> If I'm understanding you correctly - You want to facet on a field (with
> facet=true&facet.field=authors) but you want to show only certain
> whitelisted facet values in the response.
>
> If that is correct then, you can remove the authors which are not in the
> whitelist during indexing time. You can do this by adding
> KeepWordFilterFactory to your field type:
>
> <filter class="solr.KeepWordFilterFactory" words="public_authors.txt" ignoreCase="true" />
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Solr configuration file

2009-10-05 Thread Koji Sekiguchi

Both of those are parameters for the dismax query, described below:

http://wiki.apache.org/solr/DisMaxRequestHandler
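
To spell the example values out a bit (this is just a reading of the sample
config, nothing specific to your setup):

  <str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0</str>

qf lists the fields the dismax query searches, and field^boost multiplies
the score of matches in that field, so a match in id contributes ten times
as much as a match in a field boosted 1.0. It affects the ranking of your
own indexed documents, not of web pages in general.

  <str name="mm">2<-1 5<-2 6<90%</str>

mm is the minimum-should-match rule: with 1-2 query terms all must match,
with 3-5 terms all but one, with exactly 6 terms all but two, and above 6
at least 90% of them.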

Koji

bhaskar chandrasekar wrote:

Hi,
 
In my Solrconfig file, can anyone let me know what str name="qf" and
str name="mm" represent in the snippet below?
 



 <str name="defType">dismax</str>
 <str name="echoParams">explicit</str>
 <str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0</str>
 <str name="mm">2<-1 5<-2 6<90%</str>
 <str name="bq">incubationdate_dt:[* TO NOW/DAY-1MONTH]^2.2</str>

 
e.g. text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0. What do the ^ and the numeric part represent?

Does it have anything to do with ranking of web pages?
 
Regards

Bhaskar


  
  




Re: yellow pages navigation kind menu. howto take every 100th row from resultset

2009-10-05 Thread Alexey Serba
It seems that you need Faceted Search.
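
A rough sketch of the idea (the title_letter field is made up for
illustration): populate a small field with the first character of the title
at index time, then ask for facet counts over it:

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=title_letter&facet.limit=-1

The per-letter counts let you build the menu buckets on the server side
instead of pulling every 100th row to the client.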

On Fri, Oct 2, 2009 at 3:35 PM, Julian Davchev  wrote:
> Hi,
>
> Long story short: how can I take every 100th row from the Solr result set?
> What would the syntax for this be?
>
> Long story:
>
> Currently I have lots of say documents(articles) indexed. They all have
> field title with corresponding value.
>
> atitle
> btitle
> .
> *title
>
> How do I build a menu so I can search those?
> I cannot just hardcode A B C D ... meaning all starting
> with A, all starting with B etc., because there are unicode characters
> and the English alphabet will just not cut it...
>
> So my idea is to make ranges like
>
> [atitle - mtitle][mtitle - ltitle] ...etc etc   (based on
> actual title names I got)
>
>
> The question is how do I figure out what those atitle-mtitle boundaries are (like get
> from solr query every 100th record)
> Two solutions I found:
> 1. get all stuff and do it server side (huge load as it's thousands
> record we talk about)
> 2. use solr sort and &start and make N calls until   resulted rows <
> 100.But this will mean quite a load as well as there lots of records.
>
> Any pointers?
> Thanks
>
>
>


Re: Limit of a one-server-SOLR-installation

2009-10-05 Thread Thomas Koch
Hi Gasol Wu,

thanks for your reply. I tried to make the config and syslog shorter and more 
readable.

solrconfig.xml (shortened):


  
false
15
1500
2147483647
1
1000
1
  

  
false
10
1000
2147483647
1
  

  

  






true
10


false
4
  

  

  

  
 
   explicit
 
  

  

 explicit
 0.01
 
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
 
 
text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9
 
 
ord(poplarity)^0.5 recip(rord(price),1,1000,1000)^0.3
 
 
id,name,price,score
 
 
2<-1 5<-2 6<90%
 
 100
 *:*

  

  

 explicit
 text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0
 2<-1 5<-2 6<90%
 incubationdate_dt:[* TO NOW/DAY-1MONTH]^2.2


  inStock:true


  cat
  manu_exact
  price:[* TO 500]
  price:[500 TO *]

  
  
  
 
inStock:true
 
 
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
 
 
2<-1 5<-2 6<90%
 
  

  
5
   



syslog (shortened and formated):

o.a.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8080
o.a.catalina.startup.Catalina load
INFO: Initialization processed in 416 ms
o.a.catalina.core.StandardService start
INFO: Starting service Catalina
o.a.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.20
o.a.s.servlet.SolrDispatchFilter init
INFO: SolrDispatchFilter.init()
o.a.s.core.SolrResourceLoader locateInstanceDir
INFO: Using JNDI solr.home: /usr/share/solr
o.a.s.core.CoreContainer$Initializer initialize
INFO: looking for solr.xml: /usr/share/solr/solr.xml
o.a.s.core.SolrResourceLoader 
INFO: Solr home set to '/usr/share/solr/'
o.a.s.core.SolrResourceLoader createClassLoader
INFO: Reusing parent classloader
o.a.s.core.SolrResourceLoader locateInstanceDir
INFO: Using JNDI solr.home: /usr/share/solr
o.a.s.core.SolrResourceLoader 
INFO: Solr home set to '/usr/share/solr/'
o.a.s.core.SolrResourceLoader createClassLoader
INFO: Reusing parent classloader
o.a.s.core.SolrConfig 
INFO: Loaded SolrConfig: solrconfig.xml
o.a.s.core.SolrCore 
INFO: Opening new SolrCore at /usr/share/solr/, 
dataDir=/var/lib/solr/data/
o.a.s.schema.IndexSchema readSchema
INFO: Reading Solr Schema
o.a.s.schema.IndexSchema readSchema
INFO: Schema name=memoarticle
o.a.s.schema.IndexSchema readSchema
INFO: default search field is catchalltext
o.a.s.schema.IndexSchema readSchema
INFO: query parser default operator is AND
o.a.s.schema.IndexSchema readSchema
INFO: unique key field: id
o.a.s.core.SolrCore 
INFO: JMX monitoring not detected for core: null
o.a.s.core.SolrCore parseListener
INFO: Searching for listeners: //listener[@event="firstSearcher"]
o.a.s.core.SolrCore parseListener
INFO: Searching for listeners: //listener[@event="newSearcher"]
o.a.s.request.XSLTResponseWriter init
INFO: xsltCacheLifetimeSeconds=5
o.a.s.core.RequestHandlers$1 create
INFO: adding lazy requestHandler: solr.SpellCheckerRequestHandler
o.a.s.core.RequestHandlers$1 create
INFO: adding lazy requestHandler: solr.CSVRequestHandler
o.a.s.core.SolrCore initDeprecatedSupport
WARNING: solrconfig.xml uses deprecated <admin/gettableFiles>, Please 
update your config to use the ShowFileRequestHandler.
o.a.s.core.SolrCore initDeprecatedSupport
WARNING: adding ShowFileRequestHandler with hidden files: [SCRIPTS.CONF, 
PROTWORDS.TXT, STOPWORDS.TXT, SPELLINGS.TXT, XSLT, SYNONYMS.TXT, ELEVATE.XML]
o.a.s.search.SolrIndexSearcher 
INFO: Opening Searcher@484845aa main
o.a.s.update.DirectUpdateHandler2$CommitTracker 
INFO: AutoCommit: disabled
o.a.s.handler.component.SearchHandler inform
INFO: Adding  
component:org.apache.solr.handler.component.QueryComponent@4c650892 , 
component:org.apache.solr.handler.component.FacetComponent@7d15d06c , 
component:org.apache.solr.handler.component.MoreLikeThisComponent@2326a29c , 
component:org.apache.solr.handler.component.HighlightComponent@3d7dc1cb , 
debug component:org.apache.solr.handler.component.DebugComponent@b3e15f7 , 
component:org.apache.solr.handler.component.QueryComponent@4c650892 , 
component:org.apache.solr.handler.component.FacetComponent@7d15d06c , 
component:org.apache.solr.handler.component.MoreLikeThisComponent@2326a29c , 
component:org.apache.solr.handler.component.HighlightComponent@3d7dc1cb , 
debug component:org.apache.solr.handler.component.DebugComponent@b3e15f7 , 
component:org.apache.solr.handler.component.QueryComponent@4c650892 , 
component:org.apache.solr.handler.component.FacetComponent@7d15d06c , 
component:org.apache.solr.handler.component.MoreLikeThisComponent@2326a29c , 
component:org.apache.solr.handler.component.HighlightComponent@3d7dc1cb , 
debug component:org.apache.solr.h

Solr configuration file

2009-10-05 Thread bhaskar chandrasekar
Hi,
 
In my Solrconfig file, can anyone let me know what str name="qf" and
str name="mm" represent in the snippet below?
 

    
 <str name="defType">dismax</str>
 <str name="echoParams">explicit</str>
 <str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0</str>
 <str name="mm">2<-1 5<-2 6<90%</str>
 <str name="bq">incubationdate_dt:[* TO NOW/DAY-1MONTH]^2.2</str>
    
 
e.g. text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0. What do the ^ and the
numeric part represent?
Does it have anything to do with ranking of web pages?
 
Regards
Bhaskar


  

Re: Always spellcheck (suggest)

2009-10-05 Thread Shalin Shekhar Mangar
On Mon, Oct 5, 2009 at 10:24 AM, Christian Zambrano wrote:

> I am really surprised that a query for "behaviour" returns "behavior" as a
> suggestion only when the parameter "spellcheck.onlyMorePopular=true" is
> present. I re-read the documentation and I see nothing that will imply that
> the parameter onlyMorePopular will do anything else but filter the
> suggestions solr will return.
>
> Maybe somebody else can shed some light on this.
>
>
Yeah, that is true. All this is actually done in the Lucene SpellChecker.
Solr's component is a wrapper over it with some extra features. I've added a
clarification to the wiki page.
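
For anyone following along, the request being discussed is roughly of this
shape (assuming the spellcheck component is attached to the handler, as in
the example config):

  /select?q=behaviour&spellcheck=true&spellcheck.count=5&spellcheck.onlyMorePopular=true

With onlyMorePopular=true the Lucene SpellChecker only suggests terms with a
higher document frequency than the query term, which is why 'behavior' shows
up in that mode even though 'behaviour' itself exists in the index.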

-- 
Regards,
Shalin Shekhar Mangar.


Re: Merging multicore indexes

2009-10-05 Thread Shalin Shekhar Mangar
On Sun, Oct 4, 2009 at 8:05 PM, Paul Rosen wrote:

> Hi,
>
> I've been trying to experiment with merging, but have been running into
> some problems.
>
> First, I'm using ruby and the solr-ruby-0.0.7 gem. It looks like there is
> no support in that gem for merging. Have I overlooked something?
>
>
> Second, I was attempting to just follow the instructions in
> http://wiki.apache.org/solr/MergingSolrIndexes so I could see merging
> work. I just tried putting the sample url in the address bar of my browser,
> but it just sent me to the admin page. (It does the same thing as if I had
> left off all the parameters.) Here is the URL I constructed:
>
>
> http://localhost:8983/solr/merged/admin/?action=mergeindexes&core=merged&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_marc/index&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_rdf/index
>
> Why didn't that work? Do I have to POST that instead of using GET?
>

The path on the wiki page was wrong. You need to use the adminPath in the
url. Look at the adminPath attribute in solr.xml. It is typically
/admin/cores

So the correct path for you would be:

http://localhost:8983/solr/admin/cores?action=mergeindexes&core=merged&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_marc/index&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_rdf/index

I've fixed the wiki too.
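
For the record, the same call works from the command line, e.g.

  curl 'http://localhost:8983/solr/admin/cores?action=mergeindexes&core=merged&indexDir=...'

and, if I remember right, you need to send a commit to the target core
afterwards before the merged documents show up in searches.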


> Alternately, is there a way to specify merging from the admin interface?
>
> Third, I've googled for info about merging and not come up with any
> solutions, but I did see a possible concern:
>
> Is it true that after merging, that your index can have duplicate
> documents? If so, then I need to create a step after merging for deleting
> the old copy of everything I merged.
>
>
Yes it can have duplicate documents. Merge is handled by Lucene which does
not have the concept of a uniqueKey. I'm not sure how you can do that in a
separate step.


> Given all the above, I'm wondering if it would make more sense to just
> retrieve each document from the old index and add it to the new index and
> forget about merging. I know that would be a slow process, but I'm not sure
> how much slower that would be than doing the merge (how long does that
> take?), then going through the entire index and eliminating duplicates.
>
>
It could be slow. But if in the end you need to merge, can you skip the
intermediate lucene index completely?

-- 
Regards,
Shalin Shekhar Mangar.


Re: search by some functionality

2009-10-05 Thread Shalin Shekhar Mangar
On Sat, Oct 3, 2009 at 1:16 AM, Elaine Li  wrote:

> Hi,
>
> My doc has three fields, say field1, field2, field3.
>
> My search would be q=field1:string1 && field2:string2. I also need to
> do some computation and comparison of the string1 and string2 with the
> contents in field3 and then determine if it is a hit.
>
> What can I do to implement this?
>

What exactly are you trying to achieve? What is the computation/comparison
that you need to do?

-- 
Regards,
Shalin Shekhar Mangar.


Re: Question about PatternReplace filter and automatic Synonym generation

2009-10-05 Thread Shalin Shekhar Mangar
On Fri, Oct 2, 2009 at 11:31 PM, Prasanna Ranganathan <
pranganat...@netflix.com> wrote:

>
>  Does the PatternReplaceFilter have an option where you can keep the
> original token in addition to the modified token? From what I looked at it
> does not seem to but I want to confirm the same.
>
>
No, it does not.


> Alternatively, is there a filter available which takes in a pattern and
> produces additional forms of the token depending on the pattern? The use
> case I am looking at here is using such a filter to automate synonym
> generation. In our application, quite a few of the synonym file entries
> match a specific pattern and having such a filter would make it easier I
> believe. Pl. do correct me in case I am missing some unwanted side-effect
> with this approach.
>
>
I do not understand this. TokenFilters are used for things like stemming,
replacing patterns, lowercasing, n-gramming etc. The synonym filter inserts
additional tokens (synonyms) from a file for each token.

What exactly are you trying to do with synonyms? I guess you could do
stemming etc with synonyms but why do you want to do that?


> Continuing on that line, what is the performance hit in having additional
> index-time filters as opposed to using a synonym file with more entries?
> How
> does the overhead of using a bigger synonym file as opposed to additional
> filters compare?
>
>
Note that a change in synonym file needs a re-index of the affected
documents. Also, the synonym map is kept in memory.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Limit of a one-server-SOLR-installation

2009-10-05 Thread Gasol Wu
Hi,
We need more information to clarify this.
Can you paste your solrconfig.xml and the OOM exception log?


On Mon, Oct 5, 2009 at 5:19 PM, Thomas Koch  wrote:

> Hi,
>
> I'm running a read only index with SOLR 1.3 on a server with 8GB RAM and
> the
> Heap set to 6GB. The index contains 17 million documents and occupies 63GB
> of
> disc space with compression turned on. Replication frequency from the SOLR
> master is 5 minutes. The index should be able to support around 10
> concurrent
> searches.
>
> Now we start hitting RAM related errors like:
>
> - java.lang.OutOfMemoryError: Java heap space or
> - java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> which over time make the SOLR instance unresponsive.
>
> Before asking for advice on how to optimize my setup, I'd kindly ask for
> your
> experiences with setups of this size. Is it possible to run such a large
> index
> on only one server? Can I support even larger indexes when I tweak my
> configuration? Where's the limit when I need to split the index on multiple
> shards? When do I need to start considering a setup like/with Katta?
>
> Thanks for your insights,
>
> Thomas Koch, http://www.koch.ro
>


Limit of a one-server-SOLR-installation

2009-10-05 Thread Thomas Koch
Hi,

I'm running a read only index with SOLR 1.3 on a server with 8GB RAM and the 
Heap set to 6GB. The index contains 17 million documents and occupies 63GB of 
disc space with compression turned on. Replication frequency from the SOLR 
master is 5 minutes. The index should be able to support around 10 concurrent 
searches.

Now we start hitting RAM related errors like:

- java.lang.OutOfMemoryError: Java heap space or
- java.lang.OutOfMemoryError: GC overhead limit exceeded

which over time make the SOLR instance unresponsive.

Before asking for advice on how to optimize my setup, I'd kindly ask for your 
experiences with setups of this size. Is it possible to run such a large index 
on only one server? Can I support even larger indexes when I tweak my 
configuration? Where's the limit when I need to split the index on multiple 
shards? When do I need to start considering a setup like/with Katta?

Thanks for your insights,

Thomas Koch, http://www.koch.ro


Re: debugQuery different score for same query. dismax

2009-10-05 Thread Julian Davchev
Well,
Any explanation why I get different scores then?

Yonik Seeley wrote:
> On Fri, Oct 2, 2009 at 8:16 AM, Julian Davchev  wrote:
>   
>> It looks for "pari"   in   ancestorName  field   but first row looks in
>> 241135 records
>> and the second row it's just 187821 records.
>> 
>
> The "in 241135" is just saying that this match is in document #241135.
>
> -Yonik
> http://www.lucidimagination.com
>
>
>   
>>  Which in results give
>> lower score for the second row.
>>
>> Question is what is affecting this thingy cause I would expect same
>> fieldname same value to give same score.
>>
>> It's dismax query...I skipped showing scoring of other fields to simplify.
>>
>> Cheers
>>
>>
>> -3.7137468 = (MATCH) weight(ancestorName:pari^35.0 in 241135), product of:
>> +3.1832116 = (MATCH) weight(ancestorName:pari^35.0 in 187821), product of:
>>  0.8593 = queryWeight(ancestorName:pari^35.0), product of:
>> 35.0 = boost
>>  8.488684 = idf(docFreq=148, numDocs=74979)
>> 0.0033657781 = queryNorm
>> -3.713799 = (MATCH) fieldWeight(ancestorName:pari in 241135),
>> product of:
>> +3.1832564 = (MATCH) fieldWeight(ancestorName:pari in 187821),
>> product of:
>> 1.0 = tf(termFreq(ancestorName:pari)=1)
>> 8.488684 = idf(docFreq=148, numDocs=74979)
>> -0.4375 = fieldNorm(field=ancestorName, doc=241135)
>> +0.375 = fieldNorm(field=ancestorName, doc=187821)
>>
>> 



Re: Where to place ReversedWildcardFilterFactory in Chain

2009-10-05 Thread Chantal Ackermann
Sorry! I didn't replace the war file correctly. It was still the one 
from start of August.




Chantal Ackermann wrote:

Hi Mark,

the README.txt in the main directory contains:
$Id: CHANGES.txt 817424 2009-09-21 21:53:41Z yonik $

I've downloaded the package as artifact from the Hudson server.

Chantal

Mark Miller wrote:

It was added to trunk on the 11th and shouldn't require a patch. You
sure that nightly was actually built after then?

solr.ReversedWildcardFilterFactory should work fine.


Re: Where to place ReversedWildcardFilterFactory in Chain

2009-10-05 Thread Chantal Ackermann

Hi Mark,

the README.txt in the main directory contains:
$Id: CHANGES.txt 817424 2009-09-21 21:53:41Z yonik $

I've downloaded the package as artifact from the Hudson server.

Chantal

Mark Miller wrote:

It was added to trunk on the 11th and shouldn't require a patch. You
sure that nightly was actually built after then?

solr.ReversedWildcardFilterFactory should work fine.

Chantal Ackermann wrote:

Hi Andrzej,

thanks! Unfortunately, I get a ClassNotFoundException for the
solr.ReversedWildcardFilterFactory with my nightly build from 22nd of
September. I've found the corresponding JIRA issue, but from the wiki
it's not obvious that this might require a patch? I'll have a closer
look at the JIRA issue, in any case.

Cheers,
Chantal


Andrzej Bialecki wrote:

Chantal Ackermann wrote:

Thanks, Mark!
But I suppose it does matter where in the index chain it goes? I would
guess it is applied to the tokens, so I suppose I should put it at the
very end - after WordDelimiter and Lowercase have been applied.


Is that correct?

 >>   
 >> >splitOnCaseChange="1" splitOnNumerics="1"
 >>stemEnglishPossessive="1" generateWordParts="1"
 >>generateNumberParts="1" catenateAll="1"
 >>preserveOriginal="1" />
 >> 
   
 >>   

Yes. Care should be taken that the query analyzer chain produces the
same forward tokens, because the code in QueryParser that optionally
reverses tokens acts on tokens that it receives _after_ all other query
analyzers have run on the query.
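
As a concrete sketch of that ordering (tokenizer and attribute values here
are only illustrative, not the exact chain above):

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2"/>
  </analyzer>

with the query-side analyzer identical up to, but not including, the
reversal step, so both sides produce the same forward tokens.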


--
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




--
- Mark

http://www.lucidimagination.com