Re: change sort order for MoreLikeThis

2009-08-03 Thread Avlesh Singh
You can boost the similarity field matches, if you want. Look for mlt.boost
at http://wiki.apache.org/solr/MoreLikeThis
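For example, a request along these lines (host, handler path and field names are only
placeholders for your setup) enables boosting of the "interesting" terms:

  http://localhost:8983/solr/mlt?q=id:12345&mlt.fl=title,description&mlt.boost=true&rows=10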

Cheers
Avlesh

On Mon, Aug 3, 2009 at 11:33 AM, Renz Daluz  wrote:

> Hi,
>
> I'm looking at changing the result order when searching by MLT. I tried the
> sort= parameter, but it's not working. I checked the wiki and can't find
> anything. Is there a way to do this?
>
> Thanks,
> /Laurence
>


Re: How to get a stack trace

2009-08-03 Thread Nicolae Mihalache
Hello,

Thanks for your answers and suggestions. I can also get a heap dump with the
jmap command. The resulting file is so big that jhat itself runs out of memory
when reading the dump.
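For reference, the commands were along these lines (paths and pid are placeholders):

  jmap -dump:format=b,file=/tmp/solr-heap.hprof <solr-pid>
  jhat -J-Xmx2g /tmp/solr-heap.hprof

The -J-Xmx option gives jhat itself more heap for reading large dumps.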

I traced my problem back (using the heap dump analyser from yourkit.com) to
the FieldCache. In fact it had nothing to do with the index optimization, but
with some faceted-search queries running in the background from cron. I will
come back with another email about that.


But nevertheless, wouldn't it be a good idea to print the stack trace when
SEVERE errors are encountered?


nicolae



On Mon, Aug 3, 2009 at 3:20 AM, Bill Au  wrote:

> Your heap may be just too small or you may have a memory leak.  A stack
> trace may not help you since the thread encountered the OutOfMemoryError
> may
> not be where the memory leak is.  A heap dump will tell you what's using up
> all the memory in your heap.
> Bill
>
> On Thu, Jul 30, 2009 at 3:54 PM, Nicolae Mihalache  >wrote:
>
> > Hello,
> >
> > I'm a new user of solr but I have worked a bit with Lucene before. I get
> > some out of memory exception when optimizing the index through Solr and I
> > would like to find out why.
> > However, the only message I get on standard output is:
> > Jul 30, 2009 9:20:22 PM org.apache.solr.common.SolrException log
> > SEVERE: java.lang.OutOfMemoryError: Java heap space
> >
> > Is there a way to get a stack trace for this exception? I had a look into
> > the java.util.logging options and didn't find anything.
> >
> > My solr runs in some standard configuration inside jetty.
> > Any suggestion would be appreciated.
> >
> > Thanks,
> > nicolae
> >
> >
> >
> >
>


faceted search cache and optimisations

2009-08-03 Thread Nicolae Mihalache
Hello,

I'm using faceted search (perhaps in a dumb way) to collect some statistics
for my index. I have documents in various languages, one of the fields is
"language", and I simply want to see how many documents I have for each
language. I have noticed that the search builds an int[maxDoc] array and then
traverses the array to count. If facet.method=enum (discovered later) is
used, things are still counted, just in a different way. But for this case,
where all the documents are retrieved, the information is already available
in the Lucene index.
So, I think it would be a good optimization to detect these cases (i.e. no
filtering) and just return the number from the index instead of counting the
docs again.

Another issue: there is currently no way to disable the caching of the
int[maxDoc], is there? If there are many fields to be faceted, this can
quickly lead to out-of-memory situations. I think it would be good to give
the option (as part of the query) to disable the caching: even if it is
slow, at least it works, and it is useful for non-interactive processing.

And another possible optimization for the int[maxDoc], inspired by
column-store databases: the way they do it is to find the minimum number of
bits needed to represent a value. If, for example, my language field has 30
possible values (i.e. I have docs in 30 languages), I only need 5 bits for
each doc (instead of int = 32 bits). Then I can represent the whole
int[maxDoc] in less than 1/6 of the space required now.
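Just to illustrate the packing idea (a toy sketch, not Solr/Lucene code; class and
method names are made up):

  class PackedValues {
      // pack one small "value id" per document into bitsPerValue bits
      private final int bitsPerValue;
      private final long[] bits;

      PackedValues(int maxDoc, int bitsPerValue) {
          this.bitsPerValue = bitsPerValue;
          this.bits = new long[(int) (((long) maxDoc * bitsPerValue + 63) / 64)];
      }

      void set(int doc, int value) {
          long bitPos = (long) doc * bitsPerValue;
          int idx = (int) (bitPos >>> 6);
          int shift = (int) (bitPos & 63);
          bits[idx] |= ((long) value) << shift;
          if (shift + bitsPerValue > 64) {            // value straddles two longs
              bits[idx + 1] |= ((long) value) >>> (64 - shift);
          }
      }

      int get(int doc) {
          long bitPos = (long) doc * bitsPerValue;
          int idx = (int) (bitPos >>> 6);
          int shift = (int) (bitPos & 63);
          long v = bits[idx] >>> shift;
          if (shift + bitsPerValue > 64) {
              v |= bits[idx + 1] << (64 - shift);
          }
          return (int) (v & ((1L << bitsPerValue) - 1));
      }
  }

With bitsPerValue = 5 this takes roughly 1/6 of the space of an int[maxDoc], as above.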
What's even better, sometimes the documents can be partitioned such that not
all the values of a field are represented in the same partition.
For example, let's assume that I have a field called doc_generation_date. If
I harvest the documents every three days, and I consider a partition as
holding the same three days of data, then for each partition I will basically
have only three possible values for doc_generation_date. That means I only
need 2 bits for each document, plus a table per partition that maps from the
partition value id (one of the three values represented in two bits) to the
index value id (that is, the id stored in the Lucene index).
Of course, for the language field above, the partitioning would not help
unless I index successively only English docs, then only French, etc.
And it also wouldn't work just like that for multi-valued fields.

nicolae


Re: Create new (sub)index from existing one

2009-08-03 Thread Chantal Ackermann

Hi Otis,

thank you for your answer. I will have a look at the link.
It can happen that search queries include several languages, and I 
thought at first that creating a multilingual (sub-)index from a mix of 
the existing core indexes could be a way. But that is not a good 
solution, is it? I do have to decide on a language when posting the 
query to SOLR in order to apply the correct analyzers to the query 
input. So, it would be better to keep the languages separate and rather 
post a separate query for each language, should that need arise. That is 
still fast enough, I think.


Cheers,
Chantal

Otis Gospodnetic schrieb:

Hi Chantal,

If you are really dealing with languages (i.e. not very dynamic or large list 
of cores), could you simply copy a core and then delete all documents that 
don't satisfy the language criteria? (e.g. copy the main core with docs in all 
languages, and then delete all docs that are not in, say, Japanese).

Otis
P.S.
May want to consider 
http://www.sematext.com/products/multilingual-indexer/index.html




- Original Message 

From: Chantal Ackermann 
To: "solr-user@lucene.apache.org" 
Sent: Tuesday, July 28, 2009 4:54:27 AM
Subject: Create new (sub)index from existing one

Hi all,

I was wondering whether it is possible to create a new index out of existing
ones.

Say, you have multiple indexes (multicore) that are based on language. There are
client applications that should see only a subset of the data contained in the
indexes. Instead of constantly having to restrict the queries coming from that
client - would it be possible and would it make sense to create a subset of the
indexes that contains only the data that that client should see? A bit like a
cache.

Sorry, if that is answered already somewhere - I couldn't find anything
searching Google. Maybe I was looking for the wrong keywords.

Cheers,
Chantal


-- Chantal Ackermann




--
Chantal Ackermann
Consultant

mobil+49 (176) 10 00 09 45
emailchantal.ackerm...@btelligent.de



b.telligent GmbH & Co. KG
Lichtenbergstraße 8
D-85748 Garching / München

fon   +49 (89) 54 84 25 60
fax+49 (89) 54 84 25 69
web  www.btelligent.de

Registered in Munich: HRA 84393
Managing Director: b.telligent Verwaltungs GmbH, HRB 153164 represented 
by Sebastian Amtage and Klaus Blaschek

USt.Id.-Nr. DE814054803





Re: How to get a stack trace

2009-08-03 Thread Chantal Ackermann

kill -3 [JVM-process-id] (SIGQUIT)
sends a signal to the JVM to dump all thread stacks. It does NOT kill 
it. It's only for outputting the stacks. You can see whether there are 
any threads that lock up.
This produces a LOT of output in the main logfile (e.g. catalina.out for 
Tomcat).


(see for example
http://www.unixville.com/~moazam/stories/2004/05/18/debuggingHangsInTheJvm.html
and others)


Otis Gospodnetic schrieb:

Nicolae,

You may be able to figure things out from the heap dump.  You'll need to start 
the JVM like this, for example:

java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap ...

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 

From: Nicolae Mihalache 
To: solr-user@lucene.apache.org
Sent: Thursday, July 30, 2009 3:54:41 PM
Subject: How to get a stack trace

Hello,

I'm a new user of solr but I have worked a bit with Lucene before. I get some
out of memory exception when optimizing the index through Solr and I would like
to find out why.
However, the only message I get on standard output is:
Jul 30, 2009 9:20:22 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.OutOfMemoryError: Java heap space

Is there a way to get a stack trace for this exception? I had a look into the
java.util.logging options and didn't find anything.

My solr runs in some standard configuration inside jetty.
Any suggestion would be appreciated.

Thanks,
nicolae




--
Chantal Ackermann
Consultant

mobil+49 (176) 10 00 09 45
emailchantal.ackerm...@btelligent.de



b.telligent GmbH & Co. KG
Lichtenbergstraße 8
D-85748 Garching / München

fon   +49 (89) 54 84 25 60
fax+49 (89) 54 84 25 69
web  www.btelligent.de

Registered in Munich: HRA 84393
Managing Director: b.telligent Verwaltungs GmbH, HRB 153164 represented 
by Sebastian Amtage and Klaus Blaschek

USt.Id.-Nr. DE814054803





Re: How to get a stack trace

2009-08-03 Thread Nicolae Mihalache
Thanks, I will try that, maybe together with the -XX:OnOutOfMemoryError="<cmd args>; <cmd args>" option:
http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp
But now I know where the problem came from, so maybe next time.
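Something along these lines should do it (the heap path and the jetty start command
are only examples; I have not verified whether the %p pid substitution is expanded
for OnOutOfMemoryError):

  java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap \
       -XX:OnOutOfMemoryError="kill -3 %p" -jar start.jar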


On Mon, Aug 3, 2009 at 11:08 AM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:

> kill -3 [JVM-process-id] (SIGQUIT)
> sends a signal to the JVM to dump all thread stacks. It does NOT kill it.
> It's only for outputting the stacks. You can see whether there are any
> threads that lock up.
> This produces a LOT of output in the main logfile (e.g. catalina.out for
> Tomcat).
>
> (see for example
>
> http://www.unixville.com/~moazam/stories/2004/05/18/debuggingHangsInTheJvm.html
> and others)
>
>
> Otis Gospodnetic schrieb:
>
>> Nicolae,
>>
>> You may be able to figure things out from the heap dump.  You'll need to
>> start the JVM like this, for example:
>>
>> java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap ...
>>
>>  Otis
>> --
>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>
>>
>>
>> - Original Message 
>>
>>> From: Nicolae Mihalache 
>>> To: solr-user@lucene.apache.org
>>> Sent: Thursday, July 30, 2009 3:54:41 PM
>>> Subject: How to get a stack trace
>>>
>>> Hello,
>>>
>>> I'm a new user of solr but I have worked a bit with Lucene before. I get
>>> some
>>> out of memory exception when optimizing the index through Solr and I
>>> would like
>>> to find out why.
>>> However, the only message I get on standard output is:
>>> Jul 30, 2009 9:20:22 PM org.apache.solr.common.SolrException log
>>> SEVERE: java.lang.OutOfMemoryError: Java heap space
>>>
>>> Is there a way to get a stack trace for this exception? I had a look into
>>> the
>>> java.util.logging options and didn't find anything.
>>>
>>> My solr runs in some standard configuration inside jetty.
>>> Any suggestion would be appreciated.
>>>
>>> Thanks,
>>> nicolae
>>>
>>
>>
> --
> Chantal Ackermann
> Consultant
>
> mobil+49 (176) 10 00 09 45
> emailchantal.ackerm...@btelligent.de
>
>
> 
>
> b.telligent GmbH & Co. KG
> Lichtenbergstraße 8
> D-85748 Garching / München
>
> fon   +49 (89) 54 84 25 60
> fax+49 (89) 54 84 25 69
> web  www.btelligent.de
>
> Registered in Munich: HRA 84393
> Managing Director: b.telligent Verwaltungs GmbH, HRB 153164 represented by
> Sebastian Amtage and Klaus Blaschek
> USt.Id.-Nr. DE814054803
>
>
>
>


Re: No wildcards with solr.ASCIIFoldingFilterFactory?

2009-08-03 Thread vladimirneu

Dear Mark,

Thank you very much for your suggestions. It works really great. I constructed
a new plugin based on Lucene's AnalyzingQueryParser, using its getPrefixQuery,
getRangeQuery and getWildcardQuery methods. With this plugin we can now search
with leading wildcards, case-sensitively, and with analyzed wildcard searches.
I attach the patch for everybody who wants it. Run ant and register the new
parser in solrconfig.xml:
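The registration is something along these lines (the class name below is only a
placeholder for whatever the patch builds):

  <queryParser name="aqparser" class="org.apache.solr.search.AnalyzingQParserPlugin"/>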



Searching with the new parser:
http://localhost:8080/solr_be/select/?q={!aqparser}Wożni*

Sincerely,

Vladimir

http://www.nabble.com/file/p24787962/aqparser.patch aqparser.patch 
-- 
View this message in context: 
http://www.nabble.com/No-wildcards-with-solr.ASCIIFoldingFilterFactory--tp24162104p24787962.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr failing on "y" charakter in string?

2009-08-03 Thread gateway0

Ok, thanks, you're right.

But the thing is, my users will often search for expressions like "Harr" or
"har" etc.

So I thought I would automatically add the wildcard "*" to every request.

If that too gets me into trouble ("Harr*" => no result, "harry*" => no result),
what should I do?



Otis Gospodnetic wrote:
> 
> I believe it's because wildcard queries are not stemmed.  During indexing
> "harry" probably got stemmed to "harr", so now "harry*" doesn't match,
> because there is no "harry" token in that string, only "harr".  Why
> wildcard queries are not analyzed is described in the Lucene FAQ on the
> Lucene Wiki.
> 
> You could also try searching for kunde:Harr* for example (not the
> upper-case Harr).  I bet it won't result in a hit for the same reason - at
> index time you probably lower-case tokens with LowerCaseFilter(Factory),
> and if you search for Harr*, the lower-casing won't happen because the
> query string with the wildcard character isn't analyzed.
> 
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> 
> 
> 
> - Original Message 
>> From: gateway0 
>> To: solr-user@lucene.apache.org
>> Sent: Sunday, August 2, 2009 7:30:19 PM
>> Subject: Solr failing on "y" charakter in string?
>> 
>> 
>> Hi,
>> 
>> I have the following setting:
>> schema.xml:
>> ""
>> the "text" field-type was updated with the "preserveOriginal=1" option in
>> the schema
>> 
>> I have the following string indexd in the field "kunde"
>> "Harry Heim KG"
>> 
>> Now when I search for "kunde:harry*" it gives me an empty result.
>> 
>> When I search for "kunde:harry" I get the right result. Also
>> "kunde:harr*"
>> works just fine.
>> 
>> The strange thing is that with every other string (for example
>> "kunde:heim*") I will get the right result. 
>> 
>> So why not on "harry*" with an "y*" at the end?
>> 
>> kind regards, S.
>> -- 
>> View this message in context: 
>> http://www.nabble.com/Solr-failing-on-%22y%22-charakter-in-string--tp24783211p24783211.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Solr-failing-on-%22y%22-charakter-in-string--tp24783211p24789070.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr failing on "y" charakter in string?

2009-08-03 Thread Avlesh Singh
The easiest thing to do would be to create a new field in your schema which
only has a LowerCaseFilter applied to it. While searching, perform searches
across the two fields; you'll get the desired results.

You can use the <copyField> directive in your schema.xml to copy data
from your original field into the new field.
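A rough sketch of what that could look like in schema.xml (type and field names are
only examples):

  <fieldType name="text_lc" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="kunde_lc" type="text_lc" indexed="true" stored="false"/>
  <copyField source="kunde" dest="kunde_lc"/>

A query like kunde:harry OR kunde_lc:harr* (with the wildcard term lower-cased by
your application) should then cover both the stemmed and the prefix case.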

Cheers
Avlesh

On Mon, Aug 3, 2009 at 4:51 PM, gateway0  wrote:

>
> Ok thanks you´re right.
>
> But the thing is my users will often search for expressions like: "Harr" or
> "har" etc..
>
> So I thought I automatically add the wildcard "*" to every request.
>
> If that too gets me into trouble "Harr*"=>no result "harry*"=>no result
>
> What should I do?
>
>
>
> Otis Gospodnetic wrote:
> >
> > I believe it's because wildcard queries are not stemmed.  During indexing
> > "harry" probably got stemmed to "harr", so now "harry*" doesn't match,
> > because there is no "harry" token in that string, only "harr".  Why
> > wildcard queries are not analyzed is described in the Lucene FAQ on the
> > Lucene Wiki.
> >
> > You could also try searching for kunde:Harr* for example (not the
> > upper-case Harr).  I bet it won't result in a hit for the same reason -
> at
> > index time you probably lower-case tokens with LowerCaseFilter(Factory),
> > and if you search for Harr*, the lower-casing won't happen because the
> > query string with the wildcard character isn't analyzed.
> >
> > Otis
> > --
> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >
> >
> >
> > - Original Message 
> >> From: gateway0 
> >> To: solr-user@lucene.apache.org
> >> Sent: Sunday, August 2, 2009 7:30:19 PM
> >> Subject: Solr failing on "y" charakter in string?
> >>
> >>
> >> Hi,
> >>
> >> I have the following setting:
> >> schema.xml:
> >> ""
> >> the "text" field-type was updated with the "preserveOriginal=1" option
> in
> >> the schema
> >>
> >> I have the following string indexd in the field "kunde"
> >> "Harry Heim KG"
> >>
> >> Now when I search for "kunde:harry*" it gives me an empty result.
> >>
> >> When I search for "kunde:harry" I get the right result. Also
> >> "kunde:harr*"
> >> works just fine.
> >>
> >> The strange thing is that with every other string (for example
> >> "kunde:heim*") I will get the right result.
> >>
> >> So why not on "harry*" with an "y*" at the end?
> >>
> >> kind regards, S.
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/Solr-failing-on-%22y%22-charakter-in-string--tp24783211p24783211.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Solr-failing-on-%22y%22-charakter-in-string--tp24783211p24789070.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Copy Field Question

2009-08-03 Thread Chantal Ackermann

Dear all,

before searching through the source code - maybe one of you can answer 
this easily:


When, and based on what, are the tokenizer and filters applied when 
copying fields? Can it happen that fields are analyzed twice (once when 
creating the first field, and a second time when they are copied to 
another field)?



Here is an example from my current setup.
I have the following types defined in schema.xml:





	words="stopwords_de.txt" />
	generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
catenateAll="0" splitOnCaseChange="1" />







	words="stopwords_de.txt" />
	generateNumberParts="1" catenateWords="0" catenateNumbers="0" 
catenateAll="0" splitOnCaseChange="1" />







Used for those fields:

required="true" />
required="false" />
required="false" />
required="false" />


Which are used to populate this field using the copy field directive:



like that (that is what I do, now, at least):







I am copying fields with different types to all_text_de, e.g. title is 
different from subtitle_text_de. Is the value copied to the destination 
field the raw (input) value, or the already analyzed one?



Thanks!
Chantal


--
Chantal Ackermann


Re: Copy Field Question

2009-08-03 Thread Mark Miller
It's the pre-analyzed form that's copied. The field that it's copied to will
determine the analyzer/filters for that field.
If you want to check out the code doing it, it's
in org.apache.solr.update.DocumentBuilder.
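In other words, with something like

  <copyField source="title" dest="all_text_de"/>

the raw title string is what gets copied, and it is then analyzed by whatever
tokenizer/filters the all_text_de field type defines - so there is no double analysis.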

-- 
- Mark

http://www.lucidimagination.com

On Mon, Aug 3, 2009 at 8:12 AM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:

> Dear all,
>
> before searching through the source code - maybe one of you can answer this
> easily:
>
> When and based on what are the tokenizer and filters applied when copying
> fields? Can it happen that fields are analyzed twice (once when creating the
> first field, and a second time when they are copied to the another field)?
>
>
> Here an example from my current setup:
> I have the following types defined, in schema.xml:
>
>  positionIncrementGap="100">
>
>
>
> words="stopwords_de.txt" />
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
>
> />
>
>
>
>
> words="stopwords_de.txt" />
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
>
> />
>
>
> 
>
> Used for those fields:
>
>  required="true" />
>  required="false" />
>  required="false" />
>  required="false" />
>
> Which are used to populate this field using the copy field directive:
>
> multiValued="true" />
>
> like that (that is what I do, now, at least):
>
> 
> 
> 
> 
>
>
> I am copying fields with different types to all_text_de, e.g. title is
> different from subtitle_text_de. Is the valued copied to the destination
> field the raw (input) value or the already analyzed one?
>
>
> Thanks!
> Chantal
>
>
> --
> Chantal Ackermann
>


Re: Copy Field Question

2009-08-03 Thread Chantal Ackermann

Thanks, Mark!


Mark Miller schrieb:

Its the pre-analyzed form thats copied. The field that its copied to will
determine the analyzer/filters for that field.
If you want to check out the code doing it, its
in org.apache.solr.update.DocumentBuilder

--
- Mark

http://www.lucidimagination.com

On Mon, Aug 3, 2009 at 8:12 AM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:


Dear all,

before searching through the source code - maybe one of you can answer this
easily:

When and based on what are the tokenizer and filters applied when copying
fields? Can it happen that fields are analyzed twice (once when creating the
first field, and a second time when they are copied to the another field)?


Here an example from my current setup:
I have the following types defined, in schema.xml:


   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   


Used for those fields:






Which are used to populate this field using the copy field directive:



like that (that is what I do, now, at least):







I am copying fields with different types to all_text_de, e.g. title is
different from subtitle_text_de. Is the valued copied to the destination
field the raw (input) value or the already analyzed one?


Thanks!
Chantal


--
Chantal Ackermann



--
Chantal Ackermann
Consultant

mobil+49 (176) 10 00 09 45
emailchantal.ackerm...@btelligent.de



b.telligent GmbH & Co. KG
Lichtenbergstraße 8
D-85748 Garching / München

fon   +49 (89) 54 84 25 60
fax+49 (89) 54 84 25 69
web  www.btelligent.de

Registered in Munich: HRA 84393
Managing Director: b.telligent Verwaltungs GmbH, HRB 153164 represented 
by Sebastian Amtage and Klaus Blaschek

USt.Id.-Nr. DE814054803





Re: Solr failing on "y" charakter in string?

2009-08-03 Thread gateway0

Ok still not working with new field "text_two":
"text:"Har*" text_two:"Har*""
==> result 0

Schema Updates:
"

  


  
  


 






"

This is what you suggested, right?

kind regards, S.



gateway0 wrote:
> 
> Hi,
> 
> I have the following setting:
> schema.xml:
> ""
> the "text" field-type was updated with the "preserveOriginal=1" option in
> the schema
> 
> I have the following string indexd in the field "kunde"
> "Harry Heim KG"
> 
> Now when I search for "kunde:harry*" it gives me an empty result.
> 
> When I search for "kunde:harry" I get the right result. Also "kunde:harr*"
> works just fine.
> 
> The strange thing is that with every other string (for example
> "kunde:heim*") I will get the right result. 
> 
> So why not on "harry*" with an "y*" at the end?
> 
> kind regards, S.
> 

-- 
View this message in context: 
http://www.nabble.com/Solr-failing-on-%22y%22-charakter-in-string--tp24783211p24790774.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr failing on "y" charakter in string?

2009-08-03 Thread gateway0

Ok still not working with new field "text_two":
"text:"Har*" text_two:"Har*""
==> result 0

Schema Updates:
"

  
  

  
  
  
  
  






"

This is what you suggested, right?

kind regards, S. 



Avlesh Singh wrote:
> 
> The easiest thing to do would be to create a new field in your schema
> which
> only has a lowercasefilter applied to it. While searching perform searches
> across the two fields. You'll get desired results.
> 
> You can use the  directive in your schema.xml for copying data
> from your original field into the new field.
> 
> Cheers
> Avlesh
> 
> On Mon, Aug 3, 2009 at 4:51 PM, gateway0  wrote:
> 
>>
>> Ok thanks you´re right.
>>
>> But the thing is my users will often search for expressions like: "Harr"
>> or
>> "har" etc..
>>
>> So I thought I automatically add the wildcard "*" to every request.
>>
>> If that too gets me into trouble "Harr*"=>no result "harry*"=>no result
>>
>> What should I do?
>>
>>
>>
>> Otis Gospodnetic wrote:
>> >
>> > I believe it's because wildcard queries are not stemmed.  During
>> indexing
>> > "harry" probably got stemmed to "harr", so now "harry*" doesn't match,
>> > because there is no "harry" token in that string, only "harr".  Why
>> > wildcard queries are not analyzed is described in the Lucene FAQ on the
>> > Lucene Wiki.
>> >
>> > You could also try searching for kunde:Harr* for example (not the
>> > upper-case Harr).  I bet it won't result in a hit for the same reason -
>> at
>> > index time you probably lower-case tokens with
>> LowerCaseFilter(Factory),
>> > and if you search for Harr*, the lower-casing won't happen because the
>> > query string with the wildcard character isn't analyzed.
>> >
>> > Otis
>> > --
>> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>> >
>> >
>> >
>> > - Original Message 
>> >> From: gateway0 
>> >> To: solr-user@lucene.apache.org
>> >> Sent: Sunday, August 2, 2009 7:30:19 PM
>> >> Subject: Solr failing on "y" charakter in string?
>> >>
>> >>
>> >> Hi,
>> >>
>> >> I have the following setting:
>> >> schema.xml:
>> >> ""
>> >> the "text" field-type was updated with the "preserveOriginal=1" option
>> in
>> >> the schema
>> >>
>> >> I have the following string indexd in the field "kunde"
>> >> "Harry Heim KG"
>> >>
>> >> Now when I search for "kunde:harry*" it gives me an empty result.
>> >>
>> >> When I search for "kunde:harry" I get the right result. Also
>> >> "kunde:harr*"
>> >> works just fine.
>> >>
>> >> The strange thing is that with every other string (for example
>> >> "kunde:heim*") I will get the right result.
>> >>
>> >> So why not on "harry*" with an "y*" at the end?
>> >>
>> >> kind regards, S.
>> >> --
>> >> View this message in context:
>> >>
>> http://www.nabble.com/Solr-failing-on-%22y%22-charakter-in-string--tp24783211p24783211.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Solr-failing-on-%22y%22-charakter-in-string--tp24783211p24789070.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Solr-failing-on-%22y%22-charakter-in-string--tp24783211p24790836.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: How to get a stack trace

2009-08-03 Thread Smiley, David W.
I much prefer "jstack" (which comes with the JDK), which exists for exactly this 
purpose.  I always forget the right signal number when using kill.  When using jstack, 
the stack goes to the current terminal session, not Solr's output -- very 
convenient.
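For example (the pid placeholder is whatever your Solr JVM's process id is):

  jstack -l <solr-pid> > /tmp/solr-threads.txt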

~ David Smiley


From: Nicolae Mihalache [xproma...@gmail.com]
Sent: Monday, August 03, 2009 5:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to get a stack trace

Thanks, I will try that, maybe together with the -XX:OnOutOfMemoryError="<cmd args>; <cmd args>" option:
http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp
But now I know where the problem came from, so maybe next time.


On Mon, Aug 3, 2009 at 11:08 AM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:

> kill -3 [JVM-process-id] (SIGQUIT)
> sends a signal to the JVM to dump all thread stacks. It does NOT kill it.
> It's only for outputting the stacks. You can see whether there are any
> threads that lock up.
> This produces a LOT of output in the main logfile (e.g. catalina.out for
> Tomcat).
>
> (see for example
>
> http://www.unixville.com/~moazam/stories/2004/05/18/debuggingHangsInTheJvm.html
> and others)
>
>
> Otis Gospodnetic schrieb:
>
>> Nicolae,
>>
>> You may be able to figure things out from the heap dump.  You'll need to
>> start the JVM like this, for example:
>>
>> java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap ...
>>
>>  Otis
>> --
>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>
>>
>>
>> - Original Message 
>>
>>> From: Nicolae Mihalache 
>>> To: solr-user@lucene.apache.org
>>> Sent: Thursday, July 30, 2009 3:54:41 PM
>>> Subject: How to get a stack trace
>>>
>>> Hello,
>>>
>>> I'm a new user of solr but I have worked a bit with Lucene before. I get
>>> some
>>> out of memory exception when optimizing the index through Solr and I
>>> would like
>>> to find out why.
>>> However, the only message I get on standard output is:
>>> Jul 30, 2009 9:20:22 PM org.apache.solr.common.SolrException log
>>> SEVERE: java.lang.OutOfMemoryError: Java heap space
>>>
>>> Is there a way to get a stack trace for this exception? I had a look into
>>> the
>>> java.util.logging options and didn't find anything.
>>>
>>> My solr runs in some standard configuration inside jetty.
>>> Any suggestion would be appreciated.
>>>
>>> Thanks,
>>> nicolae
>>>
>>
>>
> --
> Chantal Ackermann
> Consultant
>
> mobil+49 (176) 10 00 09 45
> emailchantal.ackerm...@btelligent.de
>
>
> 
>
> b.telligent GmbH & Co. KG
> Lichtenbergstraße 8
> D-85748 Garching / München
>
> fon   +49 (89) 54 84 25 60
> fax+49 (89) 54 84 25 69
> web  www.btelligent.de
>
> Registered in Munich: HRA 84393
> Managing Director: b.telligent Verwaltungs GmbH, HRB 153164 represented by
> Sebastian Amtage and Klaus Blaschek
> USt.Id.-Nr. DE814054803
>
>
>
>

RE: Solr failing on "y" charakter in string?

2009-08-03 Thread Ensdorf Ken
> Ok still not working with new field "text_two":
> "text:"Har*" text_two:"Har*""
> ==> result 0
>
> Schema Updates:
> "
>  positionIncrementGap="100">
>   
>   
> 
>   
>   
>   
>   
>   
> 
>
>
>  multiValued="true"/>
>
> 
> "

I'm pretty sure the query string needs to be lower-case, since a wildcard query 
is not analyzed.

I think what Avlesh was suggesting was more like this:

"text:"Har" text_two:"har*""

So the original field would be for a regular query containing whatever the user 
entered and would undergo the usual analysis for searching, and the secondary 
field would be used to construct a wildcard query which would strictly serve 
the "begins-with" case.

-Ken


Re: faceted search cache and optimisations

2009-08-03 Thread Yonik Seeley
On Mon, Aug 3, 2009 at 4:45 AM, Nicolae Mihalache wrote:
> Hello,
>
> I'm using faceted search (perhaps in a dumb way) to collect some statistics
> for my index. I have documents in various languages, one of the field is
> "language" and I simply want to see how many documents I have for each
> language. I have noticed that the search builds a int[maxDoc] array and then
> traverses the array to count. If facet.method=enum (discovered later) is
> used, the things are still counted in a different way. But for this case
> where all the documents are retrieved, the information is already available
> in the lucene index.

> So, I think it would be a good optimization to detect these cases (i.e. no
> filtering) and just return the number from the index instead of counting the
> docs again.

That would require
 - a base query that matched the entire index
 - no filters
 - no deletions in the index

If you want those numbers, see the terms component.
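(Assuming the /terms handler from the example solrconfig.xml is registered, a request
along these lines returns each indexed language value with its document frequency;
note that those counts include deleted documents until they are expunged:

  http://localhost:8983/solr/terms?terms.fl=language&terms.limit=-1 )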

> Another issue: there is no way currently to disable the caching of the
> int[maxDoc], is there?

use facet.method=enum... the number of filters cached can be
controlled by the filterCache.
You can also prevent the filterCache from being used via the
facet.enum.cache.minDf param.
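For example (values are only illustrative):

  ...&facet=true&facet.field=language&facet.method=enum&facet.enum.cache.minDf=100

would only use the filterCache for terms that match at least 100 documents.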

-Yonik


ClassCastException from custom request handler

2009-08-03 Thread James Brady
Hi,
I'm creating a custom request handler to return a list of live cores in
Solr.

On startup, I get this exception for each core:

Jul 31, 2009 5:20:39 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassCastException: LiveCoresHandler
at
org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
at
org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at
org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:444)

I've tried a few variations on the class definition, including extending
RequestHandlerBase (as suggested here:
http://wiki.apache.org/solr/SolrRequestHandler#head-1de7365d7ecf2eac079c5f8b92ee9af712ed75c2)
and implementing SolrRequestHandler directly.
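For reference, the RequestHandlerBase variant looks roughly like this (the package
name and the core-listing body are made up for illustration; only the skeleton
matters, and this assumes the Solr 1.3/1.4-era API):

  package myhandlers;

  import org.apache.solr.handler.RequestHandlerBase;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.request.SolrQueryResponse;

  public class LiveCoresHandler extends RequestHandlerBase {
      @Override
      public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
              throws Exception {
          // report the names of all live cores via the CoreContainer
          rsp.add("cores",
              req.getCore().getCoreDescriptor().getCoreContainer().getCoreNames());
      }

      @Override
      public String getDescription() { return "Lists live cores"; }
      @Override
      public String getSourceId() { return ""; }
      @Override
      public String getSource() { return ""; }
      @Override
      public String getVersion() { return ""; }
  }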

I'm sure that the Solr libraries I built against and those I'm running on
are the same version too, as I unzipped the Solr war file and copied the
relevant jars out of there to build against.

Any ideas on what could be causing the ClassCastException? I've attached a
debugger to the running Solr process but it didn't shed any light on the
issue...

Thanks!
James


Re: ClassCastException from custom request handler

2009-08-03 Thread Avlesh Singh
Can you cross check the class attribute for your handler in solrconfig.xml?
My guess is that it is specified as "solr.LiveCoresHandler". It should be
fully qualified class name - com.foo.path.to.LiveCoresHandler instead.
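For instance (the handler path is illustrative):

  <requestHandler name="/livecores" class="com.foo.path.to.LiveCoresHandler" />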

Moreover, I am damn sure that you did not forget to drop your jar into
solr.home/lib. Checking once again might not be a bad idea :)

Cheers
Avlesh

On Mon, Aug 3, 2009 at 9:11 PM, James Brady wrote:

> Hi,
> I'm creating a custom request handler to return a list of live cores in
> Solr.
>
> On startup, I get this exception for each core:
>
> Jul 31, 2009 5:20:39 PM org.apache.solr.common. SolrException log
> SEVERE: java.lang.ClassCastException: LiveCoresHandler
>at
> org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
>at
> org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
>at
>
> org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
>at
>
> org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
>at org.apache.solr.core.SolrCore.(SolrCore.java:444)
>
> I've tried a few variations on the class definition, including extending
> RequestHandlerBase (as suggested here:
>
> http://wiki.apache.org/solr/SolrRequestHandler#head-1de7365d7ecf2eac079c5f8b92ee9af712ed75c2
> )
> and implementing SolrRequestHandler directly.
>
> I'm sure that the Solr libraries I built against and those I'm running on
> are the same version too, as I unzipped the Solr war file and copies the
> relevant jars out of there to build against.
>
> Any ideas on what could be causing the ClassCastException? I've attached a
> debugger to the running Solr process but it didn't shed any light on the
> issue...
>
> Thanks!
> James
>


Re: mergeFactor / indexing speed

2009-08-03 Thread Chantal Ackermann

Hi all,

I'm still struggling with the index performance. I've moved the indexer
to a different machine, now, which is faster and less occupied.

The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
running with those settings (and others):
-server -Xms1G -Xmx7G

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
It has been processing roughly 70k documents in half an hour so far, which 
means at least 1.5 hours for 200k - which is as fast/slow as before (on the 
less performant machine).


The machine is not swapping. It is only using 13% of the memory.
iostat gives me:
 iostat
Linux 2.6.9-67.ELsmp  08/03/2009

avg-cpu:  %user   %nice   %sys  %iowait   %idle
           1.23    0.00   0.03     0.03   98.71

Basically, it is doing very little? *scratch*

The sourcing database is responding as fast as ever. (I checked that 
from my own machine, and did only a ping from the linux box to the db 
server.)


Any help, any hint on where to look would be greatly appreciated.


Thanks!
Chantal


Chantal Ackermann schrieb:

Hi again!

Thanks for the answer, Grant.

 > It could very well be the case that you aren't seeing any merges with
 > only 20K docs.  Ultimately, if you really want to, you can look in
 > your data.dir and count the files.  If you have indexed a lot and have
 > an MF of 100 and haven't done an optimize, you will see a lot more
 > index files.

Do you mean that 20k is not representative enough to test those settings?
I've chosen the smaller data set so that the index can run completely
but doesn't take too long at the same time.
If it would be faster to begin with, I could use a larger data set, of
course. I still can't believe that 11 minutes is normal (I haven't
managed to make it run faster or slower than that, that duration is very
stable).

It "feels kinda" slow to me...
Out of your experience - what would you expect as duration for an index
with:
- 21 fields, some using a text type with 6 filters
- database access using DataImportHandler with a query of (far) less
than 20ms
- 2 transformers

If I knew that indexing time should be shorter than that, at least, I
would know that something is definitely wrong with what I am doing or
with the environment I am using.

 > Likely, but not guaranteed.  Typically, larger merge factors are good
 > for batch indexing, but a lot of that has changed with Lucene's new
 > background merger, such that I don't know if it matters as much anymore.

Ok. I also read some posting where it basically said that the default
parameters are ok. And one shouldn't mess around with them.

The thing is that our current search setup uses Lucene directly, and the
indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
fields are different, the complete setup is different. But it will be
hard to advertise a new implementation/setup where indexing is three
times slower - unless I can give some reasons why that is.

The full index should be fairly fast because the backing data is update
every few hours. I want to put in place an incremental/partial update as
main process, but full indexing might have to be done at certain times
if data has changed completely, or the schema has to be changed/extended.

 > No, those are separate things.  The ramBufferSizeMB (although, I like
 > the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
 > Lucene holds in memory before it has to flush.  MF controls how many
 > segments are on disk

alas! the rum. I had that typo on the commandline before. that's my
subconscious telling me what I should do when I get home, tonight...

So, increasing ramBufferSize should lead to higher memory usage,
shouldn't it? I'm not seeing that. :-(

I'll try once more with MF 10 and a higher rum... well, you know... ;-)

Cheers,
Chantal

Grant Ingersoll schrieb:

On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:


Dear all,

I want to find out which settings give the best full index
performance for my setup.
Therefore, I have been running a small index (less than 20k
documents) with a mergeFactor of 10 and 100.
In both cases, indexing took about 11.5 min:

mergeFactor: 10
0:11:46.792
mergeFactor: 100
/admin/cores?action=RELOAD
0:11:44.441
Tomcat restart
0:11:34.143

This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
ATA disk).


Now, I have three questions:

1. How can I check which mergeFactor is really being used? The
solrconfig.xml that is displayed in the admin application is the up-
to-date view on the file system. I tested that. But it's not
necessarily what the current SOLR core is using, isn't it?
Is there a way to check on the actually used mergeFactor (while the
index is running)?

It could very well be the case that you aren't seeing any merges with
only 20K docs.  Ultimately, if you really want to, you can look in
your data.dir and count the files.  If you have indexed a lot and have
an MF of 100 a

Re: Solr failing on "y" charakter in string?

2009-08-03 Thread Bill Au
I have a Solr text field, and when I use Solr's field analysis tool, it shows
that wildcard queries are being stemmed.  But query results indicate that they
are not.  It looks like there is a bug in the tool.

Bill

On Mon, Aug 3, 2009 at 7:21 AM, gateway0  wrote:

>
> Ok thanks you´re right.
>
> But the thing is my users will often search for expressions like: "Harr" or
> "har" etc..
>
> So I thought I automatically add the wildcard "*" to every request.
>
> If that too gets me into trouble "Harr*"=>no result "harry*"=>no result
>
> What should I do?
>
>
>
> Otis Gospodnetic wrote:
> >
> > I believe it's because wildcard queries are not stemmed.  During indexing
> > "harry" probably got stemmed to "harr", so now "harry*" doesn't match,
> > because there is no "harry" token in that string, only "harr".  Why
> > wildcard queries are not analyzed is described in the Lucene FAQ on the
> > Lucene Wiki.
> >
> > You could also try searching for kunde:Harr* for example (not the
> > upper-case Harr).  I bet it won't result in a hit for the same reason -
> at
> > index time you probably lower-case tokens with LowerCaseFilter(Factory),
> > and if you search for Harr*, the lower-casing won't happen because the
> > query string with the wildcard character isn't analyzed.
> >
> > Otis
> > --
> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >
> >
> >
> > - Original Message 
> >> From: gateway0 
> >> To: solr-user@lucene.apache.org
> >> Sent: Sunday, August 2, 2009 7:30:19 PM
> >> Subject: Solr failing on "y" charakter in string?
> >>
> >>
> >> Hi,
> >>
> >> I have the following setting:
> >> schema.xml:
> >> ""
> >> the "text" field-type was updated with the "preserveOriginal=1" option
> in
> >> the schema
> >>
> >> I have the following string indexd in the field "kunde"
> >> "Harry Heim KG"
> >>
> >> Now when I search for "kunde:harry*" it gives me an empty result.
> >>
> >> When I search for "kunde:harry" I get the right result. Also
> >> "kunde:harr*"
> >> works just fine.
> >>
> >> The strange thing is that with every other string (for example
> >> "kunde:heim*") I will get the right result.
> >>
> >> So why not on "harry*" with an "y*" at the end?
> >>
> >> kind regards, S.
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/Solr-failing-on-%22y%22-charakter-in-string--tp24783211p24783211.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Solr-failing-on-%22y%22-charakter-in-string--tp24783211p24789070.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: mergeFactor / indexing speed

2009-08-03 Thread Avlesh Singh
>
> avg-cpu:  %user   %nice%sys %iowait   %idle
>   1.230.000.030.03   98.71
>
I agree - really bad statistics, actually.

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
>
To me the former appears to be too high and the latter too low (for your machine
configuration). You can safely increase the ramBufferSize (or
maxBufferedDocs) to a higher value.

Couple of things -

   1. The stock solrconfig.xml comes with two sections, <indexDefaults> and
   <mainIndex>. Options in the latter override the former. Just make sure that
   you have the right values in the right place.
   2. Do you have too many nested entities inside the DIH's data-config? If
   yes, a database level optimization (creating views, in-memory tables ...)
   might hold the answer.
   3. Tried playing around with JDBC parameters in the data source? Setting
   the "batchSize" property to a considerable value might help (a rough
   illustration of points 1 and 3 follows below).

Cheers
Avlesh

On Mon, Aug 3, 2009 at 10:02 PM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:

> Hi all,
>
> I'm still struggling with the index performance. I've moved the indexer
> to a different machine, now, which is faster and less occupied.
>
> The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
> running with those settings (and others):
> -server -Xms1G -Xmx7G
>
> Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
> It has been processing roughly 70k documents in half an hour, so far. Which
> means 1,5 hours at least for 200k - which is as fast/slow as before (on the
> less performant machine).
>
> The machine is not swapping. It is only using 13% of the memory.
> iostat gives me:
>  iostat
> Linux 2.6.9-67.ELsmp  08/03/2009
>
> avg-cpu:  %user   %nice%sys %iowait   %idle
>   1.230.000.030.03   98.71
>
> Basically, it is doing very little? *scratch*
>
> The sourcing database is responding as fast as ever. (I checked that from
> my own machine, and did only a ping from the linux box to the db server.)
>
> Any help, any hint on where to look would be greatly appreciated.
>
>
> Thanks!
> Chantal
>
>
> Chantal Ackermann schrieb:
>
>> Hi again!
>>
>> Thanks for the answer, Grant.
>>
>>  > It could very well be the case that you aren't seeing any merges with
>>  > only 20K docs.  Ultimately, if you really want to, you can look in
>>  > your data.dir and count the files.  If you have indexed a lot and have
>>  > an MF of 100 and haven't done an optimize, you will see a lot more
>>  > index files.
>>
>> Do you mean that 20k is not representative enough to test those settings?
>> I've chosen the smaller data set so that the index can run completely
>> but doesn't take too long at the same time.
>> If it would be faster to begin with, I could use a larger data set, of
>> course. I still can't believe that 11 minutes is normal (I haven't
>> managed to make it run faster or slower than that, that duration is very
>> stable).
>>
>> It "feels kinda" slow to me...
>> Out of your experience - what would you expect as duration for an index
>> with:
>> - 21 fields, some using a text type with 6 filters
>> - database access using DataImportHandler with a query of (far) less
>> than 20ms
>> - 2 transformers
>>
>> If I knew that indexing time should be shorter than that, at least, I
>> would know that something is definitely wrong with what I am doing or
>> with the environment I am using.
>>
>>  > Likely, but not guaranteed.  Typically, larger merge factors are good
>>  > for batch indexing, but a lot of that has changed with Lucene's new
>>  > background merger, such that I don't know if it matters as much
>> anymore.
>>
>> Ok. I also read some posting where it basically said that the default
>> parameters are ok. And one shouldn't mess around with them.
>>
>> The thing is that our current search setup uses Lucene directly, and the
>> indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
>> fields are different, the complete setup is different. But it will be
>> hard to advertise a new implementation/setup where indexing is three
>> times slower - unless I can give some reasons why that is.
>>
>> The full index should be fairly fast because the backing data is update
>> every few hours. I want to put in place an incremental/partial update as
>> main process, but full indexing might have to be done at certain times
>> if data has changed completely, or the schema has to be changed/extended.
>>
>>  > No, those are separate things.  The ramBufferSizeMB (although, I like
>>  > the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
>>  > Lucene holds in memory before it has to flush.  MF controls how many
>>  > segments are on disk
>>
>> alas! the rum. I had that typo on the commandline before. that's my
>> subconscious telling me what I should do when I get home, tonight...
>>
>> So, increasing ramBufferSize should lead to higher memory usage,
>> shouldn't it? I'm not seeing that. :-(
>>
>> I'll try once more with MF 10 and a higher rum

Re: mergeFactor / indexing speed

2009-08-03 Thread Otis Gospodnetic
Hi,

I'd have to poke around the machine(s) to give you better guidance, but here is 
some initial feedback:

- mergeFactor of 1000 seems crazy.  mergeFactor is probably not your problem.  
I'd go back to default of 10.
- 256 MB for ramBufferSizeMB sounds OK.
- pinging the DB won't tell you much about the DB server's performance - ssh to 
the machine and check its CPU load, memory usage, disk IO

Other things to look into:
- Network as the bottleneck?
- Field analysis as the bottleneck?


Otis 
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
> From: Chantal Ackermann 
> To: "solr-user@lucene.apache.org" 
> Sent: Monday, August 3, 2009 12:32:12 PM
> Subject: Re: mergeFactor / indexing speed
> 
> Hi all,
> 
> I'm still struggling with the index performance. I've moved the indexer
> to a different machine, now, which is faster and less occupied.
> 
> The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
> running with those settings (and others):
> -server -Xms1G -Xmx7G
> 
> Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
> It has been processing roughly 70k documents in half an hour, so far. 
> Which means 1,5 hours at least for 200k - which is as fast/slow as 
> before (on the less performant machine).
> 
> The machine is not swapping. It is only using 13% of the memory.
> iostat gives me:
>   iostat
> Linux 2.6.9-67.ELsmp  08/03/2009
> 
> avg-cpu:  %user   %nice%sys %iowait   %idle
> 1.230.000.030.03   98.71
> 
> Basically, it is doing very little? *scratch*
> 
> The sourcing database is responding as fast as ever. (I checked that 
> from my own machine, and did only a ping from the linux box to the db 
> server.)
> 
> Any help, any hint on where to look would be greatly appreciated.
> 
> 
> Thanks!
> Chantal
> 
> 
> Chantal Ackermann schrieb:
> > Hi again!
> >
> > Thanks for the answer, Grant.
> >
> >  > It could very well be the case that you aren't seeing any merges with
> >  > only 20K docs.  Ultimately, if you really want to, you can look in
> >  > your data.dir and count the files.  If you have indexed a lot and have
> >  > an MF of 100 and haven't done an optimize, you will see a lot more
> >  > index files.
> >
> > Do you mean that 20k is not representative enough to test those settings?
> > I've chosen the smaller data set so that the index can run completely
> > but doesn't take too long at the same time.
> > If it would be faster to begin with, I could use a larger data set, of
> > course. I still can't believe that 11 minutes is normal (I haven't
> > managed to make it run faster or slower than that, that duration is very
> > stable).
> >
> > It "feels kinda" slow to me...
> > Out of your experience - what would you expect as duration for an index
> > with:
> > - 21 fields, some using a text type with 6 filters
> > - database access using DataImportHandler with a query of (far) less
> > than 20ms
> > - 2 transformers
> >
> > If I knew that indexing time should be shorter than that, at least, I
> > would know that something is definitely wrong with what I am doing or
> > with the environment I am using.
> >
> >  > Likely, but not guaranteed.  Typically, larger merge factors are good
> >  > for batch indexing, but a lot of that has changed with Lucene's new
> >  > background merger, such that I don't know if it matters as much anymore.
> >
> > Ok. I also read some posting where it basically said that the default
> > parameters are ok. And one shouldn't mess around with them.
> >
> > The thing is that our current search setup uses Lucene directly, and the
> > indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
> > fields are different, the complete setup is different. But it will be
> > hard to advertise a new implementation/setup where indexing is three
> > times slower - unless I can give some reasons why that is.
> >
> > The full index should be fairly fast because the backing data is update
> > every few hours. I want to put in place an incremental/partial update as
> > main process, but full indexing might have to be done at certain times
> > if data has changed completely, or the schema has to be changed/extended.
> >
> >  > No, those are separate things.  The ramBufferSizeMB (although, I like
> >  > the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many docs
> >  > Lucene holds in memory before it has to flush.  MF controls how many
> >  > segments are on disk
> >
> > alas! the rum. I had that typo on the commandline before. that's my
> > subconscious telling me what I should do when I get home, tonight...
> >
> > So, increasing ramBufferSize should lead to higher memory usage,
> > shouldn't it? I'm not seeing that. :-(
> >
> > I'll try once more with MF 10 and a higher rum... well, you know... ;-)
> >
> > Cheers,
> > Chantal
> >
> > Grant Ingersoll schrieb:
> >> On Jul 31, 2009, at 8:04 AM

Re: Solr failing on "y" charakter in string?

2009-08-03 Thread Avlesh Singh
>
> I have a Solr text field and when I use Solr's field analysis tool, it
> shows that wildcard queries are being stemmed.  But query results indicate
> that it is not.  It looks like there is a bug in the tool.
>
I am in agreement. Seems like a bug to me.

Cheers
Avlesh

On Mon, Aug 3, 2009 at 10:19 PM, Bill Au  wrote:

> I have a Solr text field and when I use Solr's field analysis tool, it
> shows
> that wildcard queries are being stemmed.  But query results indicate that
> it
> is not.  It looks like there is a bug in the tool.
>
> Bill
>
> On Mon, Aug 3, 2009 at 7:21 AM, gateway0  wrote:
>
> >
> > Ok thanks you´re right.
> >
> > But the thing is my users will often search for expressions like: "Harr"
> or
> > "har" etc..
> >
> > So I thought I automatically add the wildcard "*" to every request.
> >
> > If that too gets me into trouble "Harr*"=>no result "harry*"=>no result
> >
> > What should I do?
> >
> >
> >
> > Otis Gospodnetic wrote:
> > >
> > > I believe it's because wildcard queries are not stemmed.  During
> indexing
> > > "harry" probably got stemmed to "harr", so now "harry*" doesn't match,
> > > because there is no "harry" token in that string, only "harr".  Why
> > > wildcard queries are not analyzed is described in the Lucene FAQ on the
> > > Lucene Wiki.
> > >
> > > You could also try searching for kunde:Harr* for example (not the
> > > upper-case Harr).  I bet it won't result in a hit for the same reason -
> > at
> > > index time you probably lower-case tokens with
> LowerCaseFilter(Factory),
> > > and if you search for Harr*, the lower-casing won't happen because the
> > > query string with the wildcard character isn't analyzed.
> > >
> > > Otis
> > > --
> > > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> > > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> > >
> > >
> > >
> > > - Original Message 
> > >> From: gateway0 
> > >> To: solr-user@lucene.apache.org
> > >> Sent: Sunday, August 2, 2009 7:30:19 PM
> > >> Subject: Solr failing on "y" charakter in string?
> > >>
> > >>
> > >> Hi,
> > >>
> > >> I have the following setting:
> > >> schema.xml:
> > >> ""
> > >> the "text" field-type was updated with the "preserveOriginal=1" option
> > in
> > >> the schema
> > >>
> > >> I have the following string indexd in the field "kunde"
> > >> "Harry Heim KG"
> > >>
> > >> Now when I search for "kunde:harry*" it gives me an empty result.
> > >>
> > >> When I search for "kunde:harry" I get the right result. Also
> > >> "kunde:harr*"
> > >> works just fine.
> > >>
> > >> The strange thing is that with every other string (for example
> > >> "kunde:heim*") I will get the right result.
> > >>
> > >> So why not on "harry*" with an "y*" at the end?
> > >>
> > >> kind regards, S.
> > >> --
> > >> View this message in context:
> > >>
> >
> http://www.nabble.com/Solr-failing-on-%22y%22-charakter-in-string--tp24783211p24783211.html
> > >> Sent from the Solr - User mailing list archive at Nabble.com.
> > >
> > >
> > >
> >
> > --
> > View this message in context:
> >
> http://www.nabble.com/Solr-failing-on-%22y%22-charakter-in-string--tp24783211p24789070.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
> >
>


Re: ClassCastException from custom request handler

2009-08-03 Thread James Brady
Hi,
Thanks for your suggestions!

I'm sure I have the class name right - changing it to something patently
incorrect results in the expected
"org.apache.solr.common.SolrException: Error loading class ...", rather than
the ClassCastException.

I did have some problems getting my class on the app server's classpath. I'm
running with solr.home set to "multicore", but creating a multicore/lib
directory and putting my request handler class in there resulted in "Error
loading class" errors.

I found that setting jetty.class.path to include multicore/lib (and also
explicitly point at Solr's core and common JARs) fixed the "Error loading
class" errors, leaving these ClassCastExceptions...

2009/8/3 Avlesh Singh 

> Can you cross check the class attribute for your handler in solrconfig.xml?
> My guess is that it is specified as "solr.LiveCoresHandler". It should be
> fully qualified class name - com.foo.path.to.LiveCoresHandler instead.
>
> Moreover, I am damn sure that you did not forget to drop your jar into
> solr.home/lib. Checking once again might not be a bad idea :)
>
> Cheers
> Avlesh
>
> On Mon, Aug 3, 2009 at 9:11 PM, James Brady  >wrote:
>
> > Hi,
> > I'm creating a custom request handler to return a list of live cores in
> > Solr.
> >
> > On startup, I get this exception for each core:
> >
> > Jul 31, 2009 5:20:39 PM org.apache.solr.common.SolrException log
> > SEVERE: java.lang.ClassCastException: LiveCoresHandler
> >at
> > org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
> >at
> > org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
> >at
> >
> >
> org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
> >at
> >
> >
> org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
> >at org.apache.solr.core.SolrCore.<init>(SolrCore.java:444)
> >
> > I've tried a few variations on the class definition, including extending
> > RequestHandlerBase (as suggested here:
> >
> >
> http://wiki.apache.org/solr/SolrRequestHandler#head-1de7365d7ecf2eac079c5f8b92ee9af712ed75c2
> > )
> > and implementing SolrRequestHandler directly.
> >
> > I'm sure that the Solr libraries I built against and those I'm running on
> > are the same version too, as I unzipped the Solr war file and copies the
> > relevant jars out of there to build against.
> >
> > Any ideas on what could be causing the ClassCastException? I've attached
> a
> > debugger to the running Solr process but it didn't shed any light on the
> > issue...
> >
> > Thanks!
> > James
> >
>



-- 
http://twitter.com/goodgravy
512 300 4210
http://webmynd.com/
Sent from Bury, United Kingdom


RE: 99.9% uptime requirement

2009-08-03 Thread Robert Petersen
So then would the 'right' thing to do be to run it under something like
Daemontools so it bounces back up on a crash?  Do any other people use
this approach or is there something better to make it come back up?

Speaking of overly large caches, if I have solr running on a machine
with 8GB main memory is it going to hurt to make some huge cache sizes?
Are these settings reasonable?  With a small index I have been getting
some great hit-rates.
1024





Thanks
Robi

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Friday, July 31, 2009 11:37 PM
To: solr-user@lucene.apache.org
Subject: Re: 99.9% uptime requirement

Robi,

Solr is indeed very stable.  However, it can crash and I've seen it
crash.  Or rather, I should say I've seen the JVM that runs Solr crash.
For instance, if you have a servlet container with a number of webapps,
one of which is Solr, and one of which has a memory leak, I believe all
webapps will suffer and "crash".  And even if you have just Solr in your
servlet container, it can OOM, say if you specify overly large caches or
too frequent commits, etc.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
> From: Robert Petersen 
> To: solr-user@lucene.apache.org
> Sent: Friday, July 31, 2009 12:18:55 PM
> Subject: 99.9% uptime requirement
> 
> Hi all,
> 
> My solr project powers almost all the pages in our site and so needs
to
> be up period.  My question is what can I do to ensure that happens?
> Does solr ever crash, assuming reasonable load conditions and no
extreme
> index sizes?
> 
> I saw some comments about running solr under daemontools in order to
get
> an auto-restart on crashes.  From what I have seen so far in my
limited
> experience, solr is very stable and never crashes (so far).  Does
anyone
> else have this requirement and if so how do they deal with it?  Is
> anyone else running solr under daemontools in a production site?
> 
> Thanks for any input you might have,
> Robi



Re: 99.9% uptime requirement

2009-08-03 Thread Walter Underwood
For 99.9%, run three copies behind a load balancer. That allows you to  
take one down for upgrade, and still be fault-tolerant.


wunder

On Aug 3, 2009, at 10:46 AM, Robert Petersen wrote:

So then would the 'right' thing to do be to run it under something  
like

Daemontools so it bounces back up on a crash?  Do any other people use
this approach or is there something better to make it come back up?

Speaking of overly large caches, if I have solr running on a machine
with 8GB main memory is it going to hurt to make some huge cache  
sizes?

Are these settings reasonable?  With a small index I have been getting
some great hit-rates.
1024





Thanks
Robi

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: Friday, July 31, 2009 11:37 PM
To: solr-user@lucene.apache.org
Subject: Re: 99.9% uptime requirement

Robi,

Solr is indeed very stable.  However, it can crash and I've seen it
crash.  Or rather, I should say I've seen the JVM that runs Solr  
crash.
For instance, if you have a servlet container with a number of  
webapps,
one of which is Solr, and one of which has a memory leak, I believe  
all
webapps will suffer and "crash".  And even if you have just Solr in  
your
servlet container, it can OOM, say if you specify overly large  
caches or

too frequent commits, etc.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 

From: Robert Petersen 
To: solr-user@lucene.apache.org
Sent: Friday, July 31, 2009 12:18:55 PM
Subject: 99.9% uptime requirement

Hi all,

My solr project powers almost all the pages in our site and so needs

to

be up period.  My question is what can I do to ensure that happens?
Does solr ever crash, assuming reasonable load conditions and no

extreme

index sizes?

I saw some comments about running solr under daemontools in order to

get

an auto-restart on crashes.  From what I have seen so far in my

limited

experience, solr is very stable and never crashes (so far).  Does

anyone

else have this requirement and if so how do they deal with it?  Is
anyone else running solr under daemontools in a production site?

Thanks for any input you might have,
Robi






Re: 99.9% uptime requirement

2009-08-03 Thread Rafał Kuć
Hello!

Robert, from my experience with Solr (since 1.2, and running a few 1.4 
deployments), Solr does not need any mechanism to auto-restart after a crash, 
because I have not seen it crash through any fault of its own. Just make sure 
you have more than one instance of Solr, and run them behind a proxy or load 
balancer of some kind. 

-- 
Regards,
Rafał Kuć

> So then would the 'right' thing to do be to run it under something like
> Daemontools so it bounces back up on a crash? Do any other people use
> this approach or is there something better to make it come back up?

> Speaking of overly large caches, if I have solr running on a machine
> with 8GB main memory is it going to hurt to make some huge cache sizes?
> Are these settings reasonable? With a small index I have been getting
> some great hit-rates.
> 1024

>  initialSize="512" autowarmCount="80"/>
>  initialSize="512" autowarmCount="80"/>
>  initialSize="512" autowarmCount="0"/>

> Thanks
> Robi

> -Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
> Sent: Friday, July 31, 2009 11:37 PM
> To: solr-user@lucene.apache.org
> Subject: Re: 99.9% uptime requirement

> Robi,

> Solr is indeed very stable. However, it can crash and I've seen it
> crash. Or rather, I should say I've seen the JVM that runs Solr crash.
> For instance, if you have a servlet container with a number of webapps,
> one of which is Solr, and one of which has a memory leak, I believe all
> webapps will suffer and "crash". And even if you have just Solr in your
> servlet container, it can OOM, say if you specify overly large caches or
> too frequent commits, etc.

> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



> - Original Message 
>> From: Robert Petersen 
>> To: solr-user@lucene.apache.org
>> Sent: Friday, July 31, 2009 12:18:55 PM
>> Subject: 99.9% uptime requirement

>> Hi all,

>> My solr project powers almost all the pages in our site and so needs
> to
>> be up period. My question is what can I do to ensure that happens?
>> Does solr ever crash, assuming reasonable load conditions and no
> extreme
>> index sizes?

>> I saw some comments about running solr under daemontools in order to
> get
>> an auto-restart on crashes. From what I have seen so far in my
> limited
>> experience, solr is very stable and never crashes (so far). Does
> anyone
>> else have this requirement and if so how do they deal with it? Is
>> anyone else running solr under daemontools in a production site?

>> Thanks for any input you might have,
>> Robi



Re: Questions on FieldValueCache

2009-08-03 Thread Stephen Duncan Jr
On Fri, Jul 31, 2009 at 5:23 PM, Yonik Seeley wrote:

> On Fri, Jul 31, 2009 at 5:06 PM, Stephen Duncan
> Jr wrote:
> > I have a couple more questions on the FieldValueCache.  I see that the
> > number of items in the cache is basically the number of multi-valued
> fields
> > facets have been requested for.  What does each entry in the cache
> actually
> > contain?  How does it's size grow as the number of total documents
> > increases?
>
> It's basically an array of int[maxDoc] that contain the list of
> delta-coded vint values or optionally point out to shared byte arrays
> if the list of values don't fit in an int.  See the javadoc for
> UnInvertedField for more details.
>
> -Yonik
> http://www.lucidimagination.com
>

I'm getting the following warning in my logs: 2009-08-03 13:41:40,114
[http-127.0.0.1-8080-1] WARN  org.apache.solr.core.SolrCore - Approaching
too many values for UnInvertedField faceting on field 'originaltext' :
bucket size=15802492

What's the impact of that?  If the number of values (number of unique terms
for that field, or some other "values"?) exceeds that limit, will faceting
for that field go back to a different technique and be slower, or...?

-- 
Stephen Duncan Jr
www.stephenduncanjr.com


Re: Questions on FieldValueCache

2009-08-03 Thread Yonik Seeley
On Mon, Aug 3, 2009 at 2:18 PM, Stephen Duncan
Jr wrote:
> On Fri, Jul 31, 2009 at 5:23 PM, Yonik Seeley 
> wrote:
>
>> On Fri, Jul 31, 2009 at 5:06 PM, Stephen Duncan
>> Jr wrote:
>> > I have a couple more questions on the FieldValueCache.  I see that the
>> > number of items in the cache is basically the number of multi-valued
>> fields
>> > facets have been requested for.  What does each entry in the cache
>> actually
>> > contain?  How does it's size grow as the number of total documents
>> > increases?
>>
>> It's basically an array of int[maxDoc] that contain the list of
>> delta-coded vint values or optionally point out to shared byte arrays
>> if the list of values don't fit in an int.  See the javadoc for
>> UnInvertedField for more details.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>
> I'm getting the following warning in my logs: 2009-08-03 13:41:40,114
> [http-127.0.0.1-8080-1] WARN  org.apache.solr.core.SolrCore - Approaching
> too many values for UnInvertedField faceting on field 'originaltext' :
> bucket size=15802492
>
> What's the impact of that?  If the number of values (number of unique terms
> for that field, or some other "values"?) exceeds that limit, will faceting
> for that field go back to a different technique and be slower, or...?

It will throw an exception.

This method of faceting wasn't really designed for big full-text fields.
The enum method should work better for this... try something like the following:

f.originaltext.facet.method=enum
facet.enum.cache.minDf=1
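
(For example, a complete request combining these options might look something
like the one below; the hostname, core path and q value are only placeholders,
with 'originaltext' being the field from this thread.)

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=originaltext&f.originaltext.facet.method=enum&facet.enum.cache.minDf=1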

-Yonik
http://www.lucidimagination.com


Re: Boosting for most recent documents

2009-08-03 Thread vivek sar
Hi,

 Related question to "getting the latest records first". After trying
few suggested ways (function query, index time boosting) of getting
the latest first I settled for simple "sort" parameter,

 sort=field+asc

As per wiki, http://wiki.apache.org/solr/SchemaDesign?highlight=(sort),

Lucene would cache "4 bytes * the number of documents" plus unique
terms for the sorted field in fieldcache. This is done so subsequent
sort requests can be retrieved from cache. So the memory usage if I
got 1 billion records in one Indexer instance, for ex,

1) 1 billion records
2) sort on time stamp field (rounded to hour) - for 1 year - 8760
unique terms. (negligible)
3) Total memory requirement  for sorting on this single field would be
around  1G * 4 = 4GB

So, if I run only one sort query once in a day there would still be
4GB required at all time. Is there any way to tell Solr/Lucene to
release the memory once the query has been run? Basically I don't want
cache. I've commented out all the cache parameters in the
solrconfig.xml, but I still see the very first time I run the sort
query the memory jumps by 4 G and remains there.

Is there any way so Lucene/Solr doesn't use so much memory for sorting
so my application can scale (sorting memory requirement won't be
function of number of documents)?

Thanks,
-vivek





On Thu, Jul 16, 2009 at 3:10 PM, Chris
Hostetter wrote:
>
> :   Does anyone know if Solr supports sorting by internal document ids,
> : i.e, like Sort.INDEXORDER in Lucene? If so, how?
>
> It does not.  in Solr the decisison to make "score desc" the default
> search ment there is no way to request simple docId ordering.
>
> : Also, if anyone have any insight on if function query loads up unique
> : terms (like field sorts) in memory or not.
>
> It uses the exact same FieldCache as sorting.
>
>
>
>
> -Hoss
>


Re: mergeFactor / indexing speed

2009-08-03 Thread Grant Ingersoll
How big are your documents?  I haven't benchmarked DIH, so I am not  
sure what to expect, but it does seem like something isn't right.  Can  
you fully describe how you are indexing?  Have you done any profiling?


On Aug 3, 2009, at 12:32 PM, Chantal Ackermann wrote:


Hi all,

I'm still struggling with the index performance. I've moved the  
indexer

to a different machine, now, which is faster and less occupied.

The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
running with those settings (and others):
-server -Xms1G -Xmx7G

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
It has been processing roughly 70k documents in half an hour, so  
far. Which means 1,5 hours at least for 200k - which is as fast/slow  
as before (on the less performant machine).


The machine is not swapping. It is only using 13% of the memory.
iostat gives me:
iostat
Linux 2.6.9-67.ELsmp  08/03/2009

avg-cpu:  %user   %nice   %sys  %iowait   %idle
           1.23    0.00   0.03     0.03   98.71

Basically, it is doing very little? *scratch*

The sourcing database is responding as fast as ever. (I checked that  
from my own machine, and did only a ping from the linux box to the  
db server.)


Any help, any hint on where to look would be greatly appreciated.


Thanks!
Chantal


Chantal Ackermann schrieb:

Hi again!

Thanks for the answer, Grant.

> It could very well be the case that you aren't seeing any merges  
with

> only 20K docs.  Ultimately, if you really want to, you can look in
> your data.dir and count the files.  If you have indexed a lot and  
have

> an MF of 100 and haven't done an optimize, you will see a lot more
> index files.

Do you mean that 20k is not representative enough to test those  
settings?

I've chosen the smaller data set so that the index can run completely
but doesn't take too long at the same time.
If it would be faster to begin with, I could use a larger data set,  
of

course. I still can't believe that 11 minutes is normal (I haven't
managed to make it run faster or slower than that, that duration is  
very

stable).

It "feels kinda" slow to me...
Out of your experience - what would you expect as duration for an  
index

with:
- 21 fields, some using a text type with 6 filters
- database access using DataImportHandler with a query of (far) less
than 20ms
- 2 transformers

If I knew that indexing time should be shorter than that, at least, I
would know that something is definitely wrong with what I am doing or
with the environment I am using.

> Likely, but not guaranteed.  Typically, larger merge factors are  
good

> for batch indexing, but a lot of that has changed with Lucene's new
> background merger, such that I don't know if it matters as much  
anymore.


Ok. I also read some posting where it basically said that the default
parameters are ok. And one shouldn't mess around with them.

The thing is that our current search setup uses Lucene directly,  
and the

indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
fields are different, the complete setup is different. But it will be
hard to advertise a new implementation/setup where indexing is three
times slower - unless I can give some reasons why that is.

The full index should be fairly fast because the backing data is  
update
every few hours. I want to put in place an incremental/partial  
update as
main process, but full indexing might have to be done at certain  
times
if data has changed completely, or the schema has to be changed/ 
extended.


> No, those are separate things.  The ramBufferSizeMB (although, I  
like
> the thought of a "rum"BufferSizeMB too!  ;-)  ) controls how many  
docs
> Lucene holds in memory before it has to flush.  MF controls how  
many

> segments are on disk

alas! the rum. I had that typo on the commandline before. that's my
subconscious telling me what I should do when I get home, tonight...

So, increasing ramBufferSize should lead to higher memory usage,
shouldn't it? I'm not seeing that. :-(

I'll try once more with MF 10 and a higher rum... well, you  
know... ;-)


Cheers,
Chantal

Grant Ingersoll schrieb:

On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:


Dear all,

I want to find out which settings give the best full index
performance for my setup.
Therefore, I have been running a small index (less than 20k
documents) with a mergeFactor of 10 and 100.
In both cases, indexing took about 11.5 min:

mergeFactor: 10
0:11:46.792
mergeFactor: 100
/admin/cores?action=RELOAD
0:11:44.441
Tomcat restart
0:11:34.143

This is a Tomcat 5.5.20, started with a max heap size of 1GB. But  
it
always used much less. No swapping (RedHat Linux 32bit, 3GB RAM,  
old

ATA disk).


Now, I have three questions:

1. How can I check which mergeFactor is really being used? The
solrconfig.xml that is displayed in the admin application is the  
up-

to-date view on the file system. I tested that. But it's not
necessarily what the current SOLR core is using, is

Re: Boosting for most recent documents

2009-08-03 Thread Yonik Seeley
On Mon, Aug 3, 2009 at 2:46 PM, vivek sar wrote:
> So, if I run only one sort query once in a day there would still be
> 4GB required at all time. Is there any way to tell Solr/Lucene to
> release the memory once the query has been run? Basically I don't want
> cache. I've commented out all the cache parameters in the
> solrconfig.xml, but I still see the very first time I run the sort
> query the memory jumps by 4 G and remains there.

There is currently no way to tell Lucene not to cache the FieldCache
entry it uses for sorting.
If you call commit though, a new searcher will be opened and the
memory should be released.
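
If it helps, a minimal SolrJ sketch of that workaround might look like the
following (the URL is just the stock example location, and issuing an empty
commit purely to reopen the searcher is heavy-handed, but it illustrates the
point):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ReleaseSortCache {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Reopening the searcher via a commit drops the reference to the old
        // IndexReader, so its FieldCache entries become eligible for GC.
        solr.commit();
    }
}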

-Yonik
http://www.lucidimagination.com


Solr Search probem w/ phrase searches, text type, w/ escaped characters

2009-08-03 Thread Peter Keane
Greetings--

I am having a problem with a search that has stumped me.  I have a field
(called "source") with the following value:

Miguel : three dimensions : [Exhibition]

when I perform this search:

source:"Miguel \: three dimensions \: \[Exhibition\]"

it fails to match.

Yet when I change the value to:

Miguel : three dimensions : [Exhibitio

the following search matches:

source:"Miguel \: three dimensions \: \[Exhibitio"

This is a text field type w/ schema.xml:



  



  

  



  



I've tried lots of variations with combinations of characters that require
escaping and have not found a pattern as to why it sometimes works and
sometimes not.  (Note the quotations marks -- these are phrase searches).  I
had other problems that kept me from using a string type for these fields.

--peter keane


RE: 99.9% uptime requirement

2009-08-03 Thread Robert Petersen
Thanks all, I figured there would be more talk about daemontools if there were 
really a need.  I appreciate the input and for starters we'll put two slaves 
behind a load balancer and grow it from there.

Lovin' Solr So Far!  We were using alta vista as our search engine... it was 
sooo 90's!  haha

Thanks again,
Robi

-Original Message-
From: Rafał Kuć [mailto:ra...@alud.com.pl] 
Sent: Monday, August 03, 2009 11:00 AM
To: solr-user@lucene.apache.org
Subject: Re: 99.9% uptime requirement

Hello!

Robert, from my experience with Solr (since 1.2, and running a few 1.4 
deployments), Solr does not need any mechanism to auto-restart after a crash, 
because I have not seen it crash through any fault of its own. Just make sure 
you have more than one instance of Solr, and run them behind a proxy or load 
balancer of some kind. 

-- 
Regards,
Rafał Kuć

> So then would the 'right' thing to do be to run it under something like
> Daemontools so it bounces back up on a crash? Do any other people use
> this approach or is there something better to make it come back up?

> Speaking of overly large caches, if I have solr running on a machine
> with 8GB main memory is it going to hurt to make some huge cache sizes?
> Are these settings reasonable? With a small index I have been getting
> some great hit-rates.
> 1024

>  initialSize="512" autowarmCount="80"/>
>  initialSize="512" autowarmCount="80"/>
>  initialSize="512" autowarmCount="0"/>

> Thanks
> Robi

> -Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
> Sent: Friday, July 31, 2009 11:37 PM
> To: solr-user@lucene.apache.org
> Subject: Re: 99.9% uptime requirement

> Robi,

> Solr is indeed very stable. However, it can crash and I've seen it
> crash. Or rather, I should say I've seen the JVM that runs Solr crash.
> For instance, if you have a servlet container with a number of webapps,
> one of which is Solr, and one of which has a memory leak, I believe all
> webapps will suffer and "crash". And even if you have just Solr in your
> servlet container, it can OOM, say if you specify overly large caches or
> too frequent commits, etc.

> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



> - Original Message 
>> From: Robert Petersen 
>> To: solr-user@lucene.apache.org
>> Sent: Friday, July 31, 2009 12:18:55 PM
>> Subject: 99.9% uptime requirement

>> Hi all,

>> My solr project powers almost all the pages in our site and so needs
> to
>> be up period. My question is what can I do to ensure that happens?
>> Does solr ever crash, assuming reasonable load conditions and no
> extreme
>> index sizes?

>> I saw some comments about running solr under daemontools in order to
> get
>> an auto-restart on crashes. From what I have seen so far in my
> limited
>> experience, solr is very stable and never crashes (so far). Does
> anyone
>> else have this requirement and if so how do they deal with it? Is
>> anyone else running solr under daemontools in a production site?

>> Thanks for any input you might have,
>> Robi



Re: Questions on FieldValueCache

2009-08-03 Thread Stephen Duncan Jr
On Mon, Aug 3, 2009 at 2:43 PM, Yonik Seeley wrote:

>
> >
> > I'm getting the following warning in my logs: 2009-08-03 13:41:40,114
> > [http-127.0.0.1-8080-1] WARN  org.apache.solr.core.SolrCore - Approaching
> > too many values for UnInvertedField faceting on field 'originaltext' :
> > bucket size=15802492
> >
> > What's the impact of that?  If the number of values (number of unique
> terms
> > for that field, or some other "values"?) exceeds that limit, will
> faceting
> > for that field go back to a different technique and be slower, or...?
>
> It will throw an exception.
>
> This method of faceting wasn't really designed for big full-text fields.
> The enum method should work better for this... try something like the
> following:
>
> f.originaltext.facet.method=enum
> facet.enum.cache.minDf=1
>
> -Yonik
> http://www.lucidimagination.com
>

Hmm, that's a hard thing to sell to the user and my boss, as it makes the
query time go from nearly always being sub-second (frequently less than 60
ms), to ranging up to nearly 4 seconds for a new query not already in the
cache.  (My test was with 100 facets being requested, which may be
reasonable, as one reason to facet on a full-text field is to provide a dynamic
word cloud).

How can I mitigate the time it takes with the enum method?  Do I need to ask
for more facet values in my facet-warming query (I set facet.limit to 1 as
it didn't seem to matter to the FieldValueCache)? And/Or do I need to up the
autowarmCount on the FilterCache?  If speed is the primary concern vs
memory, should I bother with the minDf setting?

I guess I should update my code to use the enum method on all the fields
that are likely to risk crossing this line.  Should I be looking at the
termInstances property on the fields that are displayed in the
FieldValueCache on the stats page, and figuring those on the order of 10
million are likely to grow past the limit?

-- 
Stephen Duncan Jr
www.stephenduncanjr.com


DIH: Any way to make update on db table?

2009-08-03 Thread Jay Hill
Is it possible for the DataImportHandler to update records in the table it
is querying? For example, say I have a query like this in my entity:

query="select field1, field2, from someTable where hasBeenIndexed=false"

Is there a way I can mark each record processed by updating the
hasBeenIndexed field? Here's a config I tried:





  

  
  

  
  

  


It does update the first record, but then an Exception is thrown:
Aug 3, 2009 1:15:24 PM org.apache.solr.handler.dataimport.DocBuilder
buildDocument
SEVERE: Exception while processing: mainEntity document :
SolrInputDocument[{id=id(1.0)={1}, name=name(1.0)={John Jones}}]
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
execute query: update tableToIndex set hasBeenIndexed=1 where id=1
Processing Document # 1
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:250)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:207)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:40)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:344)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:370)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:225)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
Caused by: java.lang.NullPointerException
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:248)
... 12 more


-Jay


Re: Solr Search probem w/ phrase searches, text type, w/ escaped characters

2009-08-03 Thread Peter Keane
I've used Luke to figure out what is going on, and I see in the fields that
fail to match, a "null_1".  Could someone tell me what that is?  I see some
null_100s there as well, which seem to separate field values.  Clearly the
null_1s are causing the search to fail.

--peter keane

On Mon, Aug 3, 2009 at 2:56 PM, Peter Keane  wrote:

> Greetings--
>
> I am having a problem with a search that has stumped me.  I have a field
> (called "source") with the following value:
>
> Miguel : three dimensions : [Exhibition]
>
> when I perform this search:
>
> source:"Miguel \: three dimensions \: \[Exhibition\]"
>
> it fails to match.
>
> Yet when I change the value to:
>
> Miguel : three dimensions : [Exhibitio
>
> the following search matches:
>
> source:"Miguel \: three dimensions \: \[Exhibitio"
>
> This is a text field type w/ schema.xml:
>
> 
>
>   
> 
>  generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> 
>   
>
>   
> 
>  generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="1"/>
> 
>   
>
> 
>
> I've tried lots of variations with combinations of characters that require
> escaping and have not found a pattern as to why it sometimes works and
> sometimes not.  (Note the quotations marks -- these are phrase searches).  I
> had other problems that kept me from using a string type for these fields.
>
> --peter keane
>


Picking Facet Fields by Frequency-in-Results

2009-08-03 Thread Chris Harris
One task when designing a facet-based UI is deciding which fields to
facet on and display facets for. One possibility that I hope to
explore is to determine which fields to facet on dynamically, based on
the search results. In particular, I hypothesize that, for a somewhat
heterogeneous index (heterogeneous in terms of which fields a given
record might contain), that the following rule might be helpful: Facet
on a given field to the extent that it is frequently set in the
documents matching the user's search.

For example, let's say my results look like this:

Doc A:
  f1: foo
  f2: bar
  f3: 
  f4: 

Doc B:
  f1: foo2
  f2: 
  f3: 
  f4: 

Doc C:
  f1: foo3
  f2: quiz
  f3: 
  f4: buzz

Doc D:
  f1: foo4
  f2: question
  f3: bam
  f4: bing

The field usage information for these documents could be summarized like this:

field f1: Set in 4 docs
field f2: Set in 3 docs
field f3: Set in 1 doc
field f4: Set in 2 docs

If I were choosing facet fields based on the above rule, I would
definitely want to display facets for field f1, since occurs in all
documents.  If I had room for another facet in the UI, I would facet
f2. If I wanted another one, I'd go with f4, since it's more popular
than f3. I probably would ignore f3 in any case, because it's set for
only one document.

Has anyone implemented such a scheme with Solr? Any success? (The
closest thing I can find is
http://wiki.apache.org/solr/ComplexFacetingBrainstorming, which tries
to pick which facets to display based not on frequency but based more
on a ruleset.)

As far as implementation, the most straightforward approach (which
wouldn't involve modifying Solr) would apparently be to add a new
multi-valued "fieldsindexed" field to each document, which would note
which fields actually have a value for each document. So when I pass
data to Solr at indexing time, it will look something like this
(except of course it will be in valid Solr XML, rather than this
schematic):

Doc A:
  f1: foo
  f2: bar
  indexedfields: f1, f2

Doc B:
  f1: foo2
  indexedfields: f1

Doc C:
  f1: foo3
  f2: quiz
  f4: buzz
  indexedfields: f1, f2, f4

Doc D:
  f1: foo4
  f2: question
  f3: bam
  f4: bing
  indexedfields: f1, f2, f3, f4

Then to choose which facets to display, I call

http://myserver/solr/search?q=myquery&facet=true&facet.field=indexedfields&facet.sort=true

and use the frequency information from this query to determine which
fields to display in the faceting UI. (To get the actual facet
information for those fields, I would query Solr a second time.)
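
A rough SolrJ sketch of the whole flow is below (host, query and field names
are placeholders and the code is untested; it is only meant to make the
indexing step and the two-pass query concrete):

import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class DynamicFacetFields {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://myserver/solr");

        // Indexing side: record which fields are actually populated per document.
        SolrInputDocument docA = new SolrInputDocument();
        docA.addField("id", "A");
        docA.addField("f1", "foo");
        docA.addField("f2", "bar");
        docA.addField("indexedfields", "f1");
        docA.addField("indexedfields", "f2");
        solr.add(docA);
        solr.commit();

        // Pass 1: which fields are populated most often in the result set?
        SolrQuery pass1 = new SolrQuery("myquery");
        pass1.setRows(0);
        pass1.setFacet(true);
        pass1.addFacetField("indexedfields");
        QueryResponse r1 = solr.query(pass1);
        // (a real implementation should check this for null/empty)
        List<FacetField.Count> fieldCounts = r1.getFacetField("indexedfields").getValues();

        // Pass 2: facet only on the most frequently populated fields.
        SolrQuery pass2 = new SolrQuery("myquery");
        pass2.setFacet(true);
        int wanted = Math.min(2, fieldCounts.size());
        for (FacetField.Count c : fieldCounts.subList(0, wanted)) {
            pass2.addFacetField(c.getName());
        }
        QueryResponse r2 = solr.query(pass2);
        System.out.println(r2.getFacetFields());
    }
}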

Are there any alternatives that would be easier or more efficient?

Thanks,
Chris


Re: Questions on FieldValueCache

2009-08-03 Thread Yonik Seeley
On Mon, Aug 3, 2009 at 4:18 PM, Stephen Duncan
Jr wrote:
> On Mon, Aug 3, 2009 at 2:43 PM, Yonik Seeley 
> wrote:
> Hmm, that's a hard thing to sell to the user and my boss, as it makes the
> query time go from nearly always being sub-second (frequently less than 60
> ms), to ranging up to nearly 4 seconds for a new query not already in the
> cache.  (My test was with 100 facets being requested, which may be
> reasonable, as one reason to facet on a full-text field is to provide a dynamic
> word cloud).

Could you possibly profile it to find out what the hotspot is?
We don't really have a good algorithm for faceting text fields, but it
would be nice to see what the current bottleneck is.

> How can I mitigate the time it takes with the enum method?  Do I need to ask
> for more facet values in my facet-warming query (I set facet.limit to 1 as
> it didn't seem to matter to the FieldValueCache)?

Yes, it matters for the enum method because of the smart
short-circuiting that takes place.
Use a base query that matches fewer documents than the size of the
sets you want cached.
Set the limit higher to avoid short circuiting

> And/Or do I need to up the
> autowarmCount on the FilterCache?

Not if you have a static warming query that includes the facets you
are interested in.

> If speed is the primary concern vs
> memory, should I bother with the minDf setting?

minDf is pretty much just for memory savings.  But if you turn it down
or eliminate it, make sure your filterCache is big enough to hold a
filter for each possible term.

> I guess I should update my code to use the enum method on all the fields
> that are likely to risk crossing this line.  Should I be looking at the
> termInstances property on the fields that are displayed in the
> FieldValueCache on the stats page, and figuring those on the order of 10
> million are likely to grow past the limit?

For an index over 16M docs, it's perhaps closer to 16M/avg_bytes_per_term*256.

The storage space for terms that aren't "big terms" (which come from
the fieldCache) is 256 byte arrays, each which can be up to 16MB in
size.  Every 65536 block of documents shares one of those byte arrays
(or more if you have more than 16M documents).  So the average
document can't take up more than 256 bytes in the array.  That doesn't
mean 256 term instances though... that's the max.  The list is delta
encoded vints, so if there are many terms, each vint could be bigger.
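
As a toy illustration (not Lucene's actual code) of why documents with many
unique terms cost more bytes under this delta + vint encoding:

public class VIntSketch {
    public static void main(String[] args) {
        int[] termOrds = {3, 10, 500, 70000};   // hypothetical sorted term ordinals for one doc
        java.io.ByteArrayOutputStream bytes = new java.io.ByteArrayOutputStream();
        int prev = 0;
        for (int ord : termOrds) {
            int delta = ord - prev;             // small gaps encode in fewer bytes
            prev = ord;
            while ((delta & ~0x7F) != 0) {      // 7 data bits per byte, high bit = "more follows"
                bytes.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            bytes.write(delta);
        }
        // More ordinals, or larger gaps, mean more bytes for this document --
        // which is what pushes a block of documents past its byte-array budget.
        System.out.println("per-doc bytes: " + bytes.size());
    }
}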

More details in UnInvertedField after the comment:
  //
  // transform intermediate form into the final form, building a
single byte[]
  // at a time, and releasing the intermediate byte[]s as we go to avoid
  // increasing the memory footprint.
  //

-Yonik
http://www.lucidimagination.com


Re: Solr Search probem w/ phrase searches, text type, w/ escaped characters

2009-08-03 Thread Andrzej Bialecki

Peter Keane wrote:

I've used Luke to figure out what is going on, and I see in the fields that
fail to match, a "null_1".  Could someone tell me what that is?  I see some
null_100s there as well, which seem to separate field values.  Clearly the
null_1s are causing the search to fail.


You used the "Reconstruct" function to obtain the field values for 
unstored fields, right? null_NNN is Luke's way of telling you that the 
tokens that should be on these positions are absent, because they were 
removed by analyzer during indexing, and there is no stored value of 
this field from which you could recover the original text. In other 
words, they are holes in the token stream, of length NNN.


Such holes may be also produced by artificially increasing the token 
positions, hence the null_100 that serves to separate multiple field 
values so that e.g. phrase queries don't match unrelated text.


Phrase queries that you can construct using QueryParser can't match two 
tokens separated by a hole, unless you set a slop value > 0.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Using Luke to get terms for docs matching a specific query filter?

2009-08-03 Thread Mark Bennett
You can get a nice list of terms for a field using the Luke handler:
http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000

But what I'd really like is to get the terms for the docs that match a
particular slice of the index.

For example, let's say I have records for all 50 states, but I want to get
the top 1,000 terms for documents in California.

I'd like to add q or fq like this:
http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&q=state:CA
OR
http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&fq=state:CA

Although I don't get any errors, this syntax doesn't seem to filter the
terms.  Not a bug, nobody ever said it would.

But has anybody written a utility to get term instances for a subset of the
index, based on a query?  And to be clear, I was hoping to get all of the
terms in matching documents, not just terms that are also present in the
query.

Thanks,
Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


Re: Questions on FieldValueCache

2009-08-03 Thread Stephen Duncan Jr
On Mon, Aug 3, 2009 at 4:56 PM, Yonik Seeley wrote:

> On Mon, Aug 3, 2009 at 4:18 PM, Stephen Duncan
> Jr wrote:
> > On Mon, Aug 3, 2009 at 2:43 PM, Yonik Seeley  >wrote:
> > Hmm, that's a hard thing to sell to the user and my boss, as it makes the
> > query time go from nearly always being sub-second (frequently less than
> 60
> > ms), to ranging up to nearly 4 seconds for a new query not already in the
> > cache.  (My test was with 100 facets being requested, which may be
> > reasonable, as one reason to facet on a full-text field to provide a
> dynamic
> > world-cloud).
>
> Could you possibly profile it to find out what the hotspot is?
> We don't really have a good algorithm for faceting text fields, but it
> would be nice to see what the current bottleneck is.


I'll put in my TODO list to try that out soon.  I'll let you know the
results if I manage.


>
> > I guess I should update my code to use the enum method on all the fields
> > that are likely to risk crossing this line.  Should I be looking at the
> > termInstances property on the fields that are displayed in the
> > FieldValueCache on the stats page, and figuring those on the order of 10
> > million are likely to grow past the limit?
>
> For an index over 16M docs, it's perhaps closer to
> 16M/avg_bytes_per_term*256.
>
> The storage space for terms that aren't "big terms" (which come from
> the fieldCache) is 256 byte arrays, each which can be up to 16MB in
> size.  Every 65536 block of documents shares one of those byte arrays
> (or more if you have more than 16M documents).  So the average
> document can't take up more than 256 bytes in the array.  That doesn't
> mean 256 term instances though... that's the max.  The list is delta
> encoded vints, so if there are many terms, each vint could be bigger.
>
> More details in UnInvertedField after the comment:
>  //
>  // transform intermediate form into the final form, building a
> single byte[]
>  // at a time, and releasing the intermediate byte[]s as we go to avoid
>  // increasing the memory footprint.
>  //
>
> -Yonik
> http://www.lucidimagination.com
>

Ok, a lot of that is going over-my-head for the moment.  I'll try to digest
this info a little further, but for now let's see if my minimal
understanding is correct:

What will cause me to exceed the limit and fail during faceting using the fc
method is if the documents within a block of 65536 combine to take up too
much space.  And this (generally speaking) is going to be a function of the
average number of unique terms in the documents?

-- 
Stephen Duncan Jr
www.stephenduncanjr.com


Re: Using Luke to get terms for docs matching a specific query filter?

2009-08-03 Thread Yonik Seeley
Sounds like faceting?
q=state:CA&facet=true&facet.field=title&facet.limit=1000

-Yonik
http://www.lucidimagination.com


On Mon, Aug 3, 2009 at 5:39 PM, Mark Bennett wrote:
> You can get a nice list of terms for a field using the Luke handler:
>    http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000
>
> But what I'd really like is to get the terms for the docs that match a
> particular slice of the index.
>
> For example, let's say I have records for all 50 states, but I want to get
> the top 1,000 terms for documents in California.
>
> I'd like to add q or fq like this:
>    http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&q=state:CA
>        OR
>    http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&fq=state:CA
>
> Although I don't get any errors, this syntax doesn't seem to filter the
> terms.  Not a bug, nobody ever said it would.
>
> But has anybody written a utility to get term instances for a subset of the
> index, based on a query?  And to be clear, I was hoping to get all of the
> terms in matching documents, not just terms that are also present in the
> query.
>
> Thanks,
> Mark
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>


Re: Wild card search does not return result but exact search does

2009-08-03 Thread Mohamed Parvez
On Mon, Aug 3, 2009 at 4:46 PM, Mohamed Parvez  wrote:

> Hello,
>
>I have two fields.
>
> 
> 
>
> I have a document that has a value of ICS for BUS and SSE for ROLE
>
> *When I search for q=BUS:ics i get the result. but if i search for
> q=BUS:ics* i don't get any match*
>
> when I search for q=ROLE:sse or q=ROLE:sse*, both times I get the
> result.
> *
> Why does BUS:ics* not return the result?
> *
>
> I have the default configuration for text filed type.
>
>  positionIncrementGap="100">
>   
> 
> 
> 
>  ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
> />
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
>   
>   
> 
>  ignoreCase="true" expand="true"/>
>  words="stopwords.txt"/>
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
>   
> 
>
> Thanks/Regards,
> Parvez
>
>


Re: change sort order for MoreLikeThis

2009-08-03 Thread Bill Au
Avlesh,
 You lost me.  How does boosting change the sort order?  What about
sorting on a field that is not the mlt field?

Bill

On Mon, Aug 3, 2009 at 3:13 AM, Avlesh Singh  wrote:

> You can boost the similarity field matches, if you want. Look for mlt.boost
> at http://wiki.apache.org/solr/MoreLikeThis
>
> Cheers
> Avlesh
>
> On Mon, Aug 3, 2009 at 11:33 AM, Renz Daluz  wrote:
>
> > Hi,
> >
> > I'm looking at changing the result order when searching by MLT. I tried
> the
> > sort=, but it's not working. I check the wiki and can't
> find
> > anything. Is there a way to do this?
> >
> > Thanks,
> > /Laurence
> >
>


Re: Questions on FieldValueCache

2009-08-03 Thread Yonik Seeley
On Mon, Aug 3, 2009 at 5:45 PM, Stephen Duncan
Jr wrote:
> What will cause me to exceed the limit and fail during faceting using the fc
> method is if the documents within a block of 65536 combine to take up too
> much space.  And this (generally speaking) is going to be a function of the
> average number of unique terms in the documents?

Yep... pretty much.  The comment at the head of UnInvertedField tries
to lay out the data structure.
Also subtract out "big terms" that match > 5% of the index... they
aren't uninverted.
And the block of 65536 docs wraps around after 256 blocks (since there
are only 256 arrays).

If you needed just a little more headroom, it wouldn't be to hard for
you to modify from 256 arrays to 512 (which would double the number of
terms allowed per block).  Or if you had much less than 16M docs in
the index, you could also lower the number of docs per block (all
custom code of course... it can't be done with a parameter today).

-Yonik
http://www.lucidimagination.com


Re: Using Luke to get terms for docs matching a specific query filter?

2009-08-03 Thread Mark Bennett
Ah!  Looks like that'll work.  Thanks Yonik!

For other folks listening in, he's suggesting not using Luke, and instead
reverting to a regular faceted query.

The full facet query URL would then be:

http://localhost:8983/solr/select?facet=true&facet.field=title&facet.limit=1000&q=state:CA
Vs. my attempted Luke URL of:
http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&q=state:CA

The output is still in XML, though the XPath to the terms is a bit
different.

The Facet XPath is something like:

/response/lst[@name='facet_count']/lst[@name='facet_fields']/lst[@name='title']/int/@name

The Luke XPath (terms for all docs) is something like:

/response/lst[@name='fields']/lst[@name='title']/lst[@name='topTerms']/int/@name

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Mon, Aug 3, 2009 at 2:49 PM, Yonik Seeley wrote:

> Sounds like faceting?
> q=state:CA&facet=true&facet.field=title&facet.limit=1000
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Mon, Aug 3, 2009 at 5:39 PM, Mark Bennett wrote:
> > You can get a nice list of terms for a field using the Luke handler:
> >http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000
> >
> > But what I'd really like is to get the terms for the docs that match a
> > particular slice of the index.
> >
> > For example, let's say I have records for all 50 states, but I want to
> get
> > the top 1,000 terms for documents in California.
> >
> > I'd like to add q or fq like this:
> >
> http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&q=state:CA
> >OR
> >
> http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&fq=state:CA
> >
> > Although I don't get any errors, this syntax doesn't seem to filter the
> > terms.  Not a bug, nobody ever said it would.
> >
> > But has anybody written a utility to get term instances for a subset of
> the
> > index, based on a query?  And to be clear, I was hoping to get all of the
> > terms in matching documents, not just terms that are also present in the
> > query.
> >
> > Thanks,
> > Mark
> >
> > --
> > Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> > Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
> >
>


Re: Solr Search probem w/ phrase searches, text type, w/ escaped characters

2009-08-03 Thread Peter Keane
Thanks!

Any idea why

Miguel : three dimensions : [Exhibitio

parses to: miguel, three, dimensions, exhibitio

BUT

Miguel : three dimensions : [Exhibition]

parses to miguel, three, dimensions, null_1, exhibition

seems quite strange...

--peter


On Mon, Aug 3, 2009 at 4:02 PM, Andrzej Bialecki  wrote:

> Peter Keane wrote:
>
>> I've used Luke to figure out what is going on, and I see in the fields
>> that
>> fail to match, a "null_1".  Could someone tell me what that is?  I see
>> some
>> null_100s there as well, which see to separate field values.  Clearly the
>> null_1s are causing the search to fail.
>>
>
> You used the "Reconstruct" function to obtain the field values for unstored
> fields, right? null_NNN is Luke's way of telling you that the tokens that
> should be on these positions are absent, because they were removed by
> analyzer during indexing, and there is no stored value of this field from
> which you could recover the original text. In other words, they are holes in
> the token stream, of length NNN.
>
> Such holes may be also produced by artificially increasing the token
> positions, hence the null_100 that serves to separate multiple field values
> so that e.g. phrase queries don't match unrelated text.
>
> Phrase queries that you can construct using QueryParser can't match two
> tokens separated by a hole, unless you set a slop value > 0.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


NoClassDefFoundError for org.apache.solr.handler.component.SearchComponent when running LocalSolr

2009-08-03 Thread Mat Brown
Hi all,

I just tried installing LocalSolr and LocalLucene - both the v2.0 tags
- in my Solr 1.3 instance, but when I start up I'm getting a
NoClassDefFoundError for
org.apache.solr.handler.component.SearchComponent. Grepping through
the source it appears that that class does exist in Solr 1.3, so I'm
not sure what could be wrong. Any help would be much appreciated -
thanks!

Mat

P.S. Here's the whole logging output when I try to start Solr:

java -Djetty.port=8983 -Dsolr.data.dir=/tmp/solr_data -jar start.jar
2009-08-03 19:46:09.599::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
2009-08-03 19:46:09.743::INFO:  jetty-6.1.3
2009-08-03 19:46:09.822::INFO:  Extract
jar:file:/home/mat/projects/sunspot/solr/webapps/solr.war!/ to
/tmp/Jetty_0_0_0_0_8983_solr.war__solr__k1kf17/webapp
Aug 3, 2009 7:46:10 PM org.apache.solr.servlet.SolrDispatchFilter init
INFO: SolrDispatchFilter.init()
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrResourceLoader locateInstanceDir
INFO: JNDI not configured for solr (NoInitialContextEx)
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrResourceLoader locateInstanceDir
INFO: solr home defaulted to 'solr/' (could not find system property or JNDI)
Aug 3, 2009 7:46:10 PM org.apache.solr.core.CoreContainer$Initializer initialize
INFO: looking for solr.xml: /home/mat/projects/sunspot/solr/solr/solr.xml
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrResourceLoader <init>
INFO: Solr home set to 'solr/'
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Reusing parent classloader
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrResourceLoader locateInstanceDir
INFO: JNDI not configured for solr (NoInitialContextEx)
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrResourceLoader locateInstanceDir
INFO: solr home defaulted to 'solr/' (could not find system property or JNDI)
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrResourceLoader <init>
INFO: Solr home set to 'solr/'
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrResourceLoader createClassLoader
INFO: Reusing parent classloader
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrConfig <init>
INFO: Loaded SolrConfig: solrconfig.xml
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrCore <init>
INFO: Opening new SolrCore at solr/, dataDir=/tmp/solr_data/
Aug 3, 2009 7:46:10 PM org.apache.solr.schema.IndexSchema readSchema
INFO: Reading Solr Schema
Aug 3, 2009 7:46:10 PM org.apache.solr.schema.IndexSchema readSchema
INFO: Schema name=sunspot
Aug 3, 2009 7:46:10 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created null: org.apache.solr.analysis.StandardTokenizerFactory
Aug 3, 2009 7:46:10 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created null: org.apache.solr.analysis.StandardFilterFactory
Aug 3, 2009 7:46:10 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created null: org.apache.solr.analysis.LowerCaseFilterFactory
Aug 3, 2009 7:46:10 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created text: org.apache.solr.schema.TextField
Aug 3, 2009 7:46:10 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created rand: org.apache.solr.schema.RandomSortField
Aug 3, 2009 7:46:10 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created boolean: org.apache.solr.schema.BoolField
Aug 3, 2009 7:46:10 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created sfloat: org.apache.solr.schema.SortableFloatField
Aug 3, 2009 7:46:10 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created date: org.apache.solr.schema.DateField
Aug 3, 2009 7:46:10 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created sint: org.apache.solr.schema.SortableIntField
Aug 3, 2009 7:46:10 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created string: org.apache.solr.schema.StrField
Aug 3, 2009 7:46:10 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created sdouble: org.apache.solr.schema.SortableDoubleField
Aug 3, 2009 7:46:10 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created slong: org.apache.solr.schema.SortableLongField
Aug 3, 2009 7:46:10 PM org.apache.solr.schema.IndexSchema readSchema
INFO: default search field is text
Aug 3, 2009 7:46:10 PM org.apache.solr.schema.IndexSchema readSchema
INFO: query parser default operator is AND
Aug 3, 2009 7:46:10 PM org.apache.solr.schema.IndexSchema readSchema
INFO: unique key field: id
Aug 3, 2009 7:46:10 PM org.apache.solr.core.JmxMonitoredMap <init>
INFO: No JMX servers found, not exposing Solr information with JMX.
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrCore parseListener
INFO: Searching for listeners: //listener[@event="firstSearcher"]
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrCore parseListener
INFO: Added SolrEventListener:
org.apache.solr.core.QuerySenderListener{queries=[{q=fast_warm,start=0,rows=10},
{q=static firstSearcher warming query from solrconfig.xml}]}
Aug 3, 2009 7:46:10 PM org.apache.solr.core.SolrCore parseListener

Highlighting question

2009-08-03 Thread Stephen Green
Hi, folks.  I'm trying to get a very simple example working with Solr
highlighting.  I have a default search field (called, unsurprisingly
"default-search-field") with text in it and I want query terms to be
highlighted in that field when I do a search.

I'm using an up to date (as of this evening) checkout of 1.4.  My
solrconfig.xml contains the following highlighting element:

<highlighting>
   <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
    <lst name="defaults">
     <int name="hl.fragsize">200</int>
    </lst>
   </fragmenter>

   <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">170</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
    </lst>
   </fragmenter>

   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
    <lst name="defaults">
     <str name="hl.simple.pre"><![CDATA[<em>]]></str>
     <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
   </formatter>
</highlighting>

I'm using SolrJ to talk to the Solr server.  Here's the code to do a
query, where qs is the query string.

SolrQuery q = new SolrQuery(qs);
q.setQueryType("dismax");
q.setHighlight(true);
q.setHighlightFragsize(250);
q.set("hl.formatter", "html");
q.set("hl.fragmenter", "regex");
q.setFields("default-search-field", "key");
QueryResponse resp = solr.query(q);

I've set up the dismax handler in solrconfig.xml to search the
default-search-field.

The Solr server logs the following request for this:

INFO: [] webapp=/solr path=/select
params={hl.fragsize=250&fl=default-search-field,key&hl.fragmenter=regex&q=garbage&hl.formatter=html&qt=dismax&wt=javabin&hl=true&version=1}
hits=953 status=0 QTime=39

which looks about right to me, but I don't see any highlighting in the results.

I'm clearly missing something pretty fundamental here, and any help
would be appreciated.

Steve Green


mergeFactor interaction in master/slaves

2009-08-03 Thread KaktuChakarabati

Hey all,
I was wondering - when running Solr in a master/slaves setup using the 1.3
snap* scripts,
does the slaves' solrconfig.xml mergeFactor value make any difference? as
far as I would assume,
the mergeFactor specified in the master solrconfig.xml dictates the format
of the index and then the slaves simply pull that and run from. is that
correct?

Thanks,
-Chak
-- 
View this message in context: 
http://www.nabble.com/mergeFactor-interaction-in-master-slaves-tp24800334p24800334.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: mergeFactor interaction in master/slaves

2009-08-03 Thread Mark Miller

KaktuChakarabati wrote:

Hey all,
I was wondering - when running Solr in a master/slaves setup using the 1.3
snap* scripts,
does the slaves' solrconfig.xml mergeFactor value make any difference? as
far as I would assume,
the mergeFactor specified in the master solrconfig.xml dictates the format
of the index and then the slaves simply pull that and run from. is that
correct?

Thanks,
-Chak
  
Right - the merge factor would only have an effect if you added docs to 
the slaves - otherwise you just get a copy of the master index.


--
- Mark

http://www.lucidimagination.com
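
For reference, the setting in question lives in the <indexDefaults> and
<mainIndex> sections of the master's solrconfig.xml; the values below are just
the stock example defaults, shown as a sketch:

  <indexDefaults>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>32</ramBufferSizeMB>
  </indexDefaults>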





Re: Using Luke to get terms for docs matching a specific query filter?

2009-08-03 Thread Mark Bennett
Yonik, can you confirm reasoning below for 1.4 for a text field?

( Of course faceting is so much faster in 1.4 anyway, it's probably worth
the upgrade.
 https://issues.apache.org/jira/browse/SOLR-475  )

A warning for folks NOT using 1.4:

At the bottom of this wiki page: (very bottom)
http://wiki.apache.org/solr/SimpleFacetParameters
It says:
Warming
facet.field queries using the term enumeration method can avoid the
evaluation of some terms for greater efficiency. To force the evaluation of
all terms for warming, the base query should match a single document.

I think this is OK in the newer version, because as of 1.4 the default is
"fc", not "enum".  But prior to 1.4 there was no fc!

Wiki info on the default (enum vs. fc)
http://wiki.apache.org/solr/SimpleFacetParameters

facet.method
This parameter indicates what type of algorithm/method to use when
faceting a field.

enum
Enumerates all terms in a field, calculating the set intersection of
documents that match the term with documents that match the query. This was
the default (and only) method for faceting multi-valued fields prior to Solr
1.4.

fc (stands for field cache)
The facet counts are calculated by iterating over documents that match
the query and summing the terms that appear in each document. This was the
default method for single valued fields prior to Solr 1.4.

The default value is fc (except for BoolField) since it tends to use less
memory and is faster when a field has many unique terms in the index.


--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Mon, Aug 3, 2009 at 2:49 PM, Yonik Seeley wrote:

> Sounds like faceting?
> q=state:CA&facet=true&facet.field=title&facet.limit=1000
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Mon, Aug 3, 2009 at 5:39 PM, Mark Bennett wrote:
> > You can get a nice list of terms for a field using the Luke handler:
> >http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000
> >
> > But what I'd really like is to get the terms for the docs that match a
> > particular slice of the index.
> >
> > For example, let's say I have records for all 50 states, but I want to
> get
> > the top 1,000 terms for documents in California.
> >
> > I'd like to add q or fq like this:
> >
> http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&q=state:CA
> >OR
> >
> http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&fq=state:CA
> >
> > Although I don't get any errors, this syntax doesn't seem to filter the
> > terms.  Not a bug, nobody ever said it would.
> >
> > But has anybody written a utility to get term instances for a subset of
> the
> > index, based on a query?  And to be clear, I was hoping to get all of the
> > terms in matching documents, not just terms that are also present in the
> > query.
> >
> > Thanks,
> > Mark
> >
> > --
> > Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> > Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
> >
>


Re: Highlighting question

2009-08-03 Thread Erik Hatcher

Is default-search-field stored (as specified in schema.xml)?

Erik


On Aug 3, 2009, at 8:05 PM, Stephen Green wrote:


Hi, folks.  I'm trying to get a very simple example working with Solr
highlighting.  I have a default search field (called, unsurprisingly
"default-search-field") with text in it and I want query terms to be
highlighted in that field when I do a search.

I'm using an up to date (as of this evening) checkout of 1.4.  My
solrconfig.xml contains the following highlighting element:

<highlighting>
   <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
    <lst name="defaults">
     <int name="hl.fragsize">200</int>
    </lst>
   </fragmenter>

   <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">170</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
    </lst>
   </fragmenter>

   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
    <lst name="defaults">
     <str name="hl.simple.pre"><![CDATA[<em>]]></str>
     <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
   </formatter>
</highlighting>

I'm using SolrJ to talk to the Solr server.  Here's the code to do a
query, where qs is the query string.

   SolrQuery q = new SolrQuery(qs);
   q.setQueryType("dismax");
   q.setHighlight(true);
   q.setHighlightFragsize(250);
   q.set("hl.formatter", "html");
   q.set("hl.fragmenter", "regex");
   q.setFields("default-search-field", "key");
   QueryResponse resp = solr.query(q);

I've set up the dismax handler in solrconfig.xml to search the
default-search-field.

The Solr server logs the following request for this:

INFO: [] webapp=/solr path=/select
params={hl.fragsize=250&fl=default-search-field,key&hl.fragmenter=regex&q=garbage&hl.formatter=html&qt=dismax&wt=javabin&hl=true&version=1}
hits=953 status=0 QTime=39

which looks about right to me, but I don't see any highlighting in  
the results.


I'm clearly missing something pretty fundamental here, and any help
would be appreciated.

Steve Green




Re: Highlighting question

2009-08-03 Thread Stephen Green
On Mon, Aug 3, 2009 at 8:34 PM, Erik Hatcher wrote:
> Is default-search-field stored (as specified in schema.xml)?

Yep:



While trying to figure this out, I went and did ant run-examples to
bring up the example in Jetty (I'm using Tomcat), and tried a couple
of queries in the resulting /solr/admin, and they don't appear to be
highlighted either.

Steve Green


Re: Highlighting question

2009-08-03 Thread Stephen Green
On Mon, Aug 3, 2009 at 8:38 PM, Stephen Green wrote:
> On Mon, Aug 3, 2009 at 8:34 PM, Erik Hatcher 
> wrote:
>> Is default-search-field stored (as specified in schema.xml)?
>
> Yep:
>
>    <field name="default-search-field" type="html" indexed="true" stored="true"
>        termVectors="true" multiValued="true"/>
>
> While trying to figure this out, I went and did ant run-examples to
> bring up the example in Jetty (I'm using Tomcat), and tried a couple
> of queries in the resulting /solr/admin, and they don't appear to be
> highlighted either.

Actually, if I check the highlighting box in the "full interface"
query option in the Solr admin panel, I notice that an element like:





Is added to the end of the results that are returned.

Oh, and thanks for the fast response, Erik :-)

Steve Green


Re: Highlighting question

2009-08-03 Thread Stephen Green
On Mon, Aug 3, 2009 at 8:45 PM, Stephen Green wrote:
> On Mon, Aug 3, 2009 at 8:38 PM, Stephen Green wrote:
>> On Mon, Aug 3, 2009 at 8:34 PM, Erik Hatcher 
>> wrote:
>>> Is default-search-field stored (as specified in schema.xml)?
>>
>> Yep:
>>
>>    <field name="default-search-field" type="html" indexed="true" stored="true"
>>        termVectors="true" multiValued="true"/>
>>
>> While trying to figure this out, I went and did ant run-examples to
>> bring up the example in Jetty (I'm using Tomcat), and tried a couple
>> of queries in the resulting /solr/admin, and they don't appear to be
>> highlighted either.
>
> Actually, if I check the highlighting box in the "full interface"
> query option in the Solr admin panel, I notice that an element like:
>
> 
> 
> 
>
> Is added to the end of the results that are returned.
>
> Oh, and thanks for the fast response, Erik :-)

OK, I think I might just be dumb.  The query response has a set of
highlighted things with references to the docs that were highlighted.
There's enough information there to create the highlighted
representation that I want.

Duh.

Steve Green
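
In SolrJ terms, pulling those snippets out looks roughly like the sketch
below. It assumes "key" is the schema's uniqueKey field and reuses the q and
solr objects from the original message:

  import java.util.List;
  import java.util.Map;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  QueryResponse resp = solr.query(q);
  // highlighting comes back keyed by uniqueKey value, then by field name
  Map<String, Map<String, List<String>>> highlighting = resp.getHighlighting();
  for (SolrDocument doc : resp.getResults()) {
      String id = String.valueOf(doc.getFieldValue("key"));
      Map<String, List<String>> perDoc = highlighting.get(id);
      if (perDoc != null) {
          List<String> snippets = perDoc.get("default-search-field");
          // render these snippets instead of the raw stored field value
      }
  }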


Re: Using Luke to get terms for docs matching a specific query filter?

2009-08-03 Thread Yonik Seeley
On Mon, Aug 3, 2009 at 8:26 PM, Mark Bennett wrote:
> Yonik, can you confirm reasoning below for 1.4 for a text field?

The bit about warming?  Looks right to me - a big base docset can
trigger short-circuit logic in the enum faceting code... using a
docset of size 1 currently avoids this.

-Yonik
http://www.lucidimagination.com


> ( Of course faceting is so much faster in 1.4 anyway, it's probably worth
> the upgrade.
>     https://issues.apache.org/jira/browse/SOLR-475  )
>
> A warning for folks NOT using 1.4:
>
> At the bottom of this wiki page: (very bottom)
>    http://wiki.apache.org/solr/SimpleFacetParameters
> It says:
>    Warming
>    facet.field queries using the term enumeration method can avoid the
> evaluation of some terms for greater efficiency. To force the evaluation of
> all terms for warming, the base query should match a single document.
>
> I think this is OK in the newer version, because as of 1.4 the default is
> "fc", not "enum".  But prior to 1.4 there was no fc!
>
> Wiki info on the default (enum vs. fc)
>    http://wiki.apache.org/solr/SimpleFacetParameters
>
> facet.method
>    This parameter indicates what type of algorithm/method to use when
> faceting a field.
>
> enum
>    Enumerates all terms in a field, calculating the set intersection of
> documents that match the term with documents that match the query. This was
> the default (and only) method for faceting multi-valued fields prior to Solr
> 1.4.
>
> fc (stands for field cache)
>    The facet counts are calculated by iterating over documents that match
> the query and summing the terms that appear in each document. This was the
> default method for single valued fields prior to Solr 1.4.
>
> The default value is fc (except for BoolField) since it tends to use less
> memory and is faster when a field has many unique terms in the index.
>
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
>
> On Mon, Aug 3, 2009 at 2:49 PM, Yonik Seeley 
> wrote:
>
>> Sounds like faceting?
>> q=state:CA&facet=true&facet.field=title&facet.limit=1000
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>> On Mon, Aug 3, 2009 at 5:39 PM, Mark Bennett wrote:
>> > You can get a nice list of terms for a field using the Luke handler:
>> >    http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000
>> >
>> > But what I'd really like is to get the terms for the docs that match a
>> > particular slice of the index.
>> >
>> > For example, let's say I have records for all 50 states, but I want to
>> get
>> > the top 1,000 terms for documents in California.
>> >
>> > I'd like to add q or fq like this:
>> >
>> http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&q=state:CA
>> >        OR
>> >
>> http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&fq=state:CA
>> >
>> > Although I don't get any errors, this syntax doesn't seem to filter the
>> > terms.  Not a bug, nobody ever said it would.
>> >
>> > But has anybody written a utility to get term instances for a subset of
>> the
>> > index, based on a query?  And to be clear, I was hoping to get all of the
>> > terms in matching documents, not just terms that are also present in the
>> > query.
>> >
>> > Thanks,
>> > Mark
>> >
>> > --
>> > Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
>> > Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>> >
>>
>


Re: change sort order for MoreLikeThis

2009-08-03 Thread Bill Au
I checked the javadoc of Lucene's MoreLikeThis and MoreLikeThisQuery and
don't see any method regarding sorting at all.  So my guess is we can't sort
the MoreLIkeThis result in Solr until this is supported in Lucene.
Bill

On Mon, Aug 3, 2009 at 5:54 PM, Bill Au  wrote:

> Avlesh,
>  You lost me.  How does boosting change the sort order?  What about
> sorting on a field that is not the mlt field?
>
> Bill
>
>
> On Mon, Aug 3, 2009 at 3:13 AM, Avlesh Singh  wrote:
>
>> You can boost the similarity field matches, if you want. Look for
>> mlt.boost
>> at http://wiki.apache.org/solr/MoreLikeThis
>>
>> Cheers
>> Avlesh
>>
>> On Mon, Aug 3, 2009 at 11:33 AM, Renz Daluz  wrote:
>>
>> > Hi,
>> >
>> > I'm looking at changing the result order when searching by MLT. I tried
>> the
>> > sort=, but it's not working. I check the wiki and can't
>> find
>> > anything. Is there a way to do this?
>> >
>> > Thanks,
>> > /Laurence
>> >
>>
>
>


Re: 99.9% uptime requirement

2009-08-03 Thread Otis Gospodnetic
Yes, daemontools or any kind of home-grown process-watching-and-restarting tool 
will work.
Regarding those caches - they look too large.
Also, the ramBufferSizeMB is irrelevant on search slaves.
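
For comparison, a more conservative starting point might look something like
this (the numbers are purely illustrative, not recommendations):

  <filterCache      class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
  <documentCache    class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>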

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
> From: Robert Petersen 
> To: solr-user@lucene.apache.org
> Sent: Monday, August 3, 2009 1:46:21 PM
> Subject: RE: 99.9% uptime requirement
> 
> So then would the 'right' thing to do be to run it under something like
> Daemontools so it bounces back up on a crash?  Do any other people use
> this approach or is there something better to make it come back up?
> 
> Speaking of overly large caches, if I have solr running on a machine
> with 8GB main memory is it going to hurt to make some huge cache sizes?
> Are these settings reasonable?  With a small index I have been getting
> some great hit-rates.
> 1024
> 
> 
> initialSize="512"  autowarmCount="80"/>
> 
> initialSize="512"  autowarmCount="80"/>
> 
> initialSize="512"  autowarmCount="0"/>
> 
> Thanks
> Robi
> 
> -Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
> Sent: Friday, July 31, 2009 11:37 PM
> To: solr-user@lucene.apache.org
> Subject: Re: 99.9% uptime requirement
> 
> Robi,
> 
> Solr is indeed very stable.  However, it can crash and I've seen it
> crash.  Or rather, I should say I've seen the JVM that runs Solr crash.
> For instance, if you have a servlet container with a number of webapps,
> one of which is Solr, and one of which has a memory leak, I believe all
> webapps will suffer and "crash".  And even if you have just Solr in your
> servlet container, it can OOM, say if you specify overly large caches or
> too frequent commits, etc.
> 
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> 
> 
> 
> - Original Message 
> > From: Robert Petersen 
> > To: solr-user@lucene.apache.org
> > Sent: Friday, July 31, 2009 12:18:55 PM
> > Subject: 99.9% uptime requirement
> > 
> > Hi all,
> > 
> > My solr project powers almost all the pages in our site and so needs
> to
> > be up period.  My question is what can I do to ensure that happens?
> > Does solr ever crash, assuming reasonable load conditions and no
> extreme
> > index sizes?
> > 
> > I saw some comments about running solr under daemontools in order to
> get
> > an auto-restart on crashes.  From what I have seen so far in my
> limited
> > experience, solr is very stable and never crashes (so far).  Does
> anyone
> > else have this requirement and if so how do they deal with it?  Is
> > anyone else running solr under daemontools in a production site?
> > 
> > Thanks for any input you might have,
> > Robi



Re: ClassCastException from custom request handler

2009-08-03 Thread Avlesh Singh
>
> I'm sure I have the class name right - changing it to something patently
> incorrect results in the expected "org.apache.solr.common.SolrException:
> Error loading class ...", rather than the ClassCastException.
>
You are right about that, James.

Which Solr version are you using?
Can you please paste the relevant pieces in your solrconfig.xml and the
request handler class you have created?

Cheers
Avlesh

On Mon, Aug 3, 2009 at 10:51 PM, James Brady wrote:

> Hi,
> Thanks for your suggestions!
>
> I'm sure I have the class name right - changing it to something patently
> incorrect results in the expected
> "org.apache.solr.common.SolrException: Error loading class ...", rather
> than
> the ClassCastException.
>
> I did have some problems getting my class on the app server's classpath.
> I'm
> running with solr.home set to "multicore", but creating a multicore/lib
> directory and putting my request handler class in there resulted in "Error
> loading class" errors.
>
> I found that setting jetty.class.path to include multicore/lib (and also
> explicitly point at Solr's core and common JARs) fixed the "Error loading
> class" errors, leaving these ClassCastExceptions...
>
> 2009/8/3 Avlesh Singh 
>
> > Can you cross check the class attribute for your handler in
> solrconfig.xml?
> > My guess is that it is specified as "solr.LiveCoresHandler". It should be
> > fully qualified class name - com.foo.path.to.LiveCoresHandler instead.
> >
> > Moreover, I am damn sure that you did not forget to drop your jar into
> > solr.home/lib. Checking once again might not be a bad idea :)
> >
> > Cheers
> > Avlesh
> >
> > On Mon, Aug 3, 2009 at 9:11 PM, James Brady  > >wrote:
> >
> > > Hi,
> > > I'm creating a custom request handler to return a list of live cores in
> > > Solr.
> > >
> > > On startup, I get this exception for each core:
> > >
> > > Jul 31, 2009 5:20:39 PM org.apache.solr.common.SolrException log
> > > SEVERE: java.lang.ClassCastException: LiveCoresHandler
> > >at
> > > org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
> > >at
> > > org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
> > >at
> > >
> > >
> >
> org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
> > >at
> > >
> > >
> >
> org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
> > >at org.apache.solr.core.SolrCore.<init>(SolrCore.java:444)
> > >
> > > I've tried a few variations on the class definition, including
> extending
> > > RequestHandlerBase (as suggested here:
> > >
> > >
> >
> http://wiki.apache.org/solr/SolrRequestHandler#head-1de7365d7ecf2eac079c5f8b92ee9af712ed75c2
> > > )
> > > and implementing SolrRequestHandler directly.
> > >
> > > I'm sure that the Solr libraries I built against and those I'm running
> on
> > > are the same version too, as I unzipped the Solr war file and copies
> the
> > > relevant jars out of there to build against.
> > >
> > > Any ideas on what could be causing the ClassCastException? I've
> attached
> > a
> > > debugger to the running Solr process but it didn't shed any light on
> the
> > > issue...
> > >
> > > Thanks!
> > > James
> > >
> >
>
>
>
> --
> http://twitter.com/goodgravy
> 512 300 4210
> http://webmynd.com/
> Sent from Bury, United Kingdom
>
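
For what it's worth, a minimal skeleton of such a handler (class name and
response payload are illustrative; package locations are as in the 1.3/1.4
code base) would be roughly:

  import org.apache.solr.handler.RequestHandlerBase;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.request.SolrQueryResponse;

  public class LiveCoresHandler extends RequestHandlerBase {
    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
      // illustrative payload: the names of all cores registered in this container
      rsp.add("cores", req.getCore().getCoreDescriptor().getCoreContainer().getCoreNames());
    }

    @Override public String getDescription() { return "Lists live cores"; }
    @Override public String getSourceId()    { return ""; }
    @Override public String getSource()      { return ""; }
    @Override public String getVersion()     { return ""; }
  }

One classpath note: if the Solr jars (or the handler jar) end up both on the
container's classpath and inside the webapp, the SolrRequestHandler interface
can be loaded by two different classloaders, which produces exactly this kind
of ClassCastException.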


Re: DIH: Any way to make update on db table?

2009-08-03 Thread Avlesh Singh
Couple of things -

   1. Your dataSource is probably in readOnly mode. It is possible to fire
   updates by specifying readOnly="false" in your dataSource.
   2. What you are trying to achieve is typically done using a "select for
   update". For MySql, here's the documentation -
   http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html
   3. You don't need to create a separate "entity" for firing updates.
   Writing a database procedure might be a good idea. In that case your query
   will simply be a call to that procedure (see the sketch below).
   All the heavy lifting can be done by this query.
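
   For illustration, a rough shape for points 1 and 3 (the procedure name is
   made up; it would have to both return the unindexed rows and flag them):

   <dataConfig>
     <dataSource type="JdbcDataSource"
                 driver="com.mysql.jdbc.Driver"
                 url="jdbc:mysql://localhost:3306/solrhacks"
                 user="user" password="pass"
                 readOnly="false"/>
     <document>
       <entity name="mainEntity" pk="id"
               query="call fetch_unindexed_and_mark()"/>
     </document>
   </dataConfig>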

Moreover, update queries only return the number of rows affected and not a
resultSet. DIH expects one, hence the exception.

Cheers
Avlesh

On Tue, Aug 4, 2009 at 1:49 AM, Jay Hill  wrote:

> Is it possible for the DataImportHandler to update records in the table it
> is querying? For example, say I have a query like this in my entity:
>
> query="select field1, field2, from someTable where hasBeenIndexed=false"
>
> Is there a way I can mark each record processed by updating the
> hasBeenIndexed field? Here's a config I tried:
>
> 
> 
>   type="JdbcDataSource"
>   driver="com.mysql.jdbc.Driver"
>   url="jdbc:mysql://localhost:3306/solrhacks"
>   user="user"
>   password="pass"/>
>
>  
>pk="id"
>query="select id, name from tableToIndex where
> hasBeenIndexed=0">
>  
>  
>
>pk="id"
>  query="update tableToIndex set hasBeenIndexed=1 where
> id=${mainEntity.id}">
>  
>
>  
> 
>
> It does update the first record, but then an Exception is thrown:
> Aug 3, 2009 1:15:24 PM org.apache.solr.handler.dataimport.DocBuilder
> buildDocument
> SEVERE: Exception while processing: mainEntity document :
> SolrInputDocument[{id=id(1.0)={1}, name=name(1.0)={John Jones}}]
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
> execute query: update tableToIndex set hasBeenIndexed=1 where id=1
> Processing Document # 1
>at
>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:250)
>at
>
> org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:207)
>at
>
> org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:40)
>at
>
> org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
>at
>
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
>at
>
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
>at
>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:344)
>at
>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:370)
>at
>
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:225)
>at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
>at
>
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
>at
>
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
>at
>
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
> Caused by: java.lang.NullPointerException
>at
>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:248)
>... 12 more
>
>
> -Jay
>


Re: Using Luke to get terms for docs matching a specific query filter?

2009-08-03 Thread Mark Bennett
So just make sure to use rows=1?

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Mon, Aug 3, 2009 at 5:51 PM, Yonik Seeley wrote:

> On Mon, Aug 3, 2009 at 8:26 PM, Mark Bennett wrote:
> > Yonik, can you confirm reasoning below for 1.4 for a text field?
>
> The bit about warming?  Looks right to me - a big base docset can
> trigger short-circuit logic in the enum faceting code... using a
> docset of size 1 currently avoids this.
>
> -Yonik
> http://www.lucidimagination.com
>
>
> > ( Of course faceting is so much faster in 1.4 anyway, it's probably worth
> > the upgrade.
> > https://issues.apache.org/jira/browse/SOLR-475  )
> >
> > A warning for folks NOT using 1.4:
> >
> > At the bottom of this wiki page: (very bottom)
> >http://wiki.apache.org/solr/SimpleFacetParameters
> > It says:
> >Warming
> >facet.field queries using the term enumeration method can avoid the
> > evaluation of some terms for greater efficiency. To force the evaluation
> of
> > all terms for warming, the base query should match a single document.
> >
> > I think this is OK in the newer version, because as of 1.4 the default is
> > "fc", not "enum".  But prior to 1.4 there was no fc!
> >
> > Wiki info on the default (enum vs. fc)
> >http://wiki.apache.org/solr/SimpleFacetParameters
> >
> > facet.method
> >This parameter indicates what type of algorithm/method to use when
> > faceting a field.
> >
> > enum
> >Enumerates all terms in a field, calculating the set intersection of
> > documents that match the term with documents that match the query. This
> was
> > the default (and only) method for faceting multi-valued fields prior to
> Solr
> > 1.4.
> >
> > fc (stands for field cache)
> >The facet counts are calculated by iterating over documents that match
> > the query and summing the terms that appear in each document. This was
> the
> > default method for single valued fields prior to Solr 1.4.
> >
> > The default value is fc (except for BoolField) since it tends to use less
> > memory and is faster when a field has many unique terms in the index.
> >
> >
> > --
> > Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> > Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
> >
> >
> > On Mon, Aug 3, 2009 at 2:49 PM, Yonik Seeley  >wrote:
> >
> >> Sounds like faceting?
> >> q=state:CA&facet=true&facet.field=title&facet.limit=1000
> >>
> >> -Yonik
> >> http://www.lucidimagination.com
> >>
> >>
> >> On Mon, Aug 3, 2009 at 5:39 PM, Mark Bennett
> wrote:
> >> > You can get a nice list of terms for a field using the Luke handler:
> >> >http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000
> >> >
> >> > But what I'd really like is to get the terms for the docs that match a
> >> > particular slice of the index.
> >> >
> >> > For example, let's say I have records for all 50 states, but I want to
> >> get
> >> > the top 1,000 terms for documents in California.
> >> >
> >> > I'd like to add q or fq like this:
> >> >
> >> http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&q=state:CA
> >> >OR
> >> >
> >>
> http://localhost:8983/solr/admin/luke?fl=title&numTerms=1000&fq=state:CA
> >> >
> >> > Although I don't get any errors, this syntax doesn't seem to filter
> the
> >> > terms.  Not a bug, nobody ever said it would.
> >> >
> >> > But has anybody written a utility to get term instances for a subset
> of
> >> the
> >> > index, based on a query?  And to be clear, I was hoping to get all of
> the
> >> > terms in matching documents, not just terms that are also present in
> the
> >> > query.
> >> >
> >> > Thanks,
> >> > Mark
> >> >
> >> > --
> >> > Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> >> > Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
> >> >
> >>
> >
>


Re: change sort order for MoreLikeThis

2009-08-03 Thread Avlesh Singh
>
> You lost me.
>
Absolutely sorry about that Bill :(

How does boosting change the sort order?

What I really meant here is that if you have more than one "similarity"
field in your MLT query, you can boost the results found due to one over the
other. It was not at all aimed to be an answer for sort. Actually, I was too
quick to respond!
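
For example (the handler path and field names here are made up), a request
like /solr/mlt?q=id:123&mlt.fl=name,features&mlt.qf=name^5 features^1&mlt.boost=true
would weight matches on "name" over matches on "features" when building the
MLT query, which changes ranking but is not the same thing as a sort.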

What about sorting on a field that is not the mlt field?
>
Haven't tried this yet. It would be surprising if it does not work as
expected.

Cheers
Avlesh

On Tue, Aug 4, 2009 at 3:24 AM, Bill Au  wrote:

> Avlesh,
> You lost me.  How does boosting change the sort order?  What about
> sorting on a field that is not the mlt field?
>
> Bill
>
> On Mon, Aug 3, 2009 at 3:13 AM, Avlesh Singh  wrote:
>
> > You can boost the similarity field matches, if you want. Look for
> mlt.boost
> > at http://wiki.apache.org/solr/MoreLikeThis
> >
> > Cheers
> > Avlesh
> >
> > On Mon, Aug 3, 2009 at 11:33 AM, Renz Daluz 
> wrote:
> >
> > > Hi,
> > >
> > > I'm looking at changing the result order when searching by MLT. I tried
> > the
> > > sort=, but it's not working. I check the wiki and can't
> > find
> > > anything. Is there a way to do this?
> > >
> > > Thanks,
> > > /Laurence
> > >
> >
>


Re: DIH: Any way to make update on db table?

2009-08-03 Thread Noble Paul നോബിള്‍ नोब्ळ्
If you are writing a Transformer (or any other component) you can get
hold of a dataSource instance.

 datasource =Context#getDataSource(name).
//then you can invoke
datasource.getData("update mytable ");
//though the name is getData() it can execute update commands also

ensure that you do a
datasource.close();
after you are done
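
A rough, untested sketch of that (the class is made up; the table and column
names are borrowed from Jay's config, and the dataSource is assumed to be
declared with name="db" in data-config.xml):

  import java.util.Map;
  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.DataSource;
  import org.apache.solr.handler.dataimport.Transformer;

  public class MarkIndexedTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
      // grab a fresh dataSource instance, fire the update, and always close it
      DataSource ds = context.getDataSource("db");
      try {
        ds.getData("update tableToIndex set hasBeenIndexed=1 where id=" + row.get("id"));
      } finally {
        ds.close();
      }
      return row;
    }
  }

It would be referenced from the entity via its fully qualified class name in
the transformer attribute.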

On Tue, Aug 4, 2009 at 9:40 AM, Avlesh Singh wrote:
> Couple of things -
>
>   1. Your dataSource is probably in readOnly mode. It is possible to fire
>   updates, by specifying readOnly="false" in your dataSource.
>   2. What you are trying achieve, is typically done using a "select for
>   update". For MySql, here's the documentation -
>   http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html
>   3. You don't need to create a separate "entity" for firing updates.
>   Writing a database procedure might be a good idea. In that case your query
>   will simply be  .
>   All the heavy lifting can be done by this query.
>
> Moreover, update queries, only return the number of rows affected and not a
> resultSet. DIH expects one and hence the exception.
>
> Cheers
> Avlesh
>
> On Tue, Aug 4, 2009 at 1:49 AM, Jay Hill  wrote:
>
>> Is it possible for the DataImportHandler to update records in the table it
>> is querying? For example, say I have a query like this in my entity:
>>
>> query="select field1, field2, from someTable where hasBeenIndexed=false"
>>
>> Is there a way I can mark each record processed by updating the
>> hasBeenIndexed field? Here's a config I tried:
>>
>> 
>> 
>>    >       type="JdbcDataSource"
>>       driver="com.mysql.jdbc.Driver"
>>       url="jdbc:mysql://localhost:3306/solrhacks"
>>       user="user"
>>       password="pass"/>
>>
>>  
>>    >            pk="id"
>>            query="select id, name from tableToIndex where
>> hasBeenIndexed=0">
>>      
>>      
>>
>>      >              pk="id"
>>              query="update tableToIndex set hasBeenIndexed=1 where
>> id=${mainEntity.id}">
>>      
>>    
>>  
>> 
>>
>> It does update the first record, but then an Exception is thrown:
>> Aug 3, 2009 1:15:24 PM org.apache.solr.handler.dataimport.DocBuilder
>> buildDocument
>> SEVERE: Exception while processing: mainEntity document :
>> SolrInputDocument[{id=id(1.0)={1}, name=name(1.0)={John Jones}}]
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
>> execute query: update tableToIndex set hasBeenIndexed=1 where id=1
>> Processing Document # 1
>>    at
>>
> >> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:250)
>>    at
>>
>> org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:207)
>>    at
>>
>> org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:40)
>>    at
>>
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
>>    at
>>
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
>>    at
>>
>> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
>>    at
>>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:344)
>>    at
>>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:370)
>>    at
>>
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:225)
>>    at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
>>    at
>>
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
>>    at
>>
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
>>    at
>>
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
>> Caused by: java.lang.NullPointerException
>>    at
>>
> >> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:248)
>>    ... 12 more
>>
>>
>> -Jay
>>
>



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: DIH: Any way to make update on db table?

2009-08-03 Thread Avlesh Singh
>
> datasource.getData("update mytable "); //though the name is getData()
> it can execute update commands also
>
Even when the dataSource is "readOnly", Noble?

Cheers
Avlesh

2009/8/4 Noble Paul നോബിള്‍ नोब्ळ् 

> If your are writing a Transformer (or any other component) you can get
> hold of a dataSource instance .
>
>  datasource =Context#getDataSource(name).
> //then you can invoke
> datasource.getData("update mytable ");
> //though the name is getData() it can execute update commands also
>
> ensure that you do a
> datasource.close();
> after you are done
>
> On Tue, Aug 4, 2009 at 9:40 AM, Avlesh Singh wrote:
> > Couple of things -
> >
> >   1. Your dataSource is probably in readOnly mode. It is possible to fire
> >   updates, by specifying readOnly="false" in your dataSource.
> >   2. What you are trying achieve, is typically done using a "select for
> >   update". For MySql, here's the documentation -
> >   http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html
> >   3. You don't need to create a separate "entity" for firing updates.
> >   Writing a database procedure might be a good idea. In that case your
> query
> >   will simply be   .../>.
> >   All the heavy lifting can be done by this query.
> >
> > Moreover, update queries, only return the number of rows affected and not
> a
> > resultSet. DIH expects one and hence the exception.
> >
> > Cheers
> > Avlesh
> >
> > On Tue, Aug 4, 2009 at 1:49 AM, Jay Hill  wrote:
> >
> >> Is it possible for the DataImportHandler to update records in the table
> it
> >> is querying? For example, say I have a query like this in my entity:
> >>
> >> query="select field1, field2, from someTable where hasBeenIndexed=false"
> >>
> >> Is there a way I can mark each record processed by updating the
> >> hasBeenIndexed field? Here's a config I tried:
> >>
> >> 
> >> 
> >> >>   type="JdbcDataSource"
> >>   driver="com.mysql.jdbc.Driver"
> >>   url="jdbc:mysql://localhost:3306/solrhacks"
> >>   user="user"
> >>   password="pass"/>
> >>
> >>  
> >> >>pk="id"
> >>query="select id, name from tableToIndex where
> >> hasBeenIndexed=0">
> >>  
> >>  
> >>
> >>   >>  pk="id"
> >>  query="update tableToIndex set hasBeenIndexed=1 where
> >> id=${mainEntity.id}">
> >>  
> >>
> >>  
> >> 
> >>
> >> It does update the first record, but then an Exception is thrown:
> >> Aug 3, 2009 1:15:24 PM org.apache.solr.handler.dataimport.DocBuilder
> >> buildDocument
> >> SEVERE: Exception while processing: mainEntity document :
> >> SolrInputDocument[{id=id(1.0)={1}, name=name(1.0)={John Jones}}]
> >> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
> >> execute query: update tableToIndex set hasBeenIndexed=1 where id=1
> >> Processing Document # 1
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:250)
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:207)
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:40)
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:344)
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:370)
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:225)
> >>at
> >>
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
> >> Caused by: java.lang.NullPointerException
> >>at
> >>
> >>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:248)
> >>... 12 more
> >>
> >>
> >> -Jay
> >>
> >
>
>
>
> --
> -
> Noble Paul | Principal Engineer| AOL | http://aol.com
>


Re: Using Luke to get terms for docs matching a specific query filter?

2009-08-03 Thread Yonik Seeley
On Tue, Aug 4, 2009 at 12:16 AM, Mark Bennett wrote:
> So just make sure to use rows=1?

No, make sure that the query matches one document - rows (the number
of top docs returned) is irrelevant to faceting.
So q=id:some_doc
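
In solrconfig.xml that warming query might be wired up roughly like this (the
doc id and field name are placeholders):

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- base query matches a single doc so every term gets enumerated -->
      <lst>
        <str name="q">id:some_doc</str>
        <str name="facet">true</str>
        <str name="facet.field">title</str>
        <str name="facet.method">enum</str>
      </lst>
    </arr>
  </listener>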

-Yonik
http://www.lucidimagination.com


Re: Picking Facet Fields by Frequency-in-Results

2009-08-03 Thread Avlesh Singh
I understand the general need here. And just extending what you suggested
(indexing the fields themselves inside a multiValued field), you can perform
a query like this -
/search?q=myquery&facet=true&facet.field=indexedfields&facet.field=field1&facet.field=field2...&facet.sort=true

You'll get facets for all the fields (passed as multiple facet.field
params), including the one that gives you field frequency. You can do all
sorts of post-processing on this data to achieve the desired result.
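
The extra field itself would just be a plain multiValued string field in
schema.xml, e.g.:

  <field name="indexedfields" type="string" indexed="true" stored="false" multiValued="true"/>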

Hope this helps.

Cheers
Avlesh

On Tue, Aug 4, 2009 at 2:20 AM, Chris Harris  wrote:

> One task when designing a facet-based UI is deciding which fields to
> facet on and display facets for. One possibility that I hope to
> explore is to determine which fields to facet on dynamically, based on
> the search results. In particular, I hypothesize that, for a somewhat
> heterogeneous index (heterogeneous in terms of which fields a given
> record might contain), the following rule might be helpful: Facet
> on a given field to the extent that it is frequently set in the
> documents matching the user's search.
>
> For example, let's say my results look like this:
>
> Doc A:
>  f1: foo
>  f2: bar
>  f3: 
>  f4: 
>
> Doc B:
>  f1: foo2
>  f2: 
>  f3: 
>  f4: 
>
> Doc C:
>  f1: foo3
>  f2: quiz
>  f3: 
>  f4: buzz
>
> Doc D:
>  f1: foo4
>  f2: question
>  f3: bam
>  f4: bing
>
> The field usage information for these documents could be summarized like
> this:
>
> field f1: Set in 4 docs
> field f2: Set in 3 docs
> field f3: Set in 1 doc
> field f4: Set in 2 docs
>
> If I were choosing facet fields based on the above rule, I would
> definitely want to display facets for field f1, since occurs in all
> documents.  If I had room for another facet in the UI, I would facet
> f2. If I wanted another one, I'd go with f4, since it's more popular
> than f3. I probably would ignore f3 in any case, because it's set for
> only one document.
>
> Has anyone implemented such a scheme with Solr? Any success? (The
> closest thing I can find is
> http://wiki.apache.org/solr/ComplexFacetingBrainstorming, which tries
> to pick which facets to display based not on frequency but based more
> on a ruleset.)
>
> As far as implementation, the most straightforward approach (which
> wouldn't involve modifying Solr) would apparently be to add a new
> multi-valued "fieldsindexed" field to each document, which would note
> which fields actually have a value for each document. So when I pass
> data to Solr at indexing time, it will look something like this
> (except of course it will be in valid Solr XML, rather than this
> schematic):
>
> Doc A:
>  f1: foo
>  f2: bar
>  indexedfields: f1, f2
>
> Doc B:
>  f1: foo2
>  indexedfields: f1
>
> Doc C:
>  f1: foo3
>  f2: quiz
>  f4: buzz
>  indexedfields: f1, f2, f4
>
> Doc D:
>  f1: foo4
>  f2: question
>  f3: bam
>  f4: bing
>  indexedfields: f1, f2, f3, f4
>
> Then to chose which facets to display, I call
>
>
> http://myserver/solr/search?q=myquery&facet=true&facet.field=indexedfields&facet.sort=true
>
> and use the frequency information from this query to determine which
> fields to display in the faceting UI. (To get the actual facet
> information for those fields, I would query Solr a second time.)
>
> Are there any alternatives that would be easier or more efficient?
>
> Thanks,
> Chris
>


Re: ClassCastException from custom request handler

2009-08-03 Thread Noble Paul നോബിള്‍ नोब्ळ्
I guess a RequestHandler may not be the right component to implement here.
Try implementing a CoreAdminHandler instead.

http://wiki.apache.org/solr/CoreAdmin#head-dfab3233db613c1976e05cdf894aa326ca64d312
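
Note that the stock CoreAdminHandler already has a STATUS action that lists
the registered cores, e.g. (assuming the default adminPath from solr.xml):

  http://localhost:8983/solr/admin/cores?action=STATUS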

On Mon, Aug 3, 2009 at 9:11 PM, James Brady wrote:
> Hi,
> I'm creating a custom request handler to return a list of live cores in
> Solr.
>
> On startup, I get this exception for each core:
>
> Jul 31, 2009 5:20:39 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.ClassCastException: LiveCoresHandler
>        at
> org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:152)
>        at
> org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:161)
>        at
> org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
>        at
> org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:169)
>        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:444)
>
> I've tried a few variations on the class definition, including extending
> RequestHandlerBase (as suggested here:
> http://wiki.apache.org/solr/SolrRequestHandler#head-1de7365d7ecf2eac079c5f8b92ee9af712ed75c2)
> and implementing SolrRequestHandler directly.
>
> I'm sure that the Solr libraries I built against and those I'm running on
> are the same version too, as I unzipped the Solr war file and copies the
> relevant jars out of there to build against.
>
> Any ideas on what could be causing the ClassCastException? I've attached a
> debugger to the running Solr process but it didn't shed any light on the
> issue...
>
> Thanks!
> James
>



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


JVM Heap utilization & Memory leaks with Solr

2009-08-03 Thread Rahul R
I am trying to track memory utilization with my Application that uses Solr.
Details of the setup :
 -3rd party Software : Solaris 10, Weblogic 10, jdk_150_14, Solr 1.3.0
- Hardware : 12 CPU, 24 GB RAM

For testing during PSR I am using a smaller subset of the actual data that I
want to work with. Details of this smaller sub-set :
- 5 million records, 4.5 GB index size

Observations during PSR:
A) I have allocated 3.2 GB for the JVM(s) that I used. After all users
logout and doing a force GC, only 60 % of the heap is reclaimed. As part of
the logout process I am invalidating the HttpSession and doing a close() on
CoreContainer. From my application's side, I don't believe I am holding on
to any resource. I wanted to know if there are known issues surrounding
memory leaks with Solr ?
B) To further test this, I tried deploying with shards. 3.2 GB was allocated
to each JVM. All JVMs had 96 % free heap space after start up. I got varying
results with this.
Case 1 : Used 6 weblogic domains. My application was deployed on 1 domain.
I split the 5 million index into 5 parts of 1 million each and used them as
shards. After multiple users used the system and doing a force GC, around 94
- 96 % of heap was reclaimed in all the JVMs.
Case 2: Used 2 weblogic domains. My application was deployed on 1 domain. On
the other, I deployed the entire 5 million part index as one shard. After
multiple users used the system and doing a force GC, around 76 % of the heap
was reclaimed in the shard JVM. And 96 % was reclaimed in the JVM where my
application was running. This result further convinces me that my
application can be absolved of holding on to memory resources.

I am not sure how to interpret these results ? For searching, I am using
Without Shards : EmbeddedSolrServer
With Shards :CommonsHttpSolrServer
In terms of Solr objects this is what differs in my code between normal
search and shards search (distributed search)

After looking at Case 1, I thought that the CommonsHttpSolrServer was more
memory efficient but Case 2 proved me wrong. Or could there still be memory
leaks in my application ? Any thoughts, suggestions would be welcome.

Regards
Rahul


Re: Rotating the primary shard in /solr/select

2009-08-03 Thread Rahul R
Philip,
I cannot answer your question, but I do have a question for you. Does
aggregation happen at the primary shard ? For eg : if I have three JVMs
JVM 1 : My application powered by Solr
JVM 2 : Shard 1
JVM 3 : Shard 2

I initialize my SolrServer like this
SolrServer _solrServer = *new* CommonsHttpSolrServer(shard1);

Does aggregation now happen at JVM 2 ? Is there any other reason for
initializing the SolrServer with one of the shard URLs ?

On Wed, Jul 29, 2009 at 2:57 AM, Phillip Farber  wrote:

>
> Is there any value in a round-robin scheme to cycle through the Solr
> instances supporting a multi-shard index over several machines when sending
> queries or is it better to just pick one instance and stick with it.  I'm
> assuming all machines in the cluster have the same hardware specs.
>
> So scenario A (round-robin):
>
> query 1: /solr-shard-1/select?q=dog... shards=shard-1,shard2
> query 2: /solr-shard-2/select?q=dog... shards=shard-1,shard2
> query 3: /solr-shard-1/select?q=dog... shards=shard-1,shard2
> etc.
>
> or or scenario B (fixed):
>
> query 1: /solr-shard-1/select?q=dog... shards=shard-1,shard2
> query 2: /solr-shard-1/select?q=dog... shards=shard-1,shard2
> query 3: /solr-shard-1/select?q=dog... shards=shard-1,shard2
> etc.
>
> Is there evidence that distributing the overhead of result merging over
> more machines (A) gives a performance boost?
>
> Thanks,
>
> Phil
>
>
>


Re: mergeFactor / indexing speed

2009-08-03 Thread Chantal Ackermann

Hi Avlesh,
hi Otis,
hi Grant,
hi all,


(enumerating to keep track of all the input)

a) mergeFactor 1000 too high
I'll change that back to 10. I thought it would make Lucene use more RAM 
before starting IO.


b) ramBufferSize:
OK, or maybe more. I'll keep that in mind.
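For example, something along these lines in the indexDefaults/mainIndex
sections (the numbers are just a starting point, not a recommendation):

  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>128</ramBufferSizeMB>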

c) solrconfig.xml - default and main index:
I've always changed both sections, the default and the main index one.

d) JDBC batch size:
I haven't set it. I'll do that.
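For example (driver and url are placeholders; for MySQL, batchSize="-1"
switches the driver to row-by-row streaming):

  <dataSource type="JdbcDataSource" driver="..." url="jdbc:..." batchSize="500"/>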

e) DB server performance:
I agree, ping is definitely not much information. I also did queries 
from my own computer towards it (while the indexer ran) which came back 
as fast as usual.
Currently, I don't have any login to ssh to that machine, but I'm going
to try to get one.


f) Network:
I'll definitely need to have a look at that once I have access to the db 
machine.



g) the data

g.1) nested entity in DIH conf
there is only the root and one nested entity. However, that nested 
entity returns multiple rows (about 10) for one query. (Fetched rows is 
about 10 times the number of processed documents.)


g.2) my custom EntityProcessor
( The code is pasted at the very end of this e-mail. )
- iterates over those multiple rows,
- uses one column to create a key in a map,
- uses two other columns to create the corresponding value (String 
concatenation),
- if a key already exists, it gets the value, if that value is a list, 
it adds the new value to that list, if it's not a list, it creates one 
and adds the old and the new value to it.
I refrained from adding any business logic to that processor. It treats 
all rows alike, no matter whether they hold values that can appear 
multiple or values that must appear only once.


g.3) the two transformers
- to split one value into two (regex)


regex="[^\|]+\|\d+,\d+,\d+,(.*)"/>


- to extract a number from an existing number (bit calculation
using the script transformer). As that one works on a field that is 
potentially multiValued, it needs to take care of creating and 
populating a list, as well.


>8);
if (!arr.contains(mainCat)) {
arr.add(mainCat);
}
}
row.put('maincat', arr);
} else { // it is a single value
var mainCat = new java.lang.Integer(cat>>8);
row.put('maincat', mainCat);
}
}
return row;
}
]]>
(The EpgValueEntityProcessor decides on creating lists on a case by case 
basis: only if a value is specified multiple times for a certain data 
set does it create a list. This is because I didn't want to put any 
complex configuration or business logic into it.)


g.4) fields
the DIH extracts 5 fields from the root entity, 11 fields from the 
nested entity, and the transformers might create an additional 3 (multiValued).
schema.xml defines 21 fields (two additional fields: the timestamp field 
(default="NOW") and a field collecting three other text fields for 
default search (using copy field)):

- 2 long
- 3 integer
- 3 sint
- 3 date
- 6 text_cs (class="solr.TextField" positionIncrementGap="100"):


generateWordParts="0" generateNumberParts="0" catenateWords="0" 
catenateNumbers="0" catenateAll="0" />


- 4 text_de (one is the field populated by copying from the 3 others):



words="stopwords_de.txt" />

generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
catenateAll="0" splitOnCaseChange="1" />







Thank you for taking your time!
Cheers,
Chantal





** EpgValueEntityProcessor.java ***

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

public class EpgValueEntityProcessor extends SqlEntityProcessor {
private static final Logger log = Logger
.getLogger(EpgValueEntityProcessor.class.getName());
	private static final String ATTR_ID_EPG_DEFINITION = 
"columnIdEpgDefinition";

private static final String ATTR_COLUMN_ATT_NAME = "columnAttName";
private static final String ATTR_COLUMN_EPG_VALUE = "columnEpgValue";
private static final String ATTR_COLUMN_EPG_SUBVALUE = 
"columnEpgSubvalue";
private static final String DEF_ATT_NAME = "ATT_NAME";
private static final String DEF_EPG_VALUE = "EPG_VALUE";
private static final String DEF