Re: [Lucene2.0]How to not highlight keywords in some fields?

2006-09-26 Thread markharw00d

Pass a field name to the QueryScorer constructor.
See the testFieldSpecificHighlighting method in the JUnit test for the 
highlighter for an example.
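For reference, here is a minimal, self-contained sketch of that (not from the original mail; the field names and sample text are illustrative, and the exact QueryScorer constructors should be checked against your Lucene/highlighter version):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class FieldSpecificHighlight {
    public static void main(String[] args) throws IOException, ParseException {
        QueryParser parser = new QueryParser("text", new StandardAnalyzer());
        Query query = parser.parse("text:foo AND num:1");

        // Passing the field name makes the scorer ignore terms that target
        // other fields (the num:1 clause), so only "foo" is highlighted.
        QueryScorer scorer = new QueryScorer(query, "text");
        Highlighter highlighter = new Highlighter(scorer);

        String fragment = highlighter.getBestFragment(
                new StandardAnalyzer(), "text", "foo bar 1");
        System.out.println(fragment);   // "foo" comes back wrapped in the default <B> tags
    }
}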


Cheers
Mark


zhu jiang wrote:

Hi all,

   For example, if I have a document with two fields text and num like
this:

text:foo bar 1
num:1


When a user queries foo, I rewrite the query to text:foo AND num:1, and both
foo and 1 in the text field get highlighted. I don't want the word
1 in the text field to be highlighted. What should I do? Please help me.










-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re[3]: how to enhance speed of sorted search

2006-09-26 Thread Yura Smolsky
Hello, Chris.

CH 3) most likely, if you are seeing slow performance from sorted searches,
CH the time spent scoring the results isn't the biggest contributor to how
CH long the search takes -- it tends to be negligible for most queries.  A
CH better question is: are you reusing the exact same IndexReader /
CH IndexSearcher instance for every query? ... if not, that right there is
CH going to be your biggest problem, because it will prevent you from being
CH able to reuse the FieldCache needed when sorting results.

Sure, I do reuse the IndexSearcher :) and the second query is always faster
than the first one...

I am wondering whether this should be faster:

Query query = new QueryParser("text", new StandardAnalyzer()).parse("good boy");
searcher.search(
  new ConstantScoreQuery(new QueryFilter(query)),
  sortByIntField);

than the usual search:

searcher.search(
  query,
  sortByIntField);

Is there any way I could use a filter to get the needed results from the
query?

--
Yura Smolsky,
http://altervisionmedia.com/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



How to tell if IndexSearcher/IndexReader was closed?

2006-09-26 Thread Frank Kunemann
Hi all,

After I delete some entries from the index, I close the IndexSearcher to
ensure that the changes take effect.
But after this I couldn't figure out a way to tell whether the searcher is
closed or not.
Any ideas?


Regards
Frank


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to tell if IndexSearcher/IndexReader was closed?

2006-09-26 Thread Simon Willnauer

I guess there are many possibilities to implement some control
structure to track the references to your searcher / reader. As it is
best practice to have a single searcher open, you can track references
to the searcher while one reference is held by the class you request
your searcher from (I will call it the controller class). When you close
your searcher you decrement the reference held by the controller, and it
creates a new searcher. If no reference to the searcher remains, you
close the searcher and it will get garbage collected.

If you use this kind of pattern you might have more than one searcher
open for a short time, but once the last search client has decremented
the reference, the searcher will be closed. You don't have to care
whether the searcher is closed or not; you won't get a reference to a
closed searcher instance.

Solr and the GData Server use this kind of reference tracking for this purpose.

have a look at 
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/gdata-server/src/java/org/apache/lucene/gdata/utils/ReferenceCounter.java?view=markup
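As a rough illustration of the idea (this is not the GData ReferenceCounter itself; the class and method names below are made up), a reference-counted searcher hand-out might look like:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.IndexSearcher;

public class SearcherManager {

    private final String indexDir;
    private IndexSearcher current;
    // reference count per searcher; the manager itself holds one reference
    private final Map counts = new HashMap();   // IndexSearcher -> Integer

    public SearcherManager(String indexDir) throws IOException {
        this.indexDir = indexDir;
        current = new IndexSearcher(indexDir);
        counts.put(current, new Integer(1));
    }

    // Hand out the shared searcher; one more client now holds a reference.
    public synchronized IndexSearcher acquire() {
        counts.put(current, new Integer(count(current) + 1));
        return current;
    }

    // Clients call this instead of searcher.close(); the searcher really
    // closes only when its last reference is gone.
    public synchronized void release(IndexSearcher s) throws IOException {
        int c = count(s) - 1;
        if (c <= 0) { counts.remove(s); s.close(); }
        else        { counts.put(s, new Integer(c)); }
    }

    // After index changes: open a fresh searcher and drop the manager's own
    // reference on the old one; it closes once its last client releases it.
    public synchronized void reopen() throws IOException {
        IndexSearcher old = current;
        current = new IndexSearcher(indexDir);
        counts.put(current, new Integer(1));
        release(old);
    }

    private int count(IndexSearcher s) {
        Integer c = (Integer) counts.get(s);
        return c == null ? 0 : c.intValue();
    }
}

A client then does IndexSearcher s = manager.acquire(); try { ... search ... } finally { manager.release(s); } and never calls searcher.close() directly.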


best regards Simon
On 9/26/06, Frank Kunemann [EMAIL PROTECTED] wrote:

Hi all,

after I delete some entries from the index, I close the IndexSearcher to
ensure that the changes are done.
But after this I couldn't figure out a way to tell if the searcher is closed
or not.
Any ideas?


Regards
Frank


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



How to remove duplicate records from result

2006-09-26 Thread Bhavin Pandya
Hi,

I searched the index and found, say, 1000 records, but out of those 1000 records
I want to filter out duplicates based on the value of one field.

Is there any way except looping through the whole Hits object?
Because that won't work when the number of hits is too large...

Thanks.
Bhavin pandya

spell checker with lucene

2006-09-26 Thread Bhavin Pandya
Hi,
Does anybody have an idea for a spell checker in Java?
I want to use it with Lucene... but it must also work well for phrases...
-Bhavin pandya


Re: searching for the part of a term.

2006-09-26 Thread heritrix . lucene

Hi,
While searching the forum for my problem of searching for a substring, I found a
few very good links.

http://www.gossamer-threads.com/lists/lucene/java-user/39753?search_string=Bitset%20filter;#39753
http://www.gossamer-threads.com/lists/lucene/java-user/7813?search_string=substring;#7813
http://www.gossamer-threads.com/lists/lucene/java-user/5931?search_string=substring;#5931

In the first, WildcardTermEnum is used.

I tried this, but it takes a lot of time to search.


The other solution I found was to create a TokenStream that splits a token
into multiple tokens and then index those tokens, e.g. google into google,
oogle, ogle.
Then, while searching, build a prefix query and search.

But this seems to create a lot of tokens from one token, resulting in an index
many times bigger than if we index a single token.

Since the drawback of the first approach is the speed of the system, I think
adopting the second method will be better.
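(A minimal sketch, not from the original mail, of what such a suffix-emitting filter could look like against the Lucene 2.0 TokenStream API; the minimum suffix length is an arbitrary choice:)

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SuffixTokenFilter extends TokenFilter {
    private static final int MIN_LENGTH = 3;
    private final LinkedList pending = new LinkedList();   // queued suffix Tokens

    public SuffixTokenFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return (Token) pending.removeFirst();
        }
        Token token = input.next();
        if (token == null) {
            return null;
        }
        String text = token.termText();
        // queue every suffix of at least MIN_LENGTH chars, e.g. google -> oogle, ogle, ...
        for (int i = 1; i <= text.length() - MIN_LENGTH; i++) {
            Token suffix = new Token(text.substring(i),
                                     token.startOffset(), token.endOffset());
            suffix.setPositionIncrement(0);   // stack the suffixes at the same position
            pending.add(suffix);
        }
        return token;   // the full term is emitted first
    }
}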

Is there any other solution to this problem? Am I going in the right
direction?

It'll be great to see your response...

Regards,









On 9/23/06, heritrix. lucene [EMAIL PROTECTED] wrote:


Hi All,

How can I make my search so that if I am looking for the term counting,
documents containing accounting also come up?

Similarly, if I am looking for the term workload, documents containing work
should also come up as a search result.

A wildcard query seems to work in the first case, but if the index is
very big, it throws a TooManyClauses exception.

Is there a way to resolve this issue, apart from indexing n-grams of each
term?


Regards,





Re: does anyone know of a 'smart' categorizing text pattern finder?

2006-09-26 Thread Otis Gospodnetic
Look at LingPipe from Alias-i.com.  Look at Named Entity extraction and its 
classifiers.

Otis


- Original Message 
From: Vladimir Olenin [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Monday, September 25, 2006 9:49:31 PM
Subject: does anyone know of a 'smart' categorizing text pattern finder?


Hi,

I wonder if anyone here knows if there is a 'smart' text pattern finder, 
ideally written in Java. The library I'm looking for should be able to 'guess' 
the category of the particular text on the page, most probably by finding 
similarities between the bulk of the pages and a set of templates.

E.g., many forums are powered by phpBB, which structures 99% of the pages (except 
for some title pages and user profile pages) in a very similar fashion (the page is 
broken into blocks, each block is broken into further blocks, etc). By 
comparing many pages with each other (e.g., from the same domain root: 
forum.springframework.org) it should be possible to detect common ('template 
decoration') and page-specific (actual content, like 'user name' and 'posting 
body') parts. After that it should further be possible, by comparing the 'template 
decoration' parts to a set of templates, to 'guess' the nature of each of the 
'page specific' blocks (e.g., 'Vladimir Olenin' in the left side column will be 
marked as 'name', while whatever is adjacent to this column is the post body).

So, I wonder if anyone knows of a package capable of such things. The primary goal 
though is simpler: to be able to parse out just posters' names from message 
boards. Though sometimes the 'block category' can be derived from the CSS class 
name of the tags around the text, it's very often not the case.

Might Nutch have similar functionality built into their crawler?

Thanks.

Vlad




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: spell checker with lucene

2006-09-26 Thread Otis Gospodnetic
A Lucene-based one is described on the Wiki.  Another is the one from 
LingPipe; it may not be free, depending on what you do with it.

Otis

- Original Message 
From: Bhavin Pandya [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, September 26, 2006 8:50:14 AM
Subject: spell checker with lucene

Hi,
Does anybody have an idea for a spell checker in Java?
I want to use it with Lucene... but it must also work well for phrases...
-Bhavin pandya




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Caused by: java.io.IOException: The handle is invalid

2006-09-26 Thread Michael McCandless

Van Nguyen wrote:

I only get this error when using the server version of jvm.dll with my 
JBoss app server… but when I use the client version of jvm.dll, the same 
index builds just fine. 


This is an odd error.  Which OS are you running on?  And, what kind of 
filesystem is the index directory on?


It's surprising that client vs server JRE causes this.

Is the exception easily reproduced or is it intermittent?

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to remove duplicate records from result

2006-09-26 Thread Otis Gospodnetic
You could do it with a custom HitCollector, no?
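(A rough sketch of that idea, not from the original mail: the field name is made up, "first hit per value" here means lowest doc id, and it assumes a single IndexSearcher over one reader so the FieldCache doc ids line up:)

import java.io.IOException;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

public class DedupCollector extends HitCollector {
    private final String[] keys;                 // field value per document
    private final Set seen = new HashSet();      // values already collected
    private final BitSet kept = new BitSet();    // surviving doc ids

    public DedupCollector(IndexReader reader, String field) throws IOException {
        keys = FieldCache.DEFAULT.getStrings(reader, field);
    }

    public void collect(int doc, float score) {
        if (seen.add(keys[doc])) {   // first document with this value
            kept.set(doc);           // scores could also be remembered here
        }
    }

    public BitSet getKeptDocs() {
        return kept;
    }
}

It would be used as searcher.search(query, new DedupCollector(reader, "dedupKey")) and the surviving doc ids read back from getKeptDocs().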

Otis

- Original Message 
From: Bhavin Pandya [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, September 26, 2006 8:43:56 AM
Subject: How to remove duplicate records from result

Hi,

I searched the index and found, say, 1000 records, but out of those 1000 records
I want to filter out duplicates based on the value of one field.

Is there any way except looping through the whole Hits object?
Because that won't work when the number of hits is too large...

Thanks.
Bhavin pandya



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Ordered positions

2006-09-26 Thread Virlouvet Olivier
Hi 
   
 In the javadoc, IndexReader.termPositions() maps to the definition:

   Term => <docNum, freq, <pos1, pos2, ... pos(freq-1)>>*

 where the returned enumeration is ordered by doc number.

 Are positions ordered within each doc or not?
   
  Thanks
  Olivier



Re: Advice on Custom Sorting

2006-09-26 Thread Paul Lynch
Thanks again Erick for taking the time.

I agree that the CachingWrapperFilter, as described
under using a custom filter in LIA, is probably my
best bet. I wanted to check whether anything I wasn't
aware of had been added in Lucene releases since the
book was written.

Cheers again.

--- Erick Erickson [EMAIL PROTECTED] wrote:

 You were probably right. See below
 
 On 9/25/06, Paul Lynch [EMAIL PROTECTED] wrote:
 
  Thanks for the quick response Erick.
 
  index the documents in your preferred list with a
  field preferredsubid and index your non-preferred docs with a
  field subid?
 
  I considered this approach and dismissed it due to
 the
  actual list of preferred ids changing so
 frequently
  (every 10 mins...ish) but maybe I was a little
 hasty
  in doing so. I will investigate the overhead in
  updating all docs in the index each time my list
  refreshes. I had assumed it was too prohibitive
 but I
  know what they say about assumptions :)
 
 
 Lots of overhead. There's really no capability of
 updating a doc in place.
 This has been on several people's wish-list. You'd
 have to delete every doc
 that you wanted to change and re-add it. I don't
 know how many documents
 this would be; if just a few it'd be OK. But if
 many... I was assuming (and
 I *do* know what they say about assumptions <g>)
 that you were just adding
 to your preferred doc list every few minutes, not
 changing existing
 documents.
 
 It really does sound like you want a filter. I was
 pleasantly surprised by
 how very quickly a filters are built. You could use
 a CachingWrapperFilter
 to have the filter kept around automatically (I
 guess you'd only have one
 per index update) to minimize your overhead for
 building filters, and
 perhaps warm up your cache by firing a canned query
 at your searcher when
 you re-open your IndexReader after index update. I
 think you'd have to do
 the two-query thing in this case. If you wanted to
 really get exotic, you
 could build your filter when you created your index
 and store it in a *very
 special document* and just read it in the first time
 you needed it. Although
 I've never used it, I guess you can store binary
 data. From the Javadoc
 

 Field(String name, byte[] value, Field.Store store)
   Create a stored field with binary value.
 
 The only thing here is that the filters (probably
 wrapped in a
 ConstantScoreQuery) lose relevance, but since you're
 sorting one of several
 ways, that probably doesn't matter.
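 (A minimal sketch of that combination, not from the original mail; the preferred field name and value are made up, and the cached filter would be rebuilt whenever the index is reopened:)

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;

public class PreferredSearch {

    // build once and keep around; CachingWrapperFilter caches the bits per reader
    private final Filter preferredFilter = new CachingWrapperFilter(
            new QueryFilter(new TermQuery(new Term("preferred", "true"))));

    public Hits search(IndexSearcher searcher, Query userQuery, Sort sort)
            throws IOException {
        // relevance is replaced by the explicit sort, so losing scores doesn't matter
        return searcher.search(userQuery, preferredFilter, sort);
    }
}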
 
 Best
 Erick
 
 
 
 Should I be able to make this workable, the beauty
 of
  this solution would be that I would actually only
 need
  to query once. If I had a field which indicates
  whether it is a preferred doc or not, all I will
  have to do is sort across the two fields.
 
  Thanks again Erick. Any other suggestions are most
  welcome.
 
  Regards,
  Paul
 
  --- Erick Erickson [EMAIL PROTECTED]
 wrote:
 
   OK, a really off the top of my head response,
 but
   what the heck
  
   I'm not sure you need to worry about filters.
 Would
   it work for you to index
   the documents in your preferred list with a 
 field
   (called, at the limit of
   my creativity, preferredsubid G) and index
 your
   non-preferred docs with a
   field subid? You'd still have to fire two
 queries,
   one on subid (to pick up
   the ones in your non-preferred list) and one on
   preferredsubid.
  
   Since there's no requirement that all docs have
 the
   same fields, your
   preferred docs could have ONLY the
 preferredsubid
   field and your
   non-preferred docs ONLY the subid field. That
 way
   you wouldn't have to worry
   about picking the docs up twice.
  
   Merging should be simple then, just iterate over
   however many hits you want
   in your preferredHits object, then tack on
 however
   many you want from your
   nonPreferredHits object. All the code for the
 two
   queries would be
   identical, the only difference being whether you
   specify subid or
   preferredsubid..
  
   I can imagine several variations on this
 scenario,
   but they depend on your
   problem space.
  
   Whether this is the best or not, I leave as an
   exercise for the reader.
  
   Best
   Erick
  
   On 9/25/06, Paul Lynch [EMAIL PROTECTED]
 wrote:
   
Hi All,
   
I have an index containing documents which all
   have a
field called SubId which holds the ID of the
Subscriber that submitted the data. This field
 is
STORED and UN_TOKENIZED
   
When I am querying the index, the user can
 cloose
   a
number of different ways to sort the Hits. The
   problem
is that I have a list of SubIds that should
 appear
   at
the top of the results list regardless of how
 the

Re: Where to find drill-down examples (source code)

2006-09-26 Thread djd0383

Is there a link to a zip file where I can get the entire package of source
files (version 2, please)?  I know I am able to view them in the Source
Repository (http://svn.apache.org/viewvc/lucene/java/trunk/), but I do not
really feel like going through each of those to download them all.  I am
looking for a one-stop shop here.



Miles Barr-3 wrote:
 
 Martin Braun wrote:
 
I want to realize a drill-down Function aka narrow search aka refine
search.

I want to have something like:

Refine by Date:
* 1990-2000 (30 Docs)
* 2001-2003 (200 Docs)
* 2004-2006 (10 Docs)

But not only DateRanges but also for other Categories.

What I have found in the List-Archives so far is that I have to  use
Filters for my search.

Does anybody knows where to find some Source Code, to get an Idea how to
implement this?
 I think that's a useful property for a search engine, so are there any
contributions for Lucene for that?

 
 If you want to do a refined search I'd put the original query in a 
 QueryFilter, which filters on the new search.
 
 http://lucene.apache.org/java/docs/api/org/apache/lucene/search/QueryFilter.html
 
 e.g.
 
 Query original = // saved from the last time the search was executed
 QueryFilter filter = new QueryFilter(original);
 
 QueryParser parser = ...
 Searcher searcher = ...
 
 String userQuery;
 Query query = parser.parse(userQuery);
 
 Hits hits = searcher.search(query, filter);
 
 
 Fill in the blanks with however you normally get your QueryParser and 
 IndexSearcher. You could store the old query on the session, or 
 somewhere else.
 
 Then the QueryFilter will ensure you're doing a refinement, but won't 
 affect the scoring in the new search.
 
 
 Alternatively, since you appear to only want to refine on dates and 
 categories, you might want to put them in filters so they don't affect 
 the score, and leave the query as is. In which case you can use a 
 RangeQuery for the dates, and a wrap a TermQuery in a QueryFilter to 
 handle the categories.
 
 If you need multiple filters you can use the ChainedFilter class.
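 (As a rough sketch of that combination -- not part of Miles' original mail -- using ChainedFilter from contrib/miscellaneous (org.apache.lucene.misc) and made-up field names:)
 
  // date range and category as filters, chained with AND, so neither affects scoring
  Filter dateFilter = new RangeFilter("date", "20040101", "20061231", true, true);
  Filter categoryFilter = new QueryFilter(new TermQuery(new Term("category", "books")));
  Filter drillDown = new ChainedFilter(new Filter[] { dateFilter, categoryFilter },
                                       ChainedFilter.AND);
 
  Hits hits = searcher.search(query, drillDown);   // searcher and query as above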
 
 
 
 
 Miles
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Where-to-find-drill-down-examples-%28source-code%29-tf1980330.html#a6512411
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: spell checker with lucene

2006-09-26 Thread Bill Taylor

On Sep 26, 2006, at 8:50 AM, Bhavin Pandya wrote:


Hi,
Do anybody have idea for spell checker in java.
I want to use with lucene...but which must work well for phrases 
also...

-Bhavin pandya


When I googled "java spell check open source" I found

http://jazzy.sourceforge.net/

I have looked at it.

Are you thinking of doing a spell check on the queries people type?  It 
might be better simply to check each word and see if it is found in the 
index.  That will be a lot less work than adapting the spell checker to 
Lucene.
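(That check can be as simple as asking the index for the word's document frequency; a tiny sketch, with a made-up field name:)

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class IndexSpellCheck {
    // a query word counts as "known" if it occurs in the indexed field at all
    public static boolean knownWord(IndexReader reader, String word) throws IOException {
        return reader.docFreq(new Term("contents", word)) > 0;
    }
}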


Bill Taylor


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



cache persistent Hits

2006-09-26 Thread Gaston

Hi,

Lucene itself has a volatile caching mechanism provided by a WeakHashMap.
Is there a possibility to serialize the Hits object? I am thinking of
a HashMap that caches the first 100 results for each query. Is
it possible to implement such a feature, or is there such an extension?
My problem is that searching in my application, with an index of
212MB, takes too much time, even though I set the boolean operator
from OR to AND.


I am happy about every suggestion.

Greetings

Gaston.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Where to find drill-down examples (source code)

2006-09-26 Thread Simon Willnauer

Either you grab the nearest svn client and check out the 2.0 branch, or you
just download the source dist from a mirror. Use this one:
http://mirrorspace.org/apache/lucene/java/

best regards simon

On 9/26/06, djd0383 [EMAIL PROTECTED] wrote:


Is there a link to a zip file where I can get the entire package of source
files (version 2, please)?  I know I am able to view them in the Source
Repository (http://svn.apache.org/viewvc/lucene/java/trunk/), but I do not
really feel like going through each of those to download them all.  I am
looking for a one-stop shop here.



Miles Barr-3 wrote:

 Martin Braun wrote:

I want to realize a drill-down Function aka narrow search aka refine
search.

I want to have something like:

Refine by Date:
* 1990-2000 (30 Docs)
* 2001-2003 (200 Docs)
* 2004-2006 (10 Docs)

But not only DateRanges but also for other Categories.

What I have found in the List-Archives so far is that I have to  use
Filters for my search.

Does anybody knows where to find some Source Code, to get an Idea how to
implement this?
 I think that's a useful property for a search engine, so are there any
contributions for Lucene for that?


 If you want to do a refined search I'd put the original query in a
 QueryFilter, which filters on the new search.

 
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/QueryFilter.html

 e.g.

 Query original = // saved from the last time the search was executed
 QueryFilter filter = new QueryFilter(original);

 QueryParser parser = ...
 Searcher searcher = ...

 String userQuery;
 Query query = parser.parse(userQuery);

 Hits hits = searcher.search(query, filter);


 Fill in the blanks with however you normally get your QueryParser and
 IndexSearcher. You could store the old query on the session, or
 somewhere else.

 Then the QueryFilter will ensure you're doing a refinement, but won't
 affect the scoring in the new search.


 Alternatively, since you appear to only want to refine on dates and
 categories, you might want to put them in filters so they don't affect
 the score, and leave the query as is. In which case you can use a
 RangeQuery for the dates, and a wrap a TermQuery in a QueryFilter to
 handle the categories.

 If you need multiple filters you can use the ChainedFilter class.




 Miles



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




--
View this message in context: 
http://www.nabble.com/Where-to-find-drill-down-examples-%28source-code%29-tf1980330.html#a6512411
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: searching for the part of a term.

2006-09-26 Thread Chris Hostetter

: Since the overhead in first is the speed of the system, i think adopting
: second method will be better.
:
: Is there any other solution for this problem?? Am i going in right
: direction??

you're definitely on the right path -- those are the two big solutions I
can think of. Which approach you should take really depends on the nature
of your data, what your performance concerns are, and how much development
time you have.

Here's another good thread you may want to check out...

http://www.nabble.com/I-just-don%27t-get-wildcards-at-all.-tf1412243.html#a3804223


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: cache persistent Hits

2006-09-26 Thread Erick Erickson

Well, my index is over 1.4G, and others are reporting very large indexes in
the 10s of gigabytes. So I suspect your index size isn't the issue. I'd be
very, very, very surprised if it was.

Three things spring immediately to mind.

First, opening an IndexSearcher is a slow operation. Are you opening a new
IndexSearcher for each query? If so, don't <g>. You can re-use the same
searcher across threads without fear and you should *definitely* keep it
open between queries.

Second, your query could just be very, very interesting. It would be more
helpful if you posted an example of the code where you take your timings
(including opening the IndexSearcher).

Third, if you're using a Hits object to iterate over many documents, be
aware that it re-executes the query every hundred results or so. You want to
use one of the  HitCollector/TopDocs/TopDocsCollector classes if you are
iterating over all the returned documents. And you really *don't* want to do
an IndexReader.doc(doc#) or Searcher.doc(doc#) on every document.
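(A small sketch of that pattern, not from the original mail: fetch only the top startPoint+pageSize doc ids via TopDocs and load stored fields just for the page being shown; the method, parameter, and field names are made up:)

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class PagedSearch {
    public static void printPage(IndexSearcher searcher, Query query,
                                 int startPoint, int pageSize) throws IOException {
        // ask only for as many hits as the page needs; no Hits re-execution
        TopDocs top = searcher.search(query, null, startPoint + pageSize);
        ScoreDoc[] docs = top.scoreDocs;
        int end = Math.min(docs.length, startPoint + pageSize);
        for (int i = startPoint; i < end; i++) {
            Document doc = searcher.doc(docs[i].doc);   // stored fields for this page only
            System.out.println(doc.get("url"));
        }
        System.out.println(top.totalHits + " total matches");
    }
}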

If none of this helps, please post some code fragments and I'm sure others
will chime in.

Best
Erick

On 9/26/06, Gaston [EMAIL PROTECTED] wrote:


Hi,

Lucene has itself  volatile caching mechanism provided by a weak
HashMap. Is there a possibilty to serialize the Hits Object? I think of
a HashMap that for each found result, caches the first 100 results. Is
it possible to implement such a feature or is there such an extension?
My problem is that the searching of my application with an index with
the size of 212MB takes to much time, despite I set the BooleanOperator
from OR to AND

I am happy about every suggestion.

Greetings

Gaston.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re[3]: how to enhance speed of sorted search

2006-09-26 Thread Chris Hostetter

: I am thinking should be this faster

The ConstantScoreQuery wrapped around the QueryFilter might in fact be
faster than the raw query -- have you tried it to see?

You might be able to shave a little bit of speed off by accessing the bits
from the Filter directly and iterating over them yourself to check the
FieldCache and build up your sorted list of the first N -- I think that
would save you one method call per match (the score method of
ConstantScoreQuery).
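(Roughly, that would look like the sketch below; the sort field name is made up and the top-N bookkeeping is left out:)

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;

public class FilterBitsExample {
    public static void walkMatches(IndexReader reader, Query query) throws IOException {
        BitSet bits = new QueryFilter(query).bits(reader);
        int[] sortKeys = FieldCache.DEFAULT.getInts(reader, "sortField");
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            int key = sortKeys[doc];
            // keep the N smallest (or largest) keys seen so far,
            // e.g. in a bounded priority queue, instead of scoring anything
        }
    }
}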

At some point you just have to wonder: is it fast enough?

How long does a typical sorted query take for you right now?

How many documents are in your index?

How many matches do you typically have?

-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Very high fieldNorm for a field resulting in bad results

2006-09-26 Thread Chris Hostetter

: The symptom:
: Very high fieldNorm for field A (explain output pasted below). The boost I am
: applying to the troublesome field is 3.5 and the max boost applied per doc is
: 1.8.
: Given that information, the very high fieldNorm is very surprising to me.
: Based on what I read, fieldNorm = 1 / sqrt(number of terms), possibly
: multiplied by field boost values.

The value of the field norm for any field named A is typically the
lengthNorm of the field, times the document boost, times the field boost
for *each* Field instance added to the document with the name A
(lengthNorm is by default 1/sqrt(number of terms)).

so in your situation...

: for (Collection of values){
:  Field thisField = new Field(fieldName, value, fieldConfig.STORED,
: fieldConfig.INDEXED);
:  thisField.setBoost(fieldConfig);
: doc.add(thisField);

the fieldNorm for A is going to be the fieldConfig boost applied once per
value (i.e. fieldConfig^values.size()), times any document boost you didn't
mention using, times the length norm.

: which should basically lead to the values being appended,
: Am i making a mistake in the way I am adding fields ?

the way you are adding fields is the proper way to deal with multi-value
fields in my opinion, but it may be leading to more boost then you
intended, in which case only boosting the first Field may be the way to
go.

Another aspect of this to keep in mind is that since fieldNorms are
stored as a single-byte encoded float, some precision is lost ... the
byte encoding for the norms is targeted at smaller values, so with really
big norms you might find the problem exacerbated by the rounding.

Play around with your boost values -- you can use indexReader.norms("A")
along with Similarity.decodeNorm to see what norm values your various
documents are getting as you tweak your numbers.
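(For example, a small sketch of dumping the decoded norms for field A; the index path handling is illustrative:)

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

public class NormInspector {
    public static void dumpNorms(String indexDir) throws IOException {
        IndexReader reader = IndexReader.open(indexDir);
        try {
            byte[] norms = reader.norms("A");   // one encoded byte per document
            for (int doc = 0; doc < reader.maxDoc(); doc++) {
                if (!reader.isDeleted(doc)) {
                    System.out.println("doc " + doc + " norm="
                            + Similarity.decodeNorm(norms[doc]));
                }
            }
        } finally {
            reader.close();
        }
    }
}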



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: cache persistent Hits

2006-09-26 Thread Gaston

hi,

first thank you for the fast reply.

I use a MultiSearcher that opens 3 indexes, so this makes the whole 
operation surely slower, but 20 seconds for 5260 results out of a 212MB 
index is much too slow.

Another reason can of course be my ISP.

Here is my code:

   IndexSearcher[] searchers;
   searchers=new IndexSearcher[3];
   String path="/home/sn/public_html/";
   searchers[0]=new IndexSearcher(path+"index1");
   searchers[1]=new IndexSearcher(path+"index2");
   searchers[2]=new IndexSearcher(path+"index3");
   MultiSearcher searcher=new MultiSearcher(searchers);
   QueryParser parser=new QueryParser("content",new StandardAnalyzer());

   parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

   Query query=parser.parse("urlName:"+userInput+" OR content:"+userInput);

   Hits hits=searcher.search(query);

   for(int i=0;i<hits.length();i++)
   {
       Document doc=hits.doc(i);
   }

   // Output only 10 results per page

   for(int i=startPoint;i<startPoint+10;i++)
   {
       Document doc=hits.doc(i);

       out.println(escapeHTML(doc.get("description"))+"<p>");
       out.println("<a href=\""+doc.get("url")+"\">"+doc.get("url").substring(7)+"</a>");
       out.println("<p><p><p>");
   }


Perhaps somebody sees the reason why it is so slow.

Thank you in advance

Greetings Gaston




Erick Erickson schrieb:

Well, my index is over 1.4G, and others are reporting very large 
indexes in
the 10s of gigabytes. So I suspect your index size isn't the issue. 
I'd be

very, very, very surprised if it was.

Three things spring immediately to mind.

First, opening an IndexSearcher is a slow operation. Are you opening a 
new

IndexSearcher for each query? If so, don't G. You can re-use the same
searcher across threads without fear and you should *definitely* keep it
open between queries.

Second, your query could just be very, very interesting. It would be more
helpful if you posted an example of the code where you take your timings
(including opening the IndexSearcher).

Third, if you're using a Hits object to iterate over many documents, be
aware that it re-executes the query every hundred results or so. You 
want to

use one of the  HitCollector/TopDocs/TopDocsCollector classes if you are
iterating over all the returned documents. And you really *don't* want 
to do

an IndexReader.doc(doc#) or Searcher.doc(doc#) on every document.

If none of this helps, please post some code fragments and I'm sure 
others

will chime in.

Best
Erick

On 9/26/06, Gaston [EMAIL PROTECTED] wrote:



Hi,

Lucene has itself  volatile caching mechanism provided by a weak
HashMap. Is there a possibilty to serialize the Hits Object? I think of
a HashMap that for each found result, caches the first 100 results. Is
it possible to implement such a feature or is there such an extension?
My problem is that the searching of my application with an index with
the size of 212MB takes to much time, despite I set the BooleanOperator
from OR to AND

I am happy about every suggestion.

Greetings

Gaston.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene In Action Book vs Lucene 2.0

2006-09-26 Thread KEGan

Hi,

I bought the Lucene in Action book more than a year ago, and was
using Lucene 1.x during that time. Now I have a new project with Lucene, and
Lucene is now 2.0. Many APIs seem to have changed.

I would like to ask the experts here: what are the important or substantial
changes from Lucene 1.x to 2.0? Which parts of the LIA book are still
usable and which are not? Are there any particular things that a new Lucene
2.0 user who has only used the 1.x version should pay attention to?

Thanks.


Re: cache persistent Hits

2006-09-26 Thread Erick Erickson

See below.

On 9/26/06, Gaston [EMAIL PROTECTED] wrote:


hi,

first thank you for the fast reply.

I use MultiSearcher that opens 3 indexes, so this makes the whole
operation surly slower, but 20seconds for 5260 results out of an 212MB
index  is  much too slow.
Another reason can of course be my ISP.

Here is my code:

IndexSearcher[] searchers;
searchers=new IndexSearcher[3];
String path="/home/sn/public_html/";
searchers[0]=new IndexSearcher(path+"index1");
searchers[1]=new IndexSearcher(path+"index2");
searchers[2]=new IndexSearcher(path+"index3");
MultiSearcher searcher=new MultiSearcher(searchers);




Above you've opened the searcher for each search, exactly as I feared. This
is a major hit. Don't do this, but keep the searchers open between calls.
You can demonstrate this to yourself by returning time intervals in your
HTML page. Take one timestamp right here, one after a new dummy query that
you make up and hard-code, and one after the real query you already have
below. Return them all in your HTML page and take a look. I think you'll see
that the first query takes a while, and the second is very fast. And don't
iterate over all the hits (more below).


   QueryParser parser=new QueryParser("content",new StandardAnalyzer());
parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

Query query=parser.parse("urlName:"+userInput+" OR content:"+userInput);

Hits hits=searcher.search(query);

for(int i=0;i<hits.length();i++)
{
    Document doc=hits.doc(i);
}



What is the purpose of the iteration above? It does nothing except waste time.
I'd just remove it (unless there's something else you're doing here that you
left out). If you're trying to get to the startPoint below, there's no
reason to iterate above; just go directly to the loop below. For 5000 hits,
you're repeating the search 50 times or so, as has been discussed in these
archives repeatedly. See my previous mail.


  // Output only 10 results per page

for(int i=startPoint;i<startPoint+10;i++)
{
    Document doc=hits.doc(i);

    out.println(escapeHTML(doc.get("description"))+"<p>");
    out.println("<a href=\""+doc.get("url")+"\">"+doc.get("url").substring(7)+"</a>");
    out.println("<p><p><p>");
}

Perhaps somebody see the reason why it is so slow.

Thank you in advance

Greetings Gaston



I'm assuming that your ISP comment is just where you're getting your page
from, and that your searchers and indexes are at least on the same network
and NOT separated by the web, as that would be slow and hard to fix.

To get a sense of where you're really spending your time, I'd actually get
the system time at various points in the process and send the *times* back
in your HTML page. That'll give you a much better sense of where you're
actually spending time. You can't really tell anything by measuring how long
it takes to get your HTML page back; you've *got* to measure at discrete
points in the code and return those.

5,000+ results should not be taking 20 seconds. I strongly suspect that the
fact that you're opening your searchers every time and uselessly iterating
through all the hits is the culprit. If I remember correctly, and you have
5,000 documents, you're executing the query about 50 times when you iterate
through all the hits. Under the covers, Hits is optimized for about 100
results. As you iterate through, each next 100 re-executes the query. You
could search the mail archive for this topic, maybe for "hits slow" or some
such, for a fuller explanation.

Hope this helps
Erick


Erick Erickson schrieb:


 Well, my index is over 1.4G, and others are reporting very large
 indexes in
 the 10s of gigabytes. So I suspect your index size isn't the issue.
 I'd be
 very, very, very surprised if it was.

 Three things spring immediately to mind.

 First, opening an IndexSearcher is a slow operation. Are you opening a
 new
 IndexSearcher for each query? If so, don't G. You can re-use the same
 searcher across threads without fear and you should *definitely* keep it
 open between queries.

 Second, your query could just be very, very interesting. It would be
more
 helpful if you posted an example of the code where you take your timings
 (including opening the IndexSearcher).

 Third, if you're using a Hits object to iterate over many documents, be
 aware that it re-executes the query every hundred results or so. You
 want to
 use one of the  HitCollector/TopDocs/TopDocsCollector classes if you are
 iterating over all the returned documents. And you really *don't* want
 to do
 an IndexReader.doc(doc#) or Searcher.doc(doc#) on every document.

 If none of this helps, please post some code fragments and I'm sure
 others
 will chime in.

 Best
 Erick

 On 9/26/06, Gaston [EMAIL PROTECTED] wrote:


 Hi,

 Lucene has itself  volatile caching mechanism provided 

spell checker

2006-09-26 Thread Chris Salem
Does anyone have sample code on how to build a dictionary?

I found this article online, but it uses version 1.4.3 and it doesn't seem 
to work on 2.0.0:
http://today.java.net/pub/a/today/2005/08/09/didyoumean.html?page=1

Here's the code I have:

indexReader = IndexReader.open(originalIndexDirectory);
Dictionary dictionary = new LuceneDictionary(indexReader, "experience_desired");
SpellChecker spellChckr = new SpellChecker(spellIndexDirectory);
spellChckr.indexDictionary(dictionary);

I'm getting a null pointer exception when I call indexDictionary().
Here's how I index the field experience_desired:

doc.add(new Field("experience_desired", value, Field.Store.NO,
Field.Index.TOKENIZED));

Is there another way I should do it so there is a way to build a dictionary on
that field?

Thanks

Chris Salem
440.946.5214 x5458
[EMAIL PROTECTED] 





Re: cache persistent Hits

2006-09-26 Thread Gaston

Hi Erick,

the problem was this piece of code I don't need anymore.

for(int i=0;i<hits.length();i++)
{
    Document doc=hits.doc(i);
}

Now it is very fast. Thank you very much for your detailed email.

Here is my application, that still is in development phase.
http://www.suchste.de

Greetings Gaston

P.S. The search for 'web' delivers over 5000 hits...


Erick Erickson schrieb:


See below.

On 9/26/06, Gaston [EMAIL PROTECTED] wrote:



hi,

first thank you for the fast reply.

I use MultiSearcher that opens 3 indexes, so this makes the whole
operation surly slower, but 20seconds for 5260 results out of an 212MB
index  is  much too slow.
Another reason can of course be my ISP.

Here is my code:

IndexSearcher[] searchers;
searchers=new IndexSearcher[3];
String path=/home/sn/public_html/;
searchers[0]=new IndexSearcher(path+index1);
searchers[1]=new IndexSearcher(path+index2);
searchers[2]=new IndexSearcher(path+index3);
MultiSearcher saercher=new MultiSearcher(searchers);





Above you've opened the searcher for each search, exactly as I feared. 
This

is a major hit. Don't do this, but keep the searchers open between calls.
You can demonstrate this to yourself by returning time intervals in your
HTML page. Take one timestamp right here, one after a new dummy query 
that
you make up and hard-code, and one after the real query you already 
have
below. Return them all in your HTML page and take a look. I think 
you'll see
that the first query takes a while, and the second is very fast. And 
don't

iterate over all the hits (more below).


   QueryParser parser=new QueryParser(content,new


StandardAnalyzer());
parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

Query query=parser.parse(urlName:+userInput+ OR
+content:+userInput);

Hits hits=searcher.search(query);

for(int i=0;ihits.length();i++)
{

Document doc=hits.doc(i);


}




what is the purpose of iteration above? This does nothing except waste 
time.
I'd just remove it (unless there's something else you're doing here 
that you
left out). If you're trying to get to the startPoint below, well, 
there's no
reason to iterate above, just to directly to the loop below. For 5000 
hits,
you're repeating the search 50 times or so, as has been discussed in 
these

archives repeatedly. See my previous mail.


  // Outprint only 10 results per page



for(int i=startPoint;istartPoint+10;i++)
{

Document doc=hits.doc(i);


out.println(escapeHTML(doc.get(description))+p);

out.println(a
href=+doc.get(url)++doc.get(url).substring(7)+/a);
out.println(ppp);

}

Perhaps somebody see the reason why it is so slow.

Thank you in advance

Greetings Gaston




I'm assuming that your ISP comment is just where you're getting your page
from, and that your searchers and indexes are at least on the same 
network

and NOT separated by the web, as that would be slow and hard to fix.

To get a sense of where you're really spending your time, I'd actually 
get
the system time at various points in the process and send the *times* 
back

in your HTML page. That'll give you a much better sense of where you're
actually spending time. You can't really tell anything by measuring 
now long

it takes to get your HTML page back, you've *got* to measure at discreet
points in the code and return those.

5,000+ results should not be taking 20 seconds. I strongly suspect 
that the
fact that you're opening your searchers every time and uselessly 
iterating
through all the hits is the culprit. If I remember correctly, and you 
have
5,000 documents, you're executing the query about 50 times when you 
iterate

through all the hits. Under the covers, Hits is optimized for about 100
results. As you iterate through, each next 100 re-executes the 
query. You
could search the mail archive for this topic, maybe hits slow or 
some such

for greater explications.

Hope this helps
Erick


Erick Erickson schrieb:



 Well, my index is over 1.4G, and others are reporting very large
 indexes in
 the 10s of gigabytes. So I suspect your index size isn't the issue.
 I'd be
 very, very, very surprised if it was.

 Three things spring immediately to mind.

 First, opening an IndexSearcher is a slow operation. Are you opening a
 new
 IndexSearcher for each query? If so, don't G. You can re-use the 
same
 searcher across threads without fear and you should *definitely* 
keep it

 open between queries.

 Second, your query could just be very, very interesting. It would be
more
 helpful if you posted an example of the code where you take your 
timings

 (including opening the IndexSearcher).

 Third, if you're using a Hits object to iterate over many 
documents, be

 aware that it re-executes the query every hundred results or 

term OR term OR term OR .... query question

2006-09-26 Thread Vladimir Olenin
Hi.
 
I have a question regarding the Lucene scoring algorithm. Given that I have a
query "a OR b OR c OR d OR e OR f" and two documents, doc1 "a b c d"
and doc2 "d e", will doc1 score higher than doc2? In other words, does
Lucene take into account the number of terms matched in the document in
the case of an 'or' query?

Given that I don't know the algorithms behind Lucene, how does
'or' query time depend on the number of searched terms? Does it grow
linearly, exponentially? How does 'and' query time depend on the number
of searched terms? (it should decrease, right?)
 
Thanks.
 
Vlad


Re: Re[2]: how to enhance speed of sorted search

2006-09-26 Thread karl wettin

On 9/26/06, Chris Hostetter [EMAIL PROTECTED] wrote:


if you are seeing slow performance from sorted searches,
the time spent scoring the results isn't the biggest contributor to how
long the search takes -- it tends to be negligible for most queries.


I've many times wished for a visiting score mechanism of some kind.
Turn it off and save CPU, remove floating points, or even hide a
global sort order in the norms.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



how to get results without getting total number of found documents?

2006-09-26 Thread Vladimir Olenin
Hi.
 
I couldn't find the answer to this question in the mailing list archive.
In case I missed it, please let me know the keyword phrase I should be
looking for, if not a direct link.
 
All the 'Lucene' powered implementations I saw (well, primarily those
utilizing Solr) return exact count of the number of documents found. It
means that the query is resolved across the whole data set in precise
fashion. If the number of searched documents is huge (e.g., > 1 billion),
this should present quite a problem. I wonder if that's the default
behaviour of Lucene or rather the frameworks that utilize it? Is it
possible to:
 
- get the top 1000 results WITHOUT executing query across whole data set
- in other words, can Lucene:
  - chunk out top X results by 'approximate' fast search, which will
return _approximate_ total number of found documents, similar to
'Google' total pages found count
  - and perform more accurate search within that chunk
 
Is such functionality built in, or does it have to be customized? If it's
built-in, what algorithms are used to 'chunk out' the results and get
approximate docs count? What classes should I look at?
 
Thanks!
 
Vlad
 
PS: it's pretty much the functionality Google has - you can't get more
than 1000 matches per query (meaning, you can get even '10M' documents
found, but if you'll try to browse beyond '1000' results, you'll get an
error page).


Re: Re[2]: how to enhance speed of sorted search

2006-09-26 Thread eks dev
Paul's Matcher in Jira will almost enable this - indirectly, but it is possible.

- Original Message 
From: karl wettin [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, 26 September, 2006 11:30:24 PM
Subject: Re: Re[2]: how to enhance speed of sorted search

On 9/26/06, Chris Hostetter [EMAIL PROTECTED] wrote:

 if you are seeing slow performance from sorted searches,
 the time spent scoring the results isn't the biggest contributor to how
 long thesearch takes -- it tends to be negligable for most queries.

I've many times wished for a visiting score mechanism of some kind.
Turn it off and save CPU, remove floating points, or even hide a
global sort order in the norms.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: how to get results without getting total number of found documents?

2006-09-26 Thread markharw00d

- get the top 1000 results WITHOUT executing query across whole data set

(Apologies if this is telling you something you are already fully aware of.) 
- Counting matches doesn't involve scanning the text of all the docs so 
may be less expensive than you think for a single index. It very quickly 
looks up and ranks only the docs containing your search terms so a total 
match count is not an expensive by-product of this operation - see a 
description of inverted indexes for more details: 
http://en.wikipedia.org/wiki/Inverted_index


If you're aware of all that and considering larger scale problems 
(billions of docs) where multiple machines/indexes must be queried in 
parallel things are more complex. The cost of combining result scores 
from multiple machines is typically why you can't page beyond 1000 
results. Some of these large distributed  architectures will divide 
content into popular/recent content and older/less popular content. 
Approximations for total number of matching docs are calculated based on 
queries executed solely on the subset of popular stuff. Only queries 
with insufficient matches in popular content will resort to querying the 
older stuff.


Cheers
Mark


Vladimir Olenin wrote:

Hi.
 
I couldn't find the answer to this question in the mailing list archive.

In case I missed it, please let me know the keyword phrase I should be
looking for, if not a direct link.
 
All the 'Lucene' powered implementations I saw (well, primarily those

utilizing Solr) return exact count of the number of documents found. It
means that the query is resolved across the whole data set in precise
fashion. If the number of searched documents is huge (eg,  1billion),
this should present quite a problem. I wonder if that's the default
behaviour of Lucene or rather the frameworks that utilize it? Is it
possible to:
 
- get the top 1000 results WITHOUT executing query across whole data set

- in other words, can Lucene:
  - chunk out top X results by 'approximate' fast search, which will
return _approximate_ total number of found documents, similar to
'Google' total pages found count
  - and perform more accurate search within that chunk
 
Is such functionality built in or it has be customized? If it's

built-in, what algorithms are used to 'chunk out' the results and get
approximate docs count? What classes should I look at?
 
Thanks!
 
Vlad
 
PS: it's pretty much the functionality Google has - you can't get more

than 1000 matches per query (meaning, you can get even '10M' documents
found, but if you'll try to browse beyond '1000' results, you'll get an
error page).

  








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: cache persistent Hits

2006-09-26 Thread Erick Erickson

Glad I could help. I don't read a word of German, but even I could see the
227 milliseconds at the bottom <g>.

Glad things are working for you.
Erick

On 9/26/06, Gaston [EMAIL PROTECTED] wrote:


Hi Erick,

the problem was this piece of code I don't need anymore.

for(int i=0;ihits.length();i++)
{

  Document doc=hits.doc(i);


  }

Now it is very fast, thank you very much for your email that is written
in detail.
Here is my application, that still is in development phase.
http://www.suchste.de

Greetings Gaston

P.S. The search for 'web' delivers over 5000 hits...


Erick Erickson schrieb:

 See below.

 On 9/26/06, Gaston [EMAIL PROTECTED] wrote:


 hi,

 first thank you for the fast reply.

 I use MultiSearcher that opens 3 indexes, so this makes the whole
 operation surly slower, but 20seconds for 5260 results out of an 212MB
 index  is  much too slow.
 Another reason can of course be my ISP.

 Here is my code:

 IndexSearcher[] searchers;
 searchers=new IndexSearcher[3];
 String path=/home/sn/public_html/;
 searchers[0]=new IndexSearcher(path+index1);
 searchers[1]=new IndexSearcher(path+index2);
 searchers[2]=new IndexSearcher(path+index3);
 MultiSearcher saercher=new MultiSearcher(searchers);




 Above you've opened the searcher for each search, exactly as I feared.
 This
 is a major hit. Don't do this, but keep the searchers open between
calls.
 You can demonstrate this to yourself by returning time intervals in your
 HTML page. Take one timestamp right here, one after a new dummy query
 that
 you make up and hard-code, and one after the real query you already
 have
 below. Return them all in your HTML page and take a look. I think
 you'll see
 that the first query takes a while, and the second is very fast. And
 don't
 iterate over all the hits (more below).


QueryParser parser=new QueryParser(content,new

 StandardAnalyzer());
 parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

 Query query=parser.parse(urlName:+userInput+ OR
 +content:+userInput);

 Hits hits=searcher.search(query);

 for(int i=0;ihits.length();i++)
 {

 Document doc=hits.doc(i);


 }



 what is the purpose of iteration above? This does nothing except waste
 time.
 I'd just remove it (unless there's something else you're doing here
 that you
 left out). If you're trying to get to the startPoint below, well,
 there's no
 reason to iterate above, just to directly to the loop below. For 5000
 hits,
 you're repeating the search 50 times or so, as has been discussed in
 these
 archives repeatedly. See my previous mail.


   // Outprint only 10 results per page


 for(int i=startPoint;istartPoint+10;i++)
 {

 Document doc=hits.doc(i);


 out.println(escapeHTML(doc.get(description))+p);
 out.println(a
 href=+doc.get(url)++doc.get(url).substring(7)+/a);
 out.println(ppp);

 }

 Perhaps somebody see the reason why it is so slow.

 Thank you in advance

 Greetings Gaston



 I'm assuming that your ISP comment is just where you're getting your
page
 from, and that your searchers and indexes are at least on the same
 network
 and NOT separated by the web, as that would be slow and hard to fix.

 To get a sense of where you're really spending your time, I'd actually
 get
 the system time at various points in the process and send the *times*
 back
 in your HTML page. That'll give you a much better sense of where you're
 actually spending time. You can't really tell anything by measuring
 now long
 it takes to get your HTML page back, you've *got* to measure at discreet
 points in the code and return those.

 5,000+ results should not be taking 20 seconds. I strongly suspect
 that the
 fact that you're opening your searchers every time and uselessly
 iterating
 through all the hits is the culprit. If I remember correctly, and you
 have
 5,000 documents, you're executing the query about 50 times when you
 iterate
 through all the hits. Under the covers, Hits is optimized for about 100
 results. As you iterate through, each next 100 re-executes the
 query. You
 could search the mail archive for this topic, maybe hits slow or
 some such
 for greater explications.

 Hope this helps
 Erick


 Erick Erickson schrieb:


  Well, my index is over 1.4G, and others are reporting very large
  indexes in
  the 10s of gigabytes. So I suspect your index size isn't the issue.
  I'd be
  very, very, very surprised if it was.
 
  Three things spring immediately to mind.
 
  First, opening an IndexSearcher is a slow operation. Are you opening
a
  new
  IndexSearcher for each query? If so, don't G. You can re-use the
 same
  searcher across threads without fear and you should *definitely*
 keep it
  open between queries.
 
  Second, your query could just be very, very interesting. It would be
 more
  

RE: how to get results without getting total number of found documents?

2006-09-26 Thread Vladimir Olenin
Thanks, Mark, that clears things up a bit. No need to apologise - I am
quite a novice with Lucene.

To explain my concern a bit, assume that your inverted index is queried
with 'or' query for the most 'common' terms (ie, after excluding such
denominators as 'a', 'the', etc). Let's say, you have following terms:

- 'work': occurs in 200M documents
- 'java': occurs in 100M documents
- '.net': occurs in 100M documents

Now, if I'm doing a query: 'work OR java OR .net' the total result set
should be somewhere between 200M and 400M, right? But to get the exact
number you'll actually need to make the union of ALL the document Ids,
which means you'd have to loop through 400M ids at least. For more
complex queries the cost should be higher. The sorting step should be
quite expensive for the 'whole' dataset as well. The intersect should be
cheaper because each step eliminates some number of documents. In the
implementations I saw/did in the past, the ability (or, more
correctly, inability) to create this kind of 'approximation' and chunk out
'most significant' results was the main limiting factor of all the
algorithms.

Thanks!

Vlad

PS: by 'computationally expensive', I mean 'scalability' as well - all
this operations are very CPU intensive, so if one query returns within
1-2 seconds during which time takes up 100% of CPU time, it means that
for 20 concurrent users the response time for the 'most unlucky' one
would be somewhere between 20 and 40 seconds.
 

-Original Message-
From: markharw00d [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 26, 2006 6:35 PM
To: java-user@lucene.apache.org
Subject: Re: how to get results without getting total number of found
documents?

 - get the top 1000 results WITHOUT executing query across whole data
set

(Apologies if this is telling you something you are already fully aware of.)
- Counting matches doesn't involve scanning the text of all the docs, so it
may be less expensive than you think for a single index. It very quickly
looks up and ranks only the docs containing your search terms so a total
match count is not an expensive by-product of this operation - see a
description of inverted indexes for more details: 
http://en.wikipedia.org/wiki/Inverted_index

If you're aware of all that and are considering larger-scale problems
(billions of docs) where multiple machines/indexes must be queried in
parallel, things are more complex. The cost of combining result scores
from multiple machines is typically why you can't page beyond 1000
results. Some of these large distributed  architectures will divide
content into popular/recent content and older/less popular content. 
Approximations for total number of matching docs are calculated based on
queries executed solely on the subset of popular stuff. Only queries
with insufficient matches in popular content will resort to querying the
older stuff.

Cheers
Mark


Vladimir Olenin wrote:
 Hi.
  
 I couldn't find the answer to this question in the mailing list
archive.
 In case I missed it, please let me know the keyword phrase I should be

 looking for, if not a direct link.
  
 All the 'Lucene' powered implementations I saw (well, primarily those 
 utilizing Solr) return exact count of the number of documents found. 
 It means that the query is resolved across the whole data set in 
 precise fashion. If the number of searched documents is huge (eg,  
 1billion), this should present quite a problem. I wonder if that's the

 default behaviour of Lucene or rather the frameworks that utilize it? 
 Is it possible to:
  
 - get the top 1000 results WITHOUT executing query across whole data 
 set
 - in other words, can Lucene:
   - chunk out top X results by 'approximate' fast search, which will 
 return _approximate_ total number of found documents, similar to 
 'Google' total pages found count
   - and perform more accurate search within that chunk
  
 Is such functionality built in or it has be customized? If it's 
 built-in, what algorithms are used to 'chunk out' the results and get 
 approximate docs count? What classes should I look at?
  
 Thanks!
  
 Vlad
  
 PS: it's pretty much the functionality Google has - you can't get more

 than 1000 matches per query (meaning, you can get even '10M' documents

 found, but if you try to browse beyond '1000' results, you'll get
 an error page).

   





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: how to get results without getting total number of found documents?

2006-09-26 Thread Andrzej Bialecki

Vlad,

Please check published papers on sampling inverted indexes and 
multi-level caching - this is most probably what Google and other major 
search engines use.


You can see a simple implementation of this principle in Nutch - the 
index is sorted in decreasing order by a PageRank-like score (the logic 
for this is in IndexSorter.java), and then when running a query we only 
collect top-N results, and extrapolate total numbers over the whole 
collection, assuming certain model of term distributions 
(LuceneQueryOptimizer.LimitedCollector).
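
Very roughly, the collector side of that idea looks like the sketch below
(the class and method names are mine, not Nutch's exact code):

import org.apache.lucene.search.HitCollector;

// Stop collecting once maxHits matches have been seen. Because the index is
// sorted by a PageRank-like score, the first N matches scanned are also the
// "best" N, and the total can be extrapolated from how far the scan got.
class LimitedCollector extends HitCollector {
    static class LimitExceeded extends RuntimeException {}

    private final int maxHits;
    private int count = 0;
    private int lastDoc = 0;

    LimitedCollector(int maxHits) { this.maxHits = maxHits; }

    public void collect(int doc, float score) {
        count++;
        lastDoc = doc;
        // record (doc, score) here as needed
        if (count >= maxHits) {
            throw new LimitExceeded();  // abort scanning the remaining postings
        }
    }

    int getCount() { return count; }
    int getLastDoc() { return lastDoc; }
}

// Usage: call searcher.search(query, new LimitedCollector(1000)) inside a
// try/catch for LimitExceeded, then estimate the total number of hits as
// roughly count * (reader.maxDoc() / (double) lastDoc).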


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: spell checker

2006-09-26 Thread Chris Hostetter


I've added a FAQ that may help you with this, "How do i get code written
for Lucene 1.4.x to work with Lucene 2.x?"

http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-d09fdfc8a6335eab4e3f3dc8ac41a40a3666318e



: Date: Tue, 26 Sep 2006 20:56:57 -
: From: Chris Salem [EMAIL PROTECTED]
: Reply-To: java-user@lucene.apache.org, Chris Salem [EMAIL PROTECTED]
: To: java-user@lucene.apache.org
: Subject: spell checker
:
: Does anyone have sample code on how to build a dictionary?
:
: I found this article online, but it uses version 1.4.3 and it doesn't seem
to work on 2.0.0:
http://today.java.net/pub/a/today/2005/08/09/didyoumean.html?page=1
:
: Here's the code I have:
:
: indexReader = IndexReader.open(originalIndexDirectory);
: Dictionary dictionary = new LuceneDictionary(indexReader, "experience_desired");
: SpellChecker spellChckr = new SpellChecker(spellIndexDirectory);
: spellChckr.indexDictionary(dictionary);
: I'm getting a null pointer exception when I call indexDictionary().
: Here's how I index the field experience_desired:
: doc.add(new Field("experience_desired", value, Field.Store.NO, Field.Index.TOKENIZED));
: Is there another way I should do it so there is a way to build a dictionary
on that field?
:
: Thanks
:
: Chris Salem
: 440.946.5214 x5458
: [EMAIL PROTECTED]
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Multiple Terms, Delete From Index

2006-09-26 Thread Josh Joy
Hi All,

I need to delete from the index where 2 terms are
matching, rather than 
just one term.
For example,

IndexReader reader = IndexReader.open(dir);
Term[] terms = new Term[2];
terms[0] = new Term("city", "city1");
terms[1] = new Term("state", "state1");
reader.delete(terms);
reader.close();

Any suggestions?

Thanks in advance,
Josh

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Very high fieldNorm for a field resulting in bad results

2006-09-26 Thread Mek

Thanks a lot Chris for the detailed and patient response.




The value of the field norm for any field named A is typically the
lengthNorm of the field, times the document boost, times the field boost
for *each* Field instance added to the document with the name A.
(lengthNorm is by default 1/sqrt(num of terms))


That explains the very high value for the fieldNorm. The boost value
became boost_value^(number of values in the field).
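
To make that concrete with made-up numbers: if a field is added as three
separate Field instances, each with a boost of 3.0, the document boost is 1.0,
and the field holds about 10 terms in total, the stored norm works out to
roughly

3.0 * 3.0 * 3.0 * (1 / sqrt(10)) = 27 * 0.316 ~= 8.5

compared with about 0.32 when no boost is set at all, which is how a repeated
per-Field boost ends up dominating the score.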

A couple of more questions:

1. Can I do away with index-time boosting for fields and tweak
query-time boosting for them instead? I understand that doc-level boosting is
very useful while indexing.
But for fields, both the index-time boost and the query-time boost are
multipliers that feed into the score, so would it be safe to say that I can
replace the index-time boost with query-time boosting? This allows me a lot
of freedom to test different values without re-indexing, which takes me
about 6 hours.

2. When searching through the archive I read a post by you saying
it's possible to give exact matches much higher weight by indexing
the START and END tokens
from : http://www.nabble.com/What-are-norms--tf1919250.html#a5335856
it is possible to score exact matches on (tokenized) fields very high
without using lengthNorm by indexing START and END tokens for the field as
well, and then including them in your sloppy phrase queries -- the
tighter match will score highest.

Can you please elaborate on this?

Thanks a ton for the response,
mekin
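
For what it's worth, here is a rough sketch of that START/END idea (my own
illustration, not Hoss's code; the field name and sentinel strings are made
up, and the sentinels must be tokens your analyzer keeps intact):

import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Index time: wrap the field value in sentinel tokens
// (doc is the Document being built, value the raw field text).
doc.add(new Field("title", "xxstart " + value + " xxend",
                  Field.Store.YES, Field.Index.TOKENIZED));

// Query time: a sloppy phrase that includes the sentinels. An exact match
// needs no slop at all, and sloppy-phrase scoring rewards tighter matches,
// so exact matches rise to the top without relying on lengthNorm.
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("title", "xxstart"));
pq.add(new Term("title", "foo"));   // the user's terms, already analyzed
pq.add(new Term("title", "bar"));
pq.add(new Term("title", "xxend"));
pq.setSlop(10);                     // loose enough to also match longer fields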

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multiple Terms, Delete From Index

2006-09-26 Thread Otis Gospodnetic
Heh, I have to try the obvious - two reader.delete(term) calls?

Otis

- Original Message 
From: Josh Joy [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, September 26, 2006 10:04:13 PM
Subject: Multiple Terms, Delete From Index

Hi All,

I need to delete from the index where 2 terms are
matching, rather than 
just one term.
For example,

IndexReader reader = IndexReader.open(dir);
Term[] terms = new Term[2];
terms[0] = new Term("city", "city1");
terms[1] = new Term("state", "state1");
reader.delete(terms);
reader.close();

Any suggestions?

Thanks in advance,
Josh

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]








Re: spell checker

2006-09-26 Thread Otis Gospodnetic
The code works with Lucene 2.0, I've used it.  However, it did change slightly.
If you look in JIRA you'll find some comments about it.  If I recall
correctly, some changes I made to the LuceneDictionary(?) class now require the
index directory to exist... I think.

Otis
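
For reference, a minimal sketch against Lucene 2.0 (the paths are made up,
and creating the spell-index directory up front is my reading of those JIRA
comments, so treat that part as an assumption):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

IndexReader reader = IndexReader.open("/path/to/main/index");

// Pass create=true so the spell index directory exists before SpellChecker
// tries to read from it.
Directory spellDir = FSDirectory.getDirectory("/path/to/spell/index", true);

SpellChecker spellChecker = new SpellChecker(spellDir);
spellChecker.indexDictionary(new LuceneDictionary(reader, "experience_desired"));

String[] suggestions = spellChecker.suggestSimilar("experiance", 5);
reader.close();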

- Original Message 
From: Chris Salem [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, September 26, 2006 4:56:57 PM
Subject: spell checker

Does anyone have sample code on how to build a dictionary?

I found this article online, but it uses version 1.4.3 and it doesn't seem
to work on 2.0.0:
http://today.java.net/pub/a/today/2005/08/09/didyoumean.html?page=1

Here's the code I have:

indexReader = IndexReader.open(originalIndexDirectory);
Dictionary dictionary = new LuceneDictionary(indexReader, "experience_desired");
SpellChecker spellChckr = new SpellChecker(spellIndexDirectory);
spellChckr.indexDictionary(dictionary);
I'm getting a null pointer exception when I call indexDictionary().
Here's how I index the field experience_desired:
doc.add(new Field("experience_desired", value, Field.Store.NO, Field.Index.TOKENIZED));
Is there another way I should do it so there is a way to build a dictionary on 
that field?

Thanks

Chris Salem
440.946.5214 x5458
[EMAIL PROTECTED] 







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene In Action Book vs Lucene 2.0

2006-09-26 Thread Otis Gospodnetic
Hi,

I think you'll find most of the book to still be useful (but then again, I'm 
the co-author, so maybe I'm not 100% objective).  One thing where the API 
changed is Fields.  They are now constructed differently, so the code in the 
book won't match the current API.
We have LIA code working under Lucene 2.0, but we'll have to wait and publish 
it along with LIA2.  It IS coming! :)

Otis

- Original Message 
From: KEGan [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, September 26, 2006 3:32:06 PM
Subject: Lucene In Action Book vs Lucene 2.0

Hi,

I have bought the Lucene In Action Book for more than a year now, and was
using Lucene 1.x during that time. Now, I have a new project with Lucene and
Lucene is now 2.0. Many APIs seem to have changed.

I would like to ask the experts here, what are the important or substantial
changes from Lucene 1.x to 2.0? Which part of the LIA book that is still
usable and which part is not? Any particular things that a new Lucene 2.0
user who has only used the 1.x version should pay attention to?

Thanks.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene In Action Book vs Lucene 2.0

2006-09-26 Thread KEGan

Otis,

What about the internals of Lucene? Are there any major changes in there?

LIA is such a great book. Any date when LIA2 is coming? I definitely must
get it :)


On 9/27/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:


Hi,

I think you'll find most of the book to still be useful (but then again,
I'm the co-author, so maybe I'm not 100% objective).  One thing where the
API changed is Fields.  They are now constructed differently, so the code in
the book won't match the current API.
We have LIA code working under Lucene 2.0, but we'll have to wait and
publish it along with LIA2.  It IS coming! :)

Otis

- Original Message 
From: KEGan [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, September 26, 2006 3:32:06 PM
Subject: Lucene In Action Book vs Lucene 2.0

Hi,

I have bought the Lucene In Action Book for more than a year now, and was
using Lucene 1.x during that time. Now, I have a new project with Lucene
and
Lucene is now 2.0. Many APIs seem to have changed.

I would like to ask the experts here, what are the important or
substantial
changes from Lucene 1.x to 2.0? Which part of the LIA book that is still
usable and which part is not? Any particular things that a new Lucene 2.0
user who has only used the 1.x version should pay attention to?

Thanks.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Multiple Terms, Delete From Index

2006-09-26 Thread Josh Joy
Hi Otis,

Won't that delete all documents with term1, then all documents with
term2... rather than deleting only the documents that contain both term1 and
term2... or am I missing the obvious and doing something wrong?

Thanks,
Josh
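
One way to get an AND-style delete (a sketch of my own, not something Otis
suggested): search for documents matching both terms, collect their document
numbers, then delete by number. The index path is made up; Lucene 2.0 only
offers deleteDocuments(Term) for a single term, hence the search:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("city", "city1")), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("state", "state1")), BooleanClause.Occur.MUST);

IndexSearcher searcher = new IndexSearcher("/path/to/index");
Hits hits = searcher.search(query);

// Collect the internal document numbers first; deleting while still walking
// Hits can make it skip results when it re-executes the query internally.
List docIds = new ArrayList();
for (int i = 0; i < hits.length(); i++) {
    docIds.add(new Integer(hits.id(i)));
}

IndexReader reader = searcher.getIndexReader();
for (int i = 0; i < docIds.size(); i++) {
    reader.deleteDocument(((Integer) docIds.get(i)).intValue());
}
searcher.close();  // flushes the deletes and closes the underlying reader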

Otis Gospodnetic wrote:
 Heh, I have to try the obvious - two
reader.delete(term) calls?

 Otis

 - Original Message 
 From: Josh Joy [EMAIL PROTECTED]
 To: java-user@lucene.apache.org
 Sent: Tuesday, September 26, 2006 10:04:13 PM
 Subject: Multiple Terms, Delete From Index

 Hi All,

 I need to delete from the index where 2 terms are
 matching, rather than 
 just one term.
 For example,

 IndexReader reader = IndexReader.open(dir);
 Term[] terms = new Term[2];
 terms[0] = new Term("city", "city1");
 terms[1] = new Term("state", "state1");
 reader.delete(terms);
 reader.close();

 Any suggestions?

 Thanks in advance,
 Josh


-
 To unsubscribe, e-mail:
[EMAIL PROTECTED]
 For additional commands, e-mail:
[EMAIL PROTECTED]









   

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]