Re: performance on filtering against thousands of different publications

2007-08-14 Thread Cedric Ho
Hi Steven, Thanks for your clarification. I am using the Searcher.search(query, filter, n, sort) method. I presume this method doesn't have the same problem, since I already pass it the max number of results returned. Regards, Cedric On 8/15/07, Steven Rowe <[EMAIL PROTECTED]> wrote: > Hi Cedr

Re: performance on filtering against thousands of different publications

2007-08-14 Thread Cedric Ho
> > Some options: > 1) Try minimise leaping around the disk - maybe sorting your selected terms > will help. Look at methods in TermEnum and TermDocs which you can use to > build your own bitset from your (sorted) list of terms. Thanks, I'll try this method. > 2) Can you add higher-level terms
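The suggestion in option 1 above (sort the selected terms, then build your own bitset from them) can be sketched in plain Java. This is only an illustration under stated assumptions: the TreeMap named postings is a hypothetical in-memory stand-in for the index's term dictionary and postings, not the TermEnum/TermDocs API itself, and the class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.TreeMap;

/** Plain-Java stand-in for a term-to-postings lookup; this is NOT the Lucene API. */
final class SortedTermBitSetSketch {

    /**
     * Builds a bitset of document ids that contain any of the selected terms.
     * The terms are sorted first so the lookups move through the (sorted)
     * term dictionary in one direction rather than jumping back and forth.
     */
    static BitSet buildBitSet(String[] selectedTerms,
                              TreeMap<String, int[]> postings,
                              int maxDoc) {
        String[] sorted = selectedTerms.clone();
        Arrays.sort(sorted);

        BitSet bits = new BitSet(maxDoc);
        for (String term : sorted) {
            // Hypothetical stand-in for positioning a TermDocs on the term.
            int[] docs = postings.get(term);
            if (docs == null) {
                continue; // term not present in the index
            }
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        return bits;
    }
}
```

In a real index the inner loop would iterate the documents reported for each term; the resulting BitSet could then back a custom filter.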

query question

2007-08-14 Thread Mohammad Norouzi
Hi, I am using WhitespaceAnalyzer and the query is "icdCode:H*", but there is no result, although I know that there are many documents with this field value, such as H20, H20.5, etc. This field is tokenized and indexed. What is wrong with this? When I test this query with Luke it will return no res
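One possible cause, offered as an assumption since the listing does not show the replies: in Lucene 2.x the QueryParser lowercases wildcard and prefix terms by default (see setLowercaseExpandedTerms), while WhitespaceAnalyzer leaves indexed tokens in their original case, so "H*" would be searched as "h*". The plain-Java sketch below only simulates that mismatch; the class and method names are hypothetical and no Lucene call is made.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simulates prefix matching against case-preserved index terms; NOT the Lucene API. */
final class PrefixCaseSketch {

    /**
     * Returns the indexed terms that start with the given prefix.
     * When lowercasePrefix is true the prefix is lowercased first,
     * mimicking a parser that lowercases expanded (wildcard/prefix) terms.
     */
    static List<String> matchPrefix(List<String> indexedTerms,
                                    String prefix,
                                    boolean lowercasePrefix) {
        String p = lowercasePrefix ? prefix.toLowerCase(Locale.ROOT) : prefix;
        List<String> hits = new ArrayList<>();
        for (String term : indexedTerms) {
            if (term.startsWith(p)) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

If this is indeed the cause, either disabling the lowercasing on the parser or lowercasing the field at index time would make the two sides agree.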

Re: Question on custom scoring

2007-08-14 Thread Srinivas.N.
Could be normalized relative to the max score among the matching documents - but I realize that this can only be done AFTER collecting the documents (as the Hits class does currently). It could also be normalized to some "absolute relevance score" that is comparable across queries, but there is no
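Normalizing relative to the max score, as described above, can be sketched as follows. This is a minimal illustration, assuming the raw scores have already been collected into an array; the class and method names are hypothetical and this is not how any particular Lucene class is implemented.

```java
/** Minimal sketch of max-score normalization over already-collected scores. */
final class ScoreNormalizationSketch {

    /** Returns a new array in which every score is divided by the maximum score. */
    static float[] normalizeToMax(float[] rawScores) {
        // First pass: the maximum is only known once all matches are collected.
        float max = 0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }

        float[] normalized = new float[rawScores.length];
        if (max <= 0f) {
            return normalized; // no positive score; leave everything at zero
        }
        // Second pass: scale so the best match gets 1.0.
        for (int i = 0; i < rawScores.length; i++) {
            normalized[i] = rawScores[i] / max;
        }
        return normalized;
    }
}
```

The two passes make the point from the message concrete: the scaling factor depends on the full result set, so the values are not comparable across queries.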

Re: SpanQuery and database join

2007-08-14 Thread Peter Keegan
I added this under Use Cases. Thanks for the suggestion. Peter On 8/13/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > There is also a Use Cases item on the Wiki... > > On Aug 13, 2007, at 3:26 PM, Peter Keegan wrote: > > > I suppose it could go under performance or HowTo/Interesting uses of

Re: 答复: Indexing correctly?

2007-08-14 Thread karl wettin
On 14 Aug 2007, at 21:34, John Paul Sondag wrote: What exactly is a RAMDirectory, I didn't see it mentioned on that page. Is there example code of using it? Do I just create a Ram Directory and then use it like it's a normal directory? Yes, it is just like FSDirectory, but resides in RAM a

Re: How to keep user search history and how to turn it into information?

2007-08-14 Thread Peter W.
Lukas, One last thing: be sure to log only when a user clicks on a result. In Hadoop, document_id will be a key in the map phase. Lucene related steps are the same. Best, Peter W. On Aug 14, 2007, at 1:28 PM, Peter W. wrote: When users perform a search, log the unique document_id, IP add

Re: How to keep user search history and how to turn it into information?

2007-08-14 Thread Peter W.
Hey Lukas, You can get a basic demo of this working in Lucene first then make a more advanced and efficient version. First, give each document in your index a score field using NumberTools so it's sortable. When users perform a search, log the unique document_id, IP address and result position f
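The logging step described above can be sketched in plain Java. This is a rough illustration only: the Click class, its field names, and the counting method are hypothetical, and the message itself is cut off before it says what is done with the logged values.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical click log with a per-document count; names are illustrative only. */
final class ClickLogSketch {

    /** One logged click: the document id, the client IP and the result position. */
    static final class Click {
        final String documentId;
        final String ipAddress;
        final int resultPosition;

        Click(String documentId, String ipAddress, int resultPosition) {
            this.documentId = documentId;
            this.ipAddress = ipAddress;
            this.resultPosition = resultPosition;
        }
    }

    /** Counts clicks per document id, i.e. groups the log by document_id. */
    static Map<String, Integer> clicksPerDocument(List<Click> log) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Click click : log) {
            counts.merge(click.documentId, 1, Integer::sum);
        }
        return counts;
    }
}
```

A count like this is one candidate for the value stored in the sortable score field; whether the original poster intended exactly that is an assumption.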

Re: Question on custom scoring

2007-08-14 Thread Chris Hostetter
: [1] I need to rank matches by some combination of keyword match, popularity : and recency of the doc. I read the docs about CustomScoreQuery and seems to : be a reasonable fit. An alternate way of achieving my goals is to use a : custom sort. What are the trade-offs between these two approaches?

Re: Rank based on lists.

2007-08-14 Thread Chris Hostetter
: Thanks for pointing me at the DisjunctionMaxQuery, though you're : correct, this is close but not exactly what I want. : : I think the difference lies in that it's not which subexpression had : the greater score, but that a normally lower scoring document should : get its rank elevated becaus

Re: 答复: Indexing correctly?

2007-08-14 Thread John Paul Sondag
Hello again, The files are local, sorry for using the confusing /mounts, I can see where that is confusing. What exactly is a RAMDirectory, I didn't see it mentioned on that page. Is there example code of using it? Do I just create a Ram Directory and then use it like it's a normal director

Re: Rank based on lists.

2007-08-14 Thread Grant Ingersoll
On Aug 14, 2007, at 11:57 AM, Walt Stoneburner wrote: Grant, Thanks for pointing me at the DisjunctionMaxQuery, though you're correct, this is close but not exactly what I want. I think the difference lies in that it's not which subexpression had the greater score, but that a normally low

RE: MultiSearcher with mulitple filter

2007-08-14 Thread Spencer Tickner
Wow Mark, quite the hint. Thanks so much. Spencer -Original Message- From: Mark Miller [mailto:[EMAIL PROTECTED] Sent: August 14, 2007 12:07 PM To: java-user@lucene.apache.org Subject: Re: MultiSearcher with mulitple filter Here is a hint: package org.apache.lucene.search; import jav

Re: MultiSearcher with mulitple filter

2007-08-14 Thread Mark Miller
Here is a hint: package org.apache.lucene.search; import java.io.IOException; /** * Implements search over a set of Searchables using multiple filters. */ public class MultiFilterMultiSearcher extends MultiSearcher { public MultiFilterMultiSearcher(Searchable[] searchables) throws

MultiSearcher with mulitple filter

2007-08-14 Thread Spencer Tickner
Hi List, Thanks in advance for the help. I can't wrap my head around the MultiSearcher. I need to search across multiple indexes, but also need to filter documents from users based on Access. The problem seems to be that MultiSearcher takes in 1 filter, however my filter varies from one index t

Re: performance on filtering against thousands of different publications

2007-08-14 Thread Steven Rowe
Hi Cedric, Cedric Ho wrote: > On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote: >> Are you iterating through a Hits object that has more than >> 100 (maybe it's 200 now) entries? Are you loading each document that >> satisfies the query? Etc. Etc. > > Unfortunately, yes. And I know this is an

Re: Rank based on lists.

2007-08-14 Thread Walt Stoneburner
Grant, Thanks for pointing me at the DisjunctionMaxQuery, though you're correct, this is close but not exactly what I want. I think the difference lies in that it's not which subexpression had the greater score, but that a normally lower scoring document should get its rank elevated because i

Best index architecture

2007-08-14 Thread Albert Vila
Hi, we have a system like 'google news'. We currently parse and index over 180,000 headlines per day. One month of data is 10 GB, the indexing process takes about 2 hours, and the index size is about 6 GB (we're using mergeFactor 40, setMaxBufferedDocs 10, setRAMBufferSizeMb 500 and useCompou

Re: Indexing PDF documents with structure information

2007-08-14 Thread Mathieu Lecarme
Thomas Arni wrote: > Hello Luceners > > I have started a new project and need to index pdf documents. > There are several projects around, which allow to extract the content, > like pdfbox, xpdf and pjclassic. > > As far as I studied the FAQ's and examples, all these > tools allow simple text ex

Re: Update boost factor for indexed document using setBoost()

2007-08-14 Thread Koji Sekiguchi
Hi Rohit, The way I showed you doesn't suit your need, because FieldNormModifier is meant for modifying, in batch mode, all fieldNorm values of the field specified as a command-line parameter. You can have an extra field other than content and store the points in that field. Then use Fu

Re: formalizing a query

2007-08-14 Thread Abu Abdulla alhanbali
Thanks for the help, please provide the code to do that. I tried with this one but it didn't work: Query filterQuery = MultiFieldQueryParser.parse(new String[]{query1, query2, query3, query4, }, new String[]{field1, field2, field1, field2, ... }, new KeywordAnalyzer()); this results in: field

Re: performance on filtering against thousands of different publications

2007-08-14 Thread mark harwood
>>Do you mean it will count the number of documents for each publication source? Lucene does that for all terms. The Luke plugin simply offers a visualisation of the variance in term frequencies for a field. It looks something like this: http://www.ucl.ac.uk/~ucbplrd/zipf.png >>each set can be

Re: Update boost factor for indexed document using setBoost()

2007-08-14 Thread rohit saini
Hi Koji, please give me an example. Let me explain what I want to do: I have indexed some documents. Now I want to update the ranking of the documents based on the following criteria: 1.) The documents which come into search result should get one point 2.) The documents which are viewed by the user s

Similarity

2007-08-14 Thread Enis Soztutar
Hi, I want to define different implementations for the functions in Similarity class. For example I need to define sloppyFreq() differently for fields "foo" and "bar". Is there a way around this? Also I wonder why the fieldname is passed to some of the functions in Similarity (such as Similari