Re: Scoring a document (count?)

2006-07-31 Thread Chris Hostetter
it would certainly be possible to get a score that was a simple count of the number of matching clauses of a boolean query -- probably just with a modified Similarity (no coord, 1/0 tf, no idf, no norms) but you *might* need a slightly modified TermScorer to do that. In general though, i think yo
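
As a rough illustration of the "score as a simple count of matching clauses" idea in this reply, here is a minimal sketch in plain Java. It does not use Lucene's Similarity or TermScorer classes; the class name `ClauseCountSketch`, the method `countMatchingClauses`, and the whitespace tokenization are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClauseCountSketch {

    // Hypothetical helper: the score is the number of query terms
    // (standing in for optional boolean clauses) that occur in the document.
    // Each match contributes exactly 1: no tf, idf, coord, or norms.
    public static int countMatchingClauses(List<String> queryTerms, String documentText) {
        // Assumed tokenization: lowercase, split on whitespace.
        Set<String> docTerms = new HashSet<>(
                Arrays.asList(documentText.toLowerCase().trim().split("\\s+")));
        int score = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term.toLowerCase())) {
                score++;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("action", "comedy", "drama");
        System.out.println(countMatchingClauses(query, "A dark comedy drama"));
    }
}
```

In Lucene itself the same effect would come from the scoring configuration described in the reply rather than from a separate pass like this one.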

Re: java.lang.IllegalAccessError: tried to access method org.apache.lucene.search.HitDoc.

2006-07-31 Thread Chris Hostetter
What JVM are you using? Can you post a small sample program (or better yet: jUnit test) that causes this problem ? : Date: Sun, 30 Jul 2006 07:31:55 -0700 : From: Alan Ezust <[EMAIL PROTECTED]> : Reply-To: java-user@lucene.apache.org : To: java-user@lucene.apache.org : Subject: java.lang.Illegal

Re: Sorting

2006-07-31 Thread Chris Hostetter
1) I didn't know there were any JVMs that limited the heap size to 1GB ... a 32bit address space would impose a hard limit of 4GB, and I've heard that Windows limits process to 2GB, but I don't know of any JVMs that have 1GB limits. If you really need to deal with indexes big enough for that to m

Re: Sorting

2006-07-31 Thread Andrzej Bialecki
Chris Hostetter wrote: 1) I didn't know there were any JVMs that limited the heap size to 1GB ... a 32bit address space would impose a hard limit of 4GB, and I've heard that Windows limits process to 2GB, but I don't know of any JVMs that have 1GB limits. I believe all Win32 JVM-s have a lim

Re: Sorting

2006-07-31 Thread karl wettin
On Mon, 2006-07-31 at 11:54 +0200, Andrzej Bialecki wrote: > Chris Hostetter wrote: > > 1) I didn't know there were any JVMs that limited the heap size to 1GB ... > > a 32bit address space would impose a hard limit of 4GB, and I've heard > > that Windows limits process to 2GB, but I don't know of a

RE: Sorting

2006-07-31 Thread Rob Staveley (Tom)
Ref 1: I was just about to show you a link at Sun but I realise that it was my misread! OK, so the maximum heap is 2G on a 32-bit Linux platform, which doubles the numbers, and yes indeed 64 bits seems like a good idea, if having sort indexes in RAM is a good use of resources. But there must be a b

Index empty fields

2006-07-31 Thread Simon Willnauer
Hello, I have a question about whether fields with empty content should be added to the document / index or not. I have an index schema, which defines all fields a document can have. If one of the real documents has no content for a certain field, should that field be added to the index or not? Would

RE: About search performance

2006-07-31 Thread Russell M. Allen
You should build your own performance test cases to see what works for your data. That being said, here are some numbers from a similar test I ran: I did the following: 1) run a single term query which resulted in about half of the total set of documents being returned. (~36,000) 2) built a Bo

Re: PerFieldAnalyzerWrapper use? Analyzer's not being used as expected....

2006-07-31 Thread Michael J. Prichard
Awesome! Thanks! Otis Gospodnetic wrote: Or simpler: wr = new IndexWriter(indexDir, aWrapper, !IndexReader.indexExists(indexDir)); - Original Message From: Michael J. Prichard <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Sunday, July 30, 2006 1:35:29 PM Subject: Re: PerF

Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

2006-07-31 Thread Michael J. Prichard
Hey Otis, Sure I would love to! Can you ping me at [EMAIL PROTECTED] and let me know what I need to do? Do I just post it to JIRA? Thanks, Michael Otis Gospodnetic wrote: A good place for that is JIRA. Could you put it there? We have a bunch of analyzers in Lucene's contrib, so if you

Filters or BooleanQuery

2006-07-31 Thread Michael J. Prichard
This is more of a design question. I have a ton of email that is indexed. I need to search based on a date range so I use a RangeQuery added to a BooleanQuery to search. This works. Now I need to include another clause that will narrow the result even more. AND on top of that I will need s

Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

2006-07-31 Thread Suba Suresh
I would like to use the email analyzer code. I am thinking of using it along with the JavaMail API. I have two different projects. In one I have to parse the emails sent and extract the subject and the email address. In the other project I have to parse and index it in Lucene for later search and re

Re: email libraries

2006-07-31 Thread Suba Suresh
Thanks for all the responses. I am going to investigate the JavaMail API along with Michael's "Email Analyzer" code that was posted in this group. thanks, suba suresh. John Haxby wrote: Andrzej Bialecki wrote: Just for the record - I've been using javamail POP and IMAP providers in the past, and

RE: Scoring a document (count?)

2006-07-31 Thread Russell M. Allen
Thank you for the reply. I am certainly open to different ways of organizing / indexing our documents. However, the example I provided was simplified for the sake of the discussion. In truth, what I was calling a category may be an arbitrary set of movie ids (determined by a previous query). Th

RE: Scoring a document (count?)

2006-07-31 Thread Russell M. Allen
Thank you for the reply Doran! You are exactly right about the sql count(*). I need the equivalent of group by, and count(). We have considered a 'joined' index where we would have a document for each permutation. We discarded it (possibly prematurely) based on the rapid explosion in the number

Re: Filter updating

2006-07-31 Thread Erick Erickson
Of course, another approach doesn't occur to me until the weekend. But, even if building a filter is a time-consuming process, you could always build them as a warm-up when your searcher starts, and cache them *then*. That way, the user doesn't see a long pause when the filter is built the fir
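
The warm-up idea in this reply (build the expensive filters once when the searcher starts, then serve them from a cache) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it uses `java.util.BitSet` and an `IntPredicate` in place of Lucene's Filter and IndexReader, and the names `FilterWarmupSketch`, `warmUp`, and `get` are hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

public class FilterWarmupSketch {

    private final Map<String, BitSet> cache = new HashMap<>();
    private final int maxDoc; // number of documents in the (imagined) index

    public FilterWarmupSketch(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    // The slow part: visit every document id once and record the matches.
    private BitSet build(IntPredicate matches) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (matches.test(doc)) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Called once at start-up, before any user query arrives.
    public void warmUp(Map<String, IntPredicate> knownFilters) {
        for (Map.Entry<String, IntPredicate> e : knownFilters.entrySet()) {
            cache.put(e.getKey(), build(e.getValue()));
        }
    }

    // Called at query time: a map lookup, no rebuilding.
    // Returns null for a filter name that was never warmed up.
    public BitSet get(String name) {
        return cache.get(name);
    }
}
```

The trade-off is a slower start-up in exchange for no long pause on the first query that needs a given filter.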

Cypher - Natural Language to RDF/SeRQL for the Semantic Web

2006-07-31 Thread Sherman Monroe
Hi All, I thought our technology might interest the group. Cypher is one of the first software programs available that generates a metadata representation of natural language input. The program outputs RDF graph and SeRQL query representations of sentences, clauses, and phrases. The Cypher framewo

Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

2006-07-31 Thread Steven Rowe
Michael J. Prichard wrote: > Hey Otis, > > Sure I would love to! Can you ping me at [EMAIL PROTECTED] and > let me know what I need to do? Do I just post it to JIRA? > > Thanks, > Michael > > Otis Gospodnetic wrote: > >> A good place for that is JIRA. Could you put it there? We have a >> b

Re: EMAIL ADDRESS: Tokenize (i.e. an EmailAnalyzer)

2006-07-31 Thread Michael J. Prichard
Steven Rowe wrote: Michael J. Prichard wrote: Hey Otis, Sure I would love to! Can you ping me at [EMAIL PROTECTED] and let me know what I need to do? Do I just post it to JIRA? Thanks, Michael Otis Gospodnetic wrote: A good place for that is JIRA. Could you put it there? We ha

Re: Index empty fields

2006-07-31 Thread Otis Gospodnetic
Hi Simon, If you want to be able to run a "give me all documents that have an empty field F" query, then you'll actually have to stuff a "dummy" value when no real value for field F is present. If you have an index schema, perhaps that's a good place to add a 'defaultValue'-type attribute with that

Re: Index empty fields

2006-07-31 Thread Simon Willnauer
Hi Otis, well, if I have to run such a query I need a "dummy" value. To explain in a bit more detail: an XML document has "n" mandatory elements described by a schema or DTD. Some of them could have empty values. Would it make any difference to the index / searching if I just index an empty st

Re: Index empty fields

2006-07-31 Thread Otis Gospodnetic
Hi Simon, You can index an empty field ("" value), but there is no point in doing that, really. If you index an empty string, you will not be able to find documents that had that field empty. You will not be able to do a WHERE foo IS NULL type of query, unless you detect an empty field during i
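
The "dummy value" approach from this thread can be illustrated with a small plain-Java sketch. It is not Lucene code: the in-memory postings map stands in for an index, and the marker `__EMPTY__`, the class `EmptyFieldSketch`, and its methods are hypothetical names chosen for the example. The point is only that an empty value becomes searchable once a marker term is substituted at indexing time.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class EmptyFieldSketch {

    // Hypothetical marker; pick something that cannot occur in real data.
    public static final String EMPTY_MARKER = "__EMPTY__";

    // term -> ids of documents whose field contains that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Detect an empty field at indexing time and substitute the marker.
    public static String valueToIndex(String raw) {
        return (raw == null || raw.trim().isEmpty()) ? EMPTY_MARKER : raw.trim();
    }

    public void add(int docId, String fieldValue) {
        for (String term : valueToIndex(fieldValue).split("\\s+")) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }
}
```

Searching for `EmptyFieldSketch.EMPTY_MARKER` then plays the role of the WHERE foo IS NULL query mentioned above.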

Seattle, August 9, Lucene/Nutch/Hadoop Meetup

2006-07-31 Thread Michael Cafarella
Hi everyone, If you're in Seattle for SIGIR, come to this meeting of FOHLNs (Friends Of Hadoop, Lucene and Nutch). We'll talk about search and get something to eat and drink. Please RSVP via the Evite below so I can get a bigger venue if necessary. http://www.evite.com/app/publicUrl/[EMAIL PRO

Re: Index empty fields

2006-07-31 Thread Yonik Seeley
On 7/31/06, Simon Willnauer <[EMAIL PROTECTED]> wrote: Hello, I have a question about whether fields with empty content should be added to the document / index or not. I have an index schema, which defines all fields a document can have. If one of the real documents has no content for a certain fiel

Search matching

2006-07-31 Thread Rajiv Roopan
Hello, I have an index of locations for example. I'm indexing one field using SimpleAnalyzer. doc1: albany ny doc2: hudson ny doc3: new york ny doc4: new york mills ny when I search for "new york ny" , the first result returned is always "new york mills ny". Am I doing something incorrect? than