Re: which HTML parser is better?

2005-02-01 Thread Michael Giles
When I tested parsers a year or so ago for intensive use in Furl, the best (tolerant of bad HTML) and fastest (tested on a 1.5M HTML page) parser by far was TagSoup ( http://www.tagsoup.info ). It is actively maintained and improved and I have never had any problems with it. -Mike Jingkang Zhang

Re: Title of PDF

2004-06-28 Thread Michael Giles
Don, I think you misunderstood Otis. You will need to use some sort of parser (e.g. pdfbox, xpdf) to get the title text from the PDF (I assume you are indexing the documents, so you have this already). Then you create a "title" field in your index and store the text of the title in there (so

Performance profile of optimization...

2004-05-24 Thread Michael Giles
What is the performance profile of optimizing an index? By that I mean, what are the primary variables that negatively impact its speed (e.g. index size (bytes, docs), number of adds/deletes since last optimization, etc). For example, if I add a single document to a small (i.e. < 10K docs) in

Re: Internal full content store within Lucene

2004-05-18 Thread Michael Giles
Certainly any advancement in this area seems like a good idea. I'll throw a use case on the pile as well. For my own interest, the biggest need is in highlighting (i.e. highlighting relevant segments within the full text of documents). I need to provide highlighted abstracts in the search resu

Re: Storing numbers

2004-03-09 Thread Michael Giles
Tim, Looks like you can only access it with a subscription. :( Sounds good, though. -Mike At 02:39 PM 3/9/2004, you wrote: [EMAIL PROTECTED] wrote: Hi! I want to store numbers (id) in my index: long id = 1069421083284L; doc.add(Field.UnStored("in", String.valueOf(id))); But se

Filtering out duplicate documents...

2004-03-08 Thread Michael Giles
I'm looking for a way to filter out duplicate documents from an index (either while indexing, or after the fact). It seems like there should be an approach of comparing the terms for two documents, but I'm wondering if any other folks (e.g. Nutch) have come up with a solution to this problem.
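One way to picture "comparing the terms for two documents" is a Jaccard overlap of their term sets. This is only a rough sketch in plain Java, not something the thread or Lucene prescribes: the class TermOverlap, its tokenization rule, and any cut-off you would pick for "duplicate" are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not part of Lucene): scores how much two texts overlap in terms.
class TermOverlap {
    // Lowercase the text and split on anything that is not a letter or digit.
    static Set<String> terms(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, from 0.0 (disjoint) to 1.0 (same terms).
    static double jaccard(String a, String b) {
        Set<String> ta = terms(a);
        Set<String> tb = terms(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0; // two empty documents are treated as identical
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }
}
```

A pair scoring close to 1.0 would be a candidate duplicate. Comparing every pair this way grows quadratically with the number of documents, so it only suits small collections or a pre-filtered candidate list.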

Re: Concurrency

2004-02-20 Thread Michael Giles
It would be great if we could come up with a way to integrate the Lucene locking information with something more incremental like rsync. At Furl ( http://www.furl.net ) we have this problem in spades because we have thousands (and thousands) of indexes that need to be backed up. Currently, we r

See Lucene in action at Furl...

2004-01-25 Thread Michael Giles
Furl - http://www.furl.net I've been meaning to write to the list about Furl for a while as it is a pretty cool use of Lucene (Otis finally connected the dots and tracked me down last week). Furl (http://www.furl.net) is basically an Internet filing cabinet for useful web pages. Or to put it

RE: Ordening documents

2004-01-16 Thread Michael Giles
William, The order of the results is going to be based on how well they match the query (i.e. weighted by relevancy). So although all of those values contain the term "Palm", I would assume you would get the shorter entries (i.e. 1 & 3) before the longer ones (2) as they have a higher percent
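To make the "shorter entries first" intuition concrete, here is a small sketch of a length normalization factor. It assumes the classic default of 1/sqrt(numTerms); the class name is made up, and a real score combines this with other factors (term frequency, inverse document frequency, boosts), so treat it as an illustration of the trend only.

```java
// Hypothetical illustration: a field with fewer terms gets a larger normalization factor,
// so a single match for "Palm" counts for more in a short field than in a long one.
class LengthNormSketch {
    // Assumption: mirrors the classic 1/sqrt(numTerms) length normalization.
    static double lengthNorm(int numTerms) {
        if (numTerms <= 0) {
            throw new IllegalArgumentException("numTerms must be positive");
        }
        return 1.0 / Math.sqrt(numTerms);
    }
}
```

With this factor a one-term field gets 1.0 and a four-term field gets 0.5, which is why the shorter entries would tend to come back first when everything else is equal.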

Re: Multiple Creation of Writers

2004-01-14 Thread Michael Giles
Couldn't you solve this by creating your own synchronized getWriter method? I'm thinking something like (pseudo code): protected void myProgram() { ... File dir = new File("c:/import/test"); IndexWriter wrt = getWriter(dir, new StandardAnalyzer(), create(dir)); ... } protected bo
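A minimal sketch of the synchronized getWriter idea, written without Lucene so it runs on its own. The class WriterRegistry, its release method, and the factory parameter are hypothetical; the type parameter W stands in for IndexWriter, and the factory stands in for whatever code actually constructs the writer.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: hands out at most one writer per index directory.
// W stands in for Lucene's IndexWriter so the sketch compiles without Lucene.
class WriterRegistry<W> {
    private final Map<String, W> writers = new HashMap<>();

    // synchronized: only one thread at a time can look up or create a writer,
    // so two threads can never both create a writer for the same directory.
    synchronized W getWriter(File dir, Function<File, W> factory) {
        String key = dir.getAbsolutePath();
        W writer = writers.get(key);
        if (writer == null) {
            writer = factory.apply(dir); // first caller creates the writer
            writers.put(key, writer);
        }
        return writer;
    }

    // Forget the writer for a directory; call this after closing the writer.
    synchronized void release(File dir) {
        writers.remove(dir.getAbsolutePath());
    }
}
```

Every caller goes through getWriter instead of constructing a writer directly, so a second request for the same directory receives the existing instance rather than a competing one.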

RE: Returning one result

2003-12-05 Thread Michael Giles
Tracy, I believe what Dror was referring to was the call to MultiFieldQueryParser.parse(). The second argument to that call is a String[] of field names on which to execute the query. If the field that contains "AR345" isn't listed in that array, you will not get any results. -Mike At 03:14

Re: Collaborative Filtering API

2003-11-25 Thread Michael Giles
Yes, he was the lead Ph.D. student on the GroupLens project at Minnesota. -Mike At 12:18 PM 11/25/2003, you wrote: Hello Mike, I had a quick look over the javadoc and it looks promising, as you said. Did Jon Herlocker work on GroupLens? I know GroupLens was quite a pioneer work in the early da

Re: Collaborative Filtering API

2003-11-25 Thread Michael Giles
You should check out the work of Jon Herlocker at Oregon State (http://eecs.oregonstate.edu/iis/). They have written a CF engine that has been on my to-do list to check out for a few months (sounds good on "paper"). If you get the chance to play with it, I'd be curious to hear your feedback.

Re: MultiFieldQueryParser default operator

2003-10-30 Thread Michael Giles
This would be great to get fixed (I think I emailed a similar question a month or so ago). If MultiFieldQueryParser is being mucked with, the constructor should be updated to take an array of fields instead of the single field it takes currently. The code snippet below is actually passing the

Re: Dash Confusion in QueryParser - Bug? Feature?

2003-10-20 Thread Michael Giles
ng in this case, the Analyzer will be able to handle it the same way it did when indexing (which is what we want). -Mike At 12:57 PM 10/20/2003, you wrote: On Wednesday, October 15, 2003, at 10:24 AM, Michael Giles wrote: I looked at the patch here: http://nagoya.apache.org/bugzilla/s

Re: Dash Confusion in QueryParser - Bug? Feature?

2003-10-15 Thread Michael Giles
So how do we move this issue forward? I can't think of a single case where a "-" with no whitespace on either side (e.g. t-shirt, Wal-Mart) should be interpreted as a NOT command. Is there a feeling that changing the interpretation of such cases is a break in compatibility? I agree that it w
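The rule being proposed can be sketched as a small predicate: a dash with non-whitespace on both sides belongs to the term, and anything else is left for the parser to treat as an operator. This is not how QueryParser is implemented, and the class and method names are invented for the example.

```java
// Hypothetical illustration of the proposed rule, independent of Lucene's QueryParser.
class DashRule {
    // True when the '-' at position i sits inside a word (e.g. "t-shirt", "Wal-Mart"),
    // meaning it should NOT be read as the NOT/prohibit operator.
    static boolean isIntraWordDash(String query, int i) {
        if (i <= 0 || i >= query.length() - 1 || query.charAt(i) != '-') {
            return false; // not a dash, or a dash at the very start or end of the query
        }
        boolean nonSpaceBefore = !Character.isWhitespace(query.charAt(i - 1));
        boolean nonSpaceAfter = !Character.isWhitespace(query.charAt(i + 1));
        return nonSpaceBefore && nonSpaceAfter;
    }
}
```

Under this rule the dash in "t-shirt" is part of the term, while the dash in "shirt -red" still follows whitespace and keeps its meaning as NOT.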

Re: Dash Confusion in QueryParser - Bug? Feature?

2003-10-14 Thread Michael Giles
So what do we need to do to resolve this? Has the discussion stopped because this is the "user" list and not "dev" or did it move over to the dev list? -Mike At 03:49 AM 10/13/2003, you wrote: Michael Giles wrote: He is probably using the StandardAnalyzer. I was about

Re: Dash Confusion in QueryParser - Bug? Feature?

2003-10-11 Thread Michael Giles
He is probably using the StandardAnalyzer. I was about to write the exact same email (but using Wal-Mart as an example on this page - http://www.benchmark.com/cgi-bin/suid/~bcmlp/newsletter.cgi?mode=show&year=2003&date=2003-10-07). I index and search with the same analyzer (Standard), but when

Default AND for multi-field queries...

2003-10-07 Thread Michael Giles
As with many people, I want the default query behavior to be AND (instead of OR). However, I'm also (always) creating multi-field queries. I don't see a way to accomplish this cleanly in the API. It would be great if MultiFieldQueryParser had a constructor that took an array of fields (i.e.
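One workaround that needs no API change is to expand the query by hand so that every term is required but may match in any of the fields. The sketch below only builds the query string, using the standard "+" (required) and "field:term" syntax; the class MultiFieldAnd is hypothetical, and it assumes the terms are plain words that need no escaping.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of Lucene: builds "+(f1:t1 f2:t1) +(f1:t2 f2:t2) ..."
class MultiFieldAnd {
    static String build(String[] terms, String[] fields) {
        StringJoiner clauses = new StringJoiner(" ");
        for (String term : terms) {
            // One required group per term; inside the group any field may match (OR).
            StringJoiner perField = new StringJoiner(" ", "+(", ")");
            for (String field : fields) {
                perField.add(field + ":" + term);
            }
            clauses.add(perField.toString());
        }
        return clauses.toString();
    }
}
```

For the terms "palm" and "pilot" over the fields "title" and "body" this yields `+(title:palm body:palm) +(title:pilot body:pilot)`, which can then be handed to the ordinary single-field query parser.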

Re: HTML Parsing problems...

2003-09-22 Thread Michael Giles
Yeah, I was using HTMLParser for a few days until I tried to parse a 400K document and it spun at 100% CPU for a very long time. It is tolerant of bad HTML, but does not appear to scale. TagSoup processed the same document in a second or less at <25% CPU. -Mike At 02:42 PM 9/22/2003 +0200, y

Re: HTML Parsing problems...

2003-09-20 Thread Michael Giles
Erik, Probably a good idea to swap something else in, although Neko introduces a dependency on Xerces. I didn't play with Neko because I am currently using a different XML parser and didn't want to deal with the conflicts (and also find dependencies on specific parsers annoying). However, yes

Re: HTML Parsing problems...

2003-09-19 Thread Michael Giles
Tatu, Thanks for the reply. See below for comments. > just ignore everything inside of

HTML Parsing problems...

2003-09-18 Thread Michael Giles
I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but I also know that it is updated from time to time and performs much better than the other ones that I have tested. Frustratingly, the very first page I tried to parse failed (