which HTML parser is better?

2005-02-01 Thread Jingkang Zhang
Three HTML parsers (the Lucene web application demo, the CyberNeko HTML Parser, and JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter the tags that MS Word's 'Save As HTML' function creates automatically?

Can I sort search results by score and docID at one time?

2005-02-01 Thread Jingkang Zhang
Lucene supports sorting by score or by docID. Now I want to sort search results by score and docID, or by two fields at once, like the SQL clause ORDER BY score, docID. How can I do it?

lucene docs in bulk read?

2005-02-01 Thread Chris Fraschetti
Hey folks, thanks in advance to any who respond. I do a good deal of post-search processing, and the file I/O to read the fields I need becomes horribly costly and is definitely a problem. Is there any way to retrieve either 1. the entire doc (all fields that can be retrieved) and/or 2. a group

Re: which HTML parser is better?

2005-02-01 Thread sergiu gordea
Jingkang Zhang wrote: Three HTML parsers (the Lucene web application demo, the CyberNeko HTML Parser, and JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter the tags that MS Word's 'Save As HTML' function creates automatically? Maybe you can try this library...

Source code for an accent-removal filter

2005-02-01 Thread Peter Pimley
Hi. In December I made some posts concerning a filter that could work by getting the Unicode name of a character and trying to figure out the closest Latin equivalent. For example, if it encountered character 00C1 LATIN CAPITAL LETTER A WITH ACUTE, it would be clever enough to replace that
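
A minimal sketch of the name-based idea described in this post, assuming nothing about Peter's actual source. The class `AccentFolder` and its regular expression are hypothetical; the only library call relied on is the JDK's `Character.getName`, which returns the Unicode name of a code point. Letters whose names lack a "WITH ..." suffix (for example LATIN SMALL LETTER SHARP S) are left unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the filter posted to the list: folds accented
// Latin letters to their base letter by parsing the Unicode character name.
final class AccentFolder {
    // Matches names such as "LATIN CAPITAL LETTER A WITH ACUTE".
    private static final Pattern NAME =
        Pattern.compile("LATIN (CAPITAL|SMALL) LETTER ([A-Z]) WITH .+");

    // Returns the base letter for an accented Latin letter, else c itself.
    static char fold(char c) {
        String name = Character.getName(c); // null for unassigned code points
        if (name == null) {
            return c;
        }
        Matcher m = NAME.matcher(name);
        if (!m.matches()) {
            return c; // plain letters, digits, non-Latin scripts, ligatures
        }
        char base = m.group(2).charAt(0);
        return m.group(1).equals("SMALL") ? Character.toLowerCase(base) : base;
    }

    // Applies fold(char) to every character of the text.
    static String fold(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            out.append(fold(text.charAt(i)));
        }
        return out.toString();
    }
}
```

In a Lucene setting this logic would presumably sit inside a token filter; that wiring is left out here because it depends on the Lucene version in use.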

Adding Fields to Document (with same name)

2005-02-01 Thread TheRanger
Hi, what happens when I add two fields with the same name to one Document? Document doc = new Document(); doc.add(Field.Text("bla", "this is my first text")); doc.add(Field.Text("bla", "this is my second text")); Will the second text overwrite the first, because only one field can be held with the same

Re: Adding Fields to Document (with same name)

2005-02-01 Thread Chris Lamprecht
Hi Karl, From _Lucene in Action_, section 2.2, when you add the same field with different values: "Internally, Lucene appends all the words together and index them in a single Field ..., allowing you to use any of the given words when searching." See also
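
The behavior quoted above can be pictured with a toy model. This is only an illustration of the "appended, not overwritten" semantics, assuming whitespace tokenization and lowercasing; `MultiValueDoc` is a hypothetical class and is not how Lucene stores fields internally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical): values added under the same field name
// accumulate, so terms from every value remain searchable.
final class MultiValueDoc {
    private final Map<String, List<String>> terms = new HashMap<>();

    // Tokenizes the text on whitespace and appends the tokens to the field.
    void add(String field, String text) {
        List<String> bucket = terms.computeIfAbsent(field, k -> new ArrayList<>());
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                bucket.add(token);
            }
        }
    }

    // True if the term occurs in any value added under the field.
    boolean matches(String field, String term) {
        List<String> bucket = terms.get(field);
        return bucket != null && bucket.contains(term.toLowerCase());
    }
}
```

Adding "this is my first text" and then "this is my second text" under the field "bla" leaves both "first" and "second" matchable in this model.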

Re: Can I sort search results by score and docID at one time?

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 4:21 AM, Jingkang Zhang wrote: Lucene supports sorting by score or by docID. Now I want to sort search results by score and docID, or by two fields at once, like the SQL clause ORDER BY score, docID. How can I do it? Sorting by multiple fields (including score and document id) is
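
To make the requested ordering concrete, here is a small plain-Java sketch of "ORDER BY score, docID" applied to hits that have already been collected. `Hit` and `HitSorter` are hypothetical names, and the direction (score descending, docID ascending as a tie-breaker) is an assumption. Within Lucene itself, the `Sort` and `SortField` classes are the likely route; check the Javadoc for your version.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical value object: one search hit.
final class Hit {
    final int docId;
    final float score;

    Hit(int docId, float score) {
        this.docId = docId;
        this.score = score;
    }
}

final class HitSorter {
    // Primary key: score, highest first. Secondary key: docId, lowest first.
    static final Comparator<Hit> BY_SCORE_THEN_DOC =
        Comparator.comparingDouble((Hit h) -> h.score).reversed()
                  .thenComparingInt(h -> h.docId);

    // Returns the doc ids in sorted order without modifying the input list.
    static List<Integer> sortedDocIds(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(BY_SCORE_THEN_DOC);
        List<Integer> ids = new ArrayList<>();
        for (Hit h : copy) {
            ids.add(h.docId);
        }
        return ids;
    }
}
```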

Re: which HTML parser is better?

2005-02-01 Thread Michael Giles
When I tested parsers a year or so ago for intensive use in Furl, the best (tolerant of bad HTML) and fastest (tested on a 1.5M HTML page) parser by far was TagSoup ( http://www.tagsoup.info ). It is actively maintained and improved and I have never had any problems with it. -Mike Jingkang Zhang

Duplicate Hits

2005-02-01 Thread Jerry Jalenak
Is there a way to eliminate duplicate hits being returned from the index? Jerry Jalenak Senior Programmer / Analyst, Web Publishing LabOne, Inc.

Re: lucene docs in bulk read?

2005-02-01 Thread Kelvin Tan
Hi Chris, are your fields string or reader? How large do your fields get? Kelvin On Tue, 1 Feb 2005 01:40:39 -0800, Chris Fraschetti wrote: Hey folks.. thanks in advance to any who respond... I do a good deal of post-search processing and the file io to read the fields I need becomes

Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 9:01 AM, Jerry Jalenak wrote: Is there a way to eliminate duplicate hits being returned from the index? Sure, don't put duplicate documents in the index :) Erik

RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
OK, OK. Should have seen that response coming 8-) The documents I'm indexing are sent from a legacy system, and can be sent multiple times - but I only want to keep the documents if something has changed. If the indexed fields match exactly, I don't want to index the second (or third, fourth, etc.)

User Rights Management in Lucene

2005-02-01 Thread Verma Atul (extern)
Hi, I'm new to Lucene and want to know whether Lucene has the capability of displaying search results based on the user's rights. For example: suppose there are some resources, like: Resource 1 Resource 2 Resource 3 Resource 4. And there are, say, 2 users, with User 1 having access to

Re: User Rights Management in Lucene

2005-02-01 Thread PA
On Feb 01, 2005, at 16:01, Verma Atul (extern) wrote: I'm new to Lucene and want to know whether Lucene has the capability of displaying search results based on the user's rights. Not by itself. But you can make it so. Cheers -- PA, Onnay Equitursay http://alt.textdrive.com/

Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote: Given Erik's response of 'don't put duplicate documents in the index', how can I accomplish this in the IndexWriter? I was dealing with a similar requirement recently. I eventually decided on storing the MD5 checksum of the document as a keyword. It means reading it
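
A minimal sketch of the checksum approach John describes, using only the JDK. `ChecksumDeduper` is a hypothetical class, and the in-memory `Set` stands in for the index lookup: in a real setup the checksum would be stored as a keyword field and looked up in the index, which this sketch does not attempt. Hashing the UTF-8 bytes of the content is also an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: skips documents whose content checksum was seen before.
final class ChecksumDeduper {
    // Stand-in for "search the index for this checksum".
    private final Set<String> seen = new HashSet<>();

    // Returns the MD5 digest of the content as 32 lowercase hex characters.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True if the content has not been seen yet and should be indexed.
    boolean shouldIndex(String content) {
        return seen.add(md5Hex(content));
    }
}
```

The checksum only catches byte-identical content; if only some fields should count toward "has something changed", the digest would be computed over just those fields.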

RE: User Rights Management in Lucene

2005-02-01 Thread Verma Atul (extern)
Thanks for the help. This means that the User management has to be done over Lucene.

RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
Nice idea John - one I hadn't considered. Once you have the checksum, do you 'check' in the index first before storing the second document? Or do you filter on the query side?

Re: User Rights Management in Lucene

2005-02-01 Thread PA
On Feb 01, 2005, at 16:07, Verma Atul (extern) wrote: Thanks for the help. This means that the User management has to be done over Lucene. Your choice. But in a nutshell, yes.

Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 9:49 AM, Jerry Jalenak wrote: Given Erik's response of 'don't put duplicate documents in the index', how can I accomplish this in the IndexWriter? As John said - you'll have to come up with some way of knowing whether you should index or not. For example, when dealing with

Re: User Rights Management in Lucene

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 10:01 AM, Verma Atul (extern) wrote: Hi, I'm new to Lucene and want to know whether Lucene has the capability of displaying search results based on the user's rights. For example: suppose there are some resources, like: Resource 1 Resource 2 Resource 3 Resource 4. And there

Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote: Nice idea John - one I hadn't considered. Once you have the checksum, do you 'check' in the index first before storing the second document? Or do you filter on the query side? I do a quick search for the md5 checksum before indexing. Although I suspect not applicable in

RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
Just to make sure I understand: do you keep an IndexReader open at the same time you are running the IndexWriter? From what I can see in the JavaDocs, it looks like only IndexReader (or IndexSearcher) can peek into the index and see if a document exists or not. Thanks!

RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexReader open to the index at the same time I have an IndexWriter open. I may have to try and deal with this issue through

Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote: OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexReader open to the index at the same time I have an IndexWriter open. I may have to try and deal with

IndexSearcher close

2005-02-01 Thread Ravi
Is there a way to check if an IndexSearcher is closed? Thanks in advance, Ravi.

Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote: OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexReader open to the index at the same time I have an IndexWriter open.

How to get document count?

2005-02-01 Thread Jim Lynch
I've indexed a large set of documents and think that something may have gone wrong somewhere in the middle. Is there a way I can display the count of documents in the index? Thanks, Jim.

Re: How to get document count?

2005-02-01 Thread Luke Shannon
Not sure if the API provides a method for this, but you could use Luke: http://www.getopt.org/luke/ It gives you a count and lets you step through each Doc looking at their fields.

RE: How to get document count?

2005-02-01 Thread Ravi
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#docCount() You can try this.

RE: which HTML parser is better?

2005-02-01 Thread Chuck Williams
I think that depends on what you want to do. The Lucene demo parser does simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the same API; will likely become part of Xerces), and so maps an HTML

Re: lucene docs in bulk read?

2005-02-01 Thread Chris Fraschetti
Well all my fields are strings when I index them. They're all very short strings, dates, hashes, etc. The largest field has a cap of 256 chars and there is only one of them, the rest are all fairly small. Can you explain what you meant by 'string or reader' ? Thanks, Chris On Tue, 1 Feb 2005

Re: Duplicate Hits

2005-02-01 Thread sergiu gordea
Erik Hatcher wrote: On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote: OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexReader open to the index at the same time I have an

Re: How to get document count?

2005-02-01 Thread Jim Lynch
That works, thanks. I can't use Luke on this system. It fails for some reason. Jim. Ravi wrote: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#docCount() You can try this.

competition - Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-02-01 Thread David Spencer
I wasn't sure where in this thread to reply so I'm replying to myself :) What search appliances exist now? I only found 3: [1] Google [2] Thunderstone http://www.thunderstone.com/texis/site/pages/Appliance.html [3] IndexEngines (not out yet)

How do I delete?

2005-02-01 Thread Jim Lynch
I've been merrily cooking along, thinking I was replacing documents when I haven't. My logic is to go through a batch of documents, get a field called reference (which is unique), build a term from it, and delete it via the reader.delete() method. Then I close the reader and open a writer and

Re: How do I delete?

2005-02-01 Thread Joseph Ottinger
I've had success with deletion by running IndexReader.delete(int), then getting an IndexWriter and optimizing the directory. I don't know if that's the right way to do it or not. On Tue, 1 Feb 2005, Jim Lynch wrote: I've been merrily cooking along, thinking I was replacing documents when I

Re: lucene docs in bulk read?

2005-02-01 Thread Kelvin Tan
Please see inline. On Tue, 1 Feb 2005 09:27:26 -0800, Chris Fraschetti wrote: Well all my fields are strings when I index them. They're all very short strings, dates, hashes, etc. The largest field has a cap of 256 chars and there is only one of them, the rest are all fairly small. Can you

Re: How do I delete?

2005-02-01 Thread Jim Lynch
Thanks, I'd try that, but I don't think it will make any difference. If I modify the code to not reindex the documents, no files in the index directory are touched, hence there is no record of the deletions anywhere. I checked the count coming back from the delete operation and it is zero.

Re: How do I delete?

2005-02-01 Thread Joseph Ottinger
Well, in LuceneRAR, the delete-by-id code does exactly what I said: gets the IndexReader, deletes the doc id, then it opens a writer and optimizes. Nothing else. On Tue, 1 Feb 2005, Jim Lynch wrote: Thanks, I'd try that, but I don't think it will make any difference. If I modify the code to

Re: lucene docs in bulk read?

2005-02-01 Thread Chris Fraschetti
Definitely a good idea on the one line idea... that could possibly save a good amount of time. I'm using .stringValue ... in reality, I hadn't ever even considered readerValue ... is there a strong performance difference between the two? or is it simply on the functionality side? The basic post

Combining Documents

2005-02-01 Thread Luke Shannon
Hello; I have a situation where I need to combine the fields returned from one document to an existing document. Is there something in the API for this that I'm missing or is this the best way: //add the fields contained in the PDF document to the existing doc Document Document attachedDoc =

Re: lucene docs in bulk read?

2005-02-01 Thread Kelvin Tan
On Tue, 1 Feb 2005 14:12:54 -0800, Chris Fraschetti wrote: Definitely a good idea on the one line idea... that could possibly save a good amount of time. I'm using .stringValue ... in reality, I hadn't ever even considered readerValue ... is there a strong performance difference between the

Query Format

2005-02-01 Thread Hetan Shah
Hello All, What should my query look like if I want to search for all or any of the following keywords: Sun Linux Red Hat Advance Server. Replies are much appreciated. -H

Results

2005-02-01 Thread Hetan Shah
Another question for the day: how can I make sure that the results shown are only the ones containing the keywords specified? E.g., the query Red AND HAT AND Linux should return documents that have all three keywords, and not show documents that have only one or two keywords.

Re: Query Format

2005-02-01 Thread Erik Hatcher
How are you indexing your document? If you're using QueryParser with the default operator set to OR (which is the default), then you've already provided the expression you need :) Erik On Feb 1, 2005, at 6:29 PM, Hetan Shah wrote: Hello All, What should my query look like if I want to

Re: Results

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 7:36 PM, Hetan Shah wrote: Another question for the day: how can I make sure that the results shown are only the ones containing the keywords specified? E.g., the query Red AND HAT AND Linux should return documents that have all three keywords, and not show
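
The difference between the two questions in these threads ("all or any of the keywords" versus "only documents with all keywords") comes down to OR versus AND matching. The toy matcher below illustrates just that distinction, assuming whitespace tokenization and case-insensitive comparison; `TermMatcher` is a hypothetical class and does not reproduce QueryParser or any analyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of AND versus OR term matching.
final class TermMatcher {
    // Lowercases the text and splits it into a set of whitespace-separated tokens.
    static Set<String> tokens(String text) {
        String normalized = text.toLowerCase(Locale.ROOT).trim();
        return new HashSet<>(Arrays.asList(normalized.split("\\s+")));
    }

    // AND semantics: every term must occur in the text.
    static boolean matchesAll(String text, String... terms) {
        Set<String> present = tokens(text);
        for (String term : terms) {
            if (!present.contains(term.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    // OR semantics: at least one term must occur in the text.
    static boolean matchesAny(String text, String... terms) {
        Set<String> present = tokens(text);
        for (String term : terms) {
            if (present.contains(term.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

With these definitions, a text containing only "Sun Linux" satisfies `matchesAny` for the terms Red, HAT, Linux but not `matchesAll`.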

Re: Re-Indexing a moving target???

2005-02-01 Thread Nader Henein
details? Yousef Ourabi wrote: Saad, Here is what I got. I will post again, and be more specific. -Y --- Nader Henein [EMAIL PROTECTED] wrote: We'll need a little more detail to help you, what are the sizes of your updates and how often are they updated. 1) No just re-open the index writer

Re: User Rights Management in Lucene

2005-02-01 Thread Chandrashekhar
Hi, If you are working on some CMS or similar app and want to have a user rights module, then you can use metadata for the rights information, add this metadata to the index, and then search on this metadata. With Regards, Chandrashekhar V Deshmukh
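
One way to picture the metadata suggestion: each document carries the set of users allowed to see it, and results are restricted to documents whose set contains the current user. The sketch below applies that restriction to already-retrieved hits in plain Java; `SecuredDoc`, `RightsFilter`, and the user names are hypothetical. In an index, the same effect would presumably come from adding the rights field as a required clause of the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical document with rights metadata attached.
final class SecuredDoc {
    final String name;
    final Set<String> allowedUsers;

    SecuredDoc(String name, String... allowedUsers) {
        this.name = name;
        this.allowedUsers = new HashSet<>(Arrays.asList(allowedUsers));
    }
}

final class RightsFilter {
    // Returns the names of the hits the given user is allowed to see,
    // preserving the original result order.
    static List<String> visibleTo(String user, List<SecuredDoc> hits) {
        List<String> visible = new ArrayList<>();
        for (SecuredDoc doc : hits) {
            if (doc.allowedUsers.contains(user)) {
                visible.add(doc.name);
            }
        }
        return visible;
    }
}
```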

when indexing, java.io.FileNotFoundException

2005-02-01 Thread Chris Lu
Hi, I am getting this exception now and then when I am indexing content. It doesn't always happen, but when it does, I have to delete the index and start over again. This is a serious problem. In this email, Doug was saying it has something to do with win32's lack of atomic renaming.

Re: How do I delete?

2005-02-01 Thread Chris Hostetter
: anywhere. I checked the count coming back from the delete operation and : it is zero. I even tried to delete another unique term with similar : results. First off, are you absolutely certain you are closing the reader? it's not in the code you listed. Second, I'd bet $1 that when your

enquiries - pls help, thanks

2005-02-01 Thread jac jac
Hi, may I know whether Lucene currently supports indexing of XML documents? I tried building an index of all my directories in webapps via: java org.apache.lucene.demo.IndexFiles /homedir/tomcat/webapps then I tried using the following command to search: java