which HTML parser is better?
Three HTML parsers (the Lucene web application demo, CyberNeko HTML Parser, and JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter the tags that are auto-created by MS Word's 'Save As HTML' function?

_ Do You Yahoo!? 150MP3 http://music.yisou.com/ http://image.yisou.com 1G1000 http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Can I sort search results by score and docID at one time?
Lucene supports sorting by score or by docID. Now I want to sort search results by score and docID, or by two fields at one time, like the SQL clause ORDER BY score, docID. How can I do it?
lucene docs in bulk read?
Hey folks, thanks in advance to any who respond... I do a good deal of post-search processing, and the file I/O to read the fields I need becomes horribly costly and is definitely a problem. Is there any way to retrieve 1. the entire doc (all fields that can be retrieved) and/or 2. a group of docs, specified by, say, an array of doc IDs?

I've optimized to retrieve the entire list of fields instead of one by one, and also to retrieve only the minimal number of fields that I can, but my profilers still show me that the Lucene I/O to read the doc fields is where I spend 95% of my time. Of course this is obvious given the nature of how it all works, but can anyone think of a better way to go about retrieving docs in bulk? Are different types of fields quicker or slower than others when retrieving them from the index?

-- ___ Chris Fraschetti e [EMAIL PROTECTED]
Re: which HTML parser is better?
Jingkang Zhang wrote: Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML' function?

Maybe you can try this library: http://htmlparser.sourceforge.net/

I use the following code to get the text from HTML files. It was not intensively tested, but it works.

    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.util.NodeIterator;
    import org.htmlparser.util.Translate;

    Parser parser = new Parser(source.getAbsolutePath());
    NodeIterator iter = parser.elements();
    while (iter.hasMoreNodes()) {
        Node element = (Node) iter.nextNode();
        //System.out.println("1: " + element.getText());
        String text = Translate.decode(element.toPlainTextString());
        if (Utils.notEmptyString(text))
            writer.write(text);
    }

Sergiu
Source code for an accent-removal filter
Hi. In December I made some posts concerning a filter that could work by getting the Unicode name of a character and trying to figure out the closest Latin equivalent. For example, if it encountered character U+00C1 LATIN CAPITAL LETTER A WITH ACUTE, it would be clever enough to replace that with a regular 'A'. I got moved onto another project for a while, so I've not looked at the problem much since then. I'm back on it for a few days now though :)

The following Perl program generates some Java source for a filter that carries out the above task. Get 'UnicodeData.txt' from www.unicode.org, and then do the following:

    perl make_accent_filter.pl make.this.java.Class UnicodeData.txt

to generate make/this/java/Class.java. This comes with no license and no warranty ;) Do not think this is the full solution to your Unicode-mangling problems. I'm using it as a last-resort catch-all after some other filters that use the IBM ICU4J library to do all sorts of decomposition and character-category magic. Once I get it all working I should be able to post some pointers and code snippets up here.

Peter

---8<---
# usage: perl make_accent_filter.pl my.full.ClassName UnicodeData.txt
#
# creates my/full/ClassName.java

use strict;
use warnings;
use File::Path;
use File::Basename;

# decompose the classname that they gave us.
#
# TODO: this doesn't work if the classname has no dots (i.e. it's not in a
# package)
my $full_class = shift;
my @parts = $full_class =~ /^(.*)\.(.*)$/;
my $package = shift @parts;
my $class = shift @parts;

# print to the correct place
my $path = $full_class;
$path =~ s/\./\//g;
$path = "$path.java";
mkpath dirname $path;
open STDOUT, ">$path" or die "Could not redirect stdout";

print <<END_JAVA;
// THIS FILE WAS AUTOGENERATED BY make_accent_filter.pl, DO NOT EDIT BY HAND.
package $package;

import org.apache.lucene.analysis.*;
import java.io.*;
import java.util.*;

public class $class extends TokenFilter {

    public $class (TokenStream input) {
        super (input);
        createHash();
    }

    // The replacement character, indexed by unicode value.
    // (i.e. Character objects indexed by Integer objects)
    private static Hashtable values = null;

    // Creates a Hashtable from the array at the bottom of this file.
    private void createHash () {
        // only run this for the first object of this class
        if (values != null) return;
        values = new Hashtable ();
        int i = 0;
        while (true) {
            if (array[i] == null) break; // 'array' is null terminated.
            Object number = array[i++];
            Object replacement = array[i++];
            values.put (number, replacement);
        }
        // we're done with 'array', it can be garbage collected
        array = null;
    }

    public Token next () throws IOException {
        Token t = input.next ();
        if (t == null) return null; // eof
        String s = t.termText();
        s = substituteAZString (s);
        return new Token (s, t.startOffset(), t.endOffset());
    }

    private String substituteAZString (String s) {
        char [] current = s.toCharArray ();
        char [] AZ = new char [current.length];
        int AZi = 0;
        for (int i = 0; i < current.length; i++) {
            AZ[AZi++] = substituteAZChar (current[i]);
        }
        s = new String (AZ);
        return s;
    }

    private char substituteAZChar (char c) {
        Integer key = new Integer ((int) c);
        if (values.containsKey(key)) {
            c = ((Character)values.get(key)).charValue();
        }
        return c;
    }

    private static Object [] array = {
END_JAVA

# we only care about characters whose names are of the form:
my $latin_pattern = 'LATIN (.*) LETTER (.)( .*)$';

while (<>) {
    my @parts = split /;/;
    my $num = shift @parts;
    my $name = shift @parts;
    my @matches;
    if (@matches = ($name =~ $latin_pattern)) {
        my $case = shift @matches;
        my $convert_to_lc = $case eq 'SMALL';
        my $letter = shift @matches;
        $letter = lc $letter if $convert_to_lc;
        printf "new Integer (0x%s), new Character ('%s'), // %s\n",
            $num, $letter, $name;
    }
}

print <<END_JAVA;
        null
    };
}
END_JAVA
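For readers on a later JDK (an assumption: java.text.Normalizer only appeared in Java 6, which is part of why ICU4J was the tool of choice here), much of this table-driven substitution can be had from Unicode decomposition directly. A minimal sketch, not a replacement for the generator above:

```java
import java.text.Normalizer;

// Decompose to NFD, then strip combining marks (\p{M}); this maps most
// accented Latin letters to their base letter without a lookup table.
public class AccentStripper {
    public static String strip(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("\u00C1ngstr\u00F6m")); // prints: Angstrom
    }
}
```

Note that characters with no canonical decomposition (e.g. U+00D8 LATIN CAPITAL LETTER O WITH STROKE) pass through unchanged, so a generated table like Peter's is still useful as a catch-all behind it.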
Adding Fields to Document (with same name)
Hi, what happens when I add two fields with the same name to one Document?

    Document doc = new Document();
    doc.add(Field.Text("bla", "this is my first text"));
    doc.add(Field.Text("bla", "this is my second text"));

Will the second text overwrite the first, because only one field can be held with the same name in one document? Or will the first and the second text be merged when I search in the field bla (e.g. with the query bla:text)? I am working on XML indexing and did not get an error when having repeated XML fields. Now I am wondering...

Karl

-- Sparen beginnt mit GMX DSL: http://www.gmx.net/de/go/dsl
Re: Adding Fields to Document (with same name)
Hi Karl,

From _Lucene in Action_, section 2.2: when you add the same field with different values, "Internally, Lucene appends all the words together and indexes them in a single Field ..., allowing you to use any of the given words when searching."

See also http://www.lucenebook.com/search?query=appendable+fields

-chris

On Tue, 1 Feb 2005 11:42:23 +0100 (MET), [EMAIL PROTECTED] wrote: Hi, what happens when I add two fields with the same name to one Document? Document doc = new Document(); doc.add(Field.Text("bla", "this is my first text")); doc.add(Field.Text("bla", "this is my second text")); Will the second text overwrite the first, because only one field can be held with the same name in one document? Will the first and the second text be merged when I search in the field bla (e.g. with the query bla:text)? I am working on XML indexing and did not get an error when having repeated XML fields. Now I am wondering... Karl
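A toy illustration of these append semantics (plain Java; this only mimics the behavior described in LIA, it is not Lucene code, and the class and method names are made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Models a document whose repeated field values are appended rather than
// overwritten: a search over the field matches words from ANY add() call.
public class RepeatedFieldDemo {
    private final Map<String, List<String>> fields = new HashMap<String, List<String>>();

    public void add(String name, String value) {
        List<String> values = fields.get(name);
        if (values == null) {
            values = new ArrayList<String>();
            fields.put(name, values);
        }
        values.add(value);
    }

    // True if any value added under 'name' contains the given word.
    public boolean matches(String name, String word) {
        List<String> values = fields.get(name);
        if (values == null) return false;
        for (String v : values) {
            if (Arrays.asList(v.split("\\s+")).contains(word)) return true;
        }
        return false;
    }
}
```

With Karl's two add() calls, both bla:first and bla:second would match the same document; neither value is lost.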
Re: Can I sort search results by score and docID at one time?
On Feb 1, 2005, at 4:21 AM, Jingkang Zhang wrote: Lucene supports sorting by score or docID. Now I want to sort search results by score and docID, or by two fields at one time, like the SQL command ORDER BY score, docID. How can I do it?

Sorting by multiple fields (including score and document ID) is supported. Here's an example:

    new Sort(new SortField[]{
        new SortField("category"),
        SortField.FIELD_SCORE,
        new SortField("pubmonth", SortField.INT, true)
    })
Re: which HTML parser is better?
When I tested parsers a year or so ago for intensive use in Furl, the best (most tolerant of bad HTML) and fastest (tested on a 1.5M HTML page) parser by far was TagSoup (http://www.tagsoup.info). It is actively maintained and improved, and I have never had any problems with it.

-Mike

Jingkang Zhang wrote: Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML' function?
Duplicate Hits
Is there a way to eliminate duplicate hits being returned from the index?

Jerry Jalenak Senior Programmer / Analyst, Web Publishing LabOne, Inc. 10101 Renner Blvd. Lenexa, KS 66219 (913) 577-1496 [EMAIL PROTECTED]

This transmission (and any information attached to it) may be confidential and is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient or the person responsible for delivering the transmission to the intended recipient, be advised that you have received this transmission in error and that any use, dissemination, forwarding, printing, or copying of this information is strictly prohibited. If you have received this transmission in error, please immediately notify LabOne at the following email address: [EMAIL PROTECTED]
Re: lucene docs in bulk read?
Hi Chris, are your fields String or Reader? How large do your fields get?

Kelvin

On Tue, 1 Feb 2005 01:40:39 -0800, Chris Fraschetti wrote: Hey folks, thanks in advance to any who respond... I do a good deal of post-search processing, and the file I/O to read the fields I need becomes horribly costly and is definitely a problem. Is there any way to retrieve 1. the entire doc (all fields that can be retrieved) and/or 2. a group of docs, specified by, say, an array of doc IDs? I've optimized to retrieve the entire list of fields instead of one by one, and also to retrieve only the minimal number of fields that I can, but my profilers still show me that the Lucene I/O to read the doc fields is where I spend 95% of my time. Of course this is obvious given the nature of how it all works, but can anyone think of a better way to go about retrieving docs in bulk? Are different types of fields quicker or slower than others when retrieving them from the index?
Re: Duplicate Hits
On Feb 1, 2005, at 9:01 AM, Jerry Jalenak wrote: Is there a way to eliminate duplicate hits being returned from the index?

Sure, don't put duplicate documents in the index :)

Erik
RE: Duplicate Hits
OK, OK, I should have seen that response coming 8-)

The documents I'm indexing are sent from a legacy system, and can be sent multiple times - but I only want to keep a document if something has changed. If the indexed fields match exactly, I don't want to index the second (or third, fourth, etc.) document. If the indexed fields have changed, then I want to index the 'new' document and keep it. Given Erik's response of 'don't put duplicate documents in the index', how can I accomplish this in the IndexWriter?

Jerry Jalenak Senior Programmer / Analyst, Web Publishing LabOne, Inc. 10101 Renner Blvd. Lenexa, KS 66219 (913) 577-1496 [EMAIL PROTECTED]

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 8:35 AM To: Lucene Users List Subject: Re: Duplicate Hits

On Feb 1, 2005, at 9:01 AM, Jerry Jalenak wrote: Is there a way to eliminate duplicate hits being returned from the index? Sure, don't put duplicate documents in the index :) Erik
User Rights Management in Lucene
Hi, I'm new to Lucene and want to know whether Lucene has the capability of filtering search results based on a user's rights.

For example: suppose there are some resources, like Resource 1, Resource 2, Resource 3, and Resource 4, and there are, say, 2 users, with User 1 having access to Resource 1, Resource 2 and Resource 4, and User 2 having access to Resource 1 and Resource 3. So when User 1 searches the database he should get results from Resources 1, 2 and 4, but when User 2 searches the database he should get results from Resources 1 and 3.

Regards
Atul Verma
Re: User Rights Management in Lucene
On Feb 01, 2005, at 16:01, Verma Atul (extern) wrote: I'm new to Lucene and want to know, whether Lucene has the capability of displaying the search results based the Users Rights.

Not by itself. But you can make it so.

Cheers

-- PA, Onnay Equitursay http://alt.textdrive.com/
Re: Duplicate Hits
Jerry Jalenak wrote: Given Erik's response of 'don't put duplicate documents in the index', how can I accomplish this in the IndexWriter?

I was dealing with a similar requirement recently. I eventually decided on storing the MD5 checksum of the document as a keyword. It means reading it twice (once to calculate the checksum, once to index it), but it seems to do the trick.

jch
RE: User Rights Management in Lucene
Thanks for the help. This means that the user management has to be done on top of Lucene.

-Original Message- From: PA [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 4:06 PM To: Lucene Users List Subject: Re: User Rights Management in Lucene

On Feb 01, 2005, at 16:01, Verma Atul (extern) wrote: I'm new to Lucene and want to know, whether Lucene has the capability of displaying the search results based the Users Rights.

Not by itself. But you can make it so.

Cheers

-- PA, Onnay Equitursay http://alt.textdrive.com/
RE: Duplicate Hits
Nice idea John - one I hadn't considered. Once you have the checksum, do you 'check' in the index first before storing the second document? Or do you filter on the query side?

Jerry Jalenak Senior Programmer / Analyst, Web Publishing LabOne, Inc. 10101 Renner Blvd. Lenexa, KS 66219 (913) 577-1496 [EMAIL PROTECTED]

-Original Message- From: John Haxby [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 9:06 AM To: Lucene Users List Subject: Re: Duplicate Hits

Jerry Jalenak wrote: Given Erik's response of 'don't put duplicate documents in the index', how can I accomplish this in the IndexWriter?

I was dealing with a similar requirement recently. I eventually decided on storing the MD5 checksum of the document as a keyword. It means reading it twice (once to calculate the checksum, once to index it), but it seems to do the trick.

jch
Re: User Rights Management in Lucene
On Feb 01, 2005, at 16:07, Verma Atul (extern) wrote: Thanks for the help. This means that the User management has to be done over Lucene.

Your choice. But in a nutshell, yes.

Cheers

-- PA, Onnay Equitursay http://alt.textdrive.com/
Re: Duplicate Hits
On Feb 1, 2005, at 9:49 AM, Jerry Jalenak wrote: Given Erik's response of 'don't put duplicate documents in the index', how can I accomplish this in the IndexWriter?

As John said - you'll have to come up with some way of knowing whether you should index or not. For example, when dealing with filesystem files, the Ant index task (in the sandbox) checks last modified date and only indexes new files. Using a unique id on your data (primary key from a DB, URL from web pages, etc) is generally what people use for this.

Erik
Re: User Rights Management in Lucene
On Feb 1, 2005, at 10:01 AM, Verma Atul (extern) wrote: Hi, I'm new to Lucene and want to know whether Lucene has the capability of displaying search results based on a user's rights. [...]

Lucene in Action has a SecurityFilterTest example (grab the source code distribution). You can see a glimpse of this here: http://www.lucenebook.com/search?query=security

So yes, it's possible to index a username or roles alongside each document and apply that criteria to any search a user makes, such that a user only gets the documents allowed. How complex this gets depends on how you need the permissions to work - the LIA example is rudimentary and simply associates an owner with each document; users are only allowed to see the documents they own.

Erik
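Stripped of Lucene specifics, the access check itself is just set membership. A plain-Java sketch of the idea (class and method names here are hypothetical; note the LIA example applies the constraint as a filter at query time rather than post-filtering hits, which scales better):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Keeps only the hit resources the given user is allowed to see,
// preserving the original hit order.
public class OwnerFilterDemo {
    public static List<String> visibleTo(String user,
                                         Map<String, Set<String>> resourceAccess,
                                         List<String> hits) {
        List<String> allowed = new ArrayList<String>();
        for (String resource : hits) {
            Set<String> users = resourceAccess.get(resource);
            if (users != null && users.contains(user)) allowed.add(resource);
        }
        return allowed;
    }
}
```

With Atul's example data, User 1 sees Resources 1, 2 and 4, and User 2 sees Resources 1 and 3.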
Re: Duplicate Hits
Jerry Jalenak wrote: Nice idea John - one I hadn't considered. Once you have the checksum, do you 'check' in the index first before storing the second document? Or do you filter on the query side?

I do a quick search for the MD5 checksum before indexing. Although I suspect it's not applicable in your case, I also maintained a 'last time something was indexed' timestamp alongside the index. I used this to drastically prune the number of documents that needed to be considered for indexing if I restarted; anything modified before then wasn't a candidate. Since the MD5 checksum provides the definitive (for a sufficiently loose definition of 'definitive') indication of whether a document is indexed, I didn't need to worry about ultra-fine granularity in the timestamp, and I didn't need to worry about it being committed to disk; it generally got committed to the magnetic stuff every few seconds or so.

It does help a lot, though, if documents have nice unique identifiers that you can use instead; then you can use the identifier and the last-modified time to decide whether or not to re-index.

jch
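For reference, the checksum John stores is cheap to compute with the JDK alone. A sketch (how you store the hex string as a keyword field and search for it before each add is as John describes, and is not shown here):

```java
import java.security.MessageDigest;

// Computes the MD5 checksum of a document's text as a lowercase hex
// string, suitable for storing as a keyword and searching before indexing.
public class Md5Keyword {
    public static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes("UTF-8"));
            StringBuffer hex = new StringBuffer();
            for (int i = 0; i < digest.length; i++) {
                String h = Integer.toHexString(digest[i] & 0xff);
                if (h.length() == 1) hex.append('0'); // pad single digits
                hex.append(h);
            }
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e); // MD5 and UTF-8 are always available
        }
    }
}
```

Two documents with identical indexed fields produce the same hex string, so a single-term search tells you whether the document is already there.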
RE: Duplicate Hits
Just to make sure I understand: do you keep an IndexReader open at the same time you are running the IndexWriter? From what I can see in the JavaDocs, it looks like only an IndexReader (or IndexSearcher) can peek into the index and see whether a document exists or not.

Thanks!

Jerry Jalenak Senior Programmer / Analyst, Web Publishing LabOne, Inc. 10101 Renner Blvd. Lenexa, KS 66219 (913) 577-1496 [EMAIL PROTECTED]

-Original Message- From: John Haxby [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 9:39 AM To: Lucene Users List Subject: Re: Duplicate Hits

I do a quick search for the MD5 checksum before indexing. Although I suspect it's not applicable in your case, I also maintained a 'last time something was indexed' timestamp alongside the index. I used this to drastically prune the number of documents that needed to be considered for indexing if I restarted; anything modified before then wasn't a candidate. [...]

jch
RE: Duplicate Hits
OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexReader open on the index at the same time I have an IndexWriter open. I may have to try to deal with this issue through some sort of filter on the query side, provided it doesn't impact performance too much. Thanks.

Jerry Jalenak Senior Programmer / Analyst, Web Publishing LabOne, Inc. 10101 Renner Blvd. Lenexa, KS 66219 (913) 577-1496 [EMAIL PROTECTED]

-Original Message- From: John Haxby [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 9:48 AM To: Lucene Users List Subject: Re: Duplicate Hits

I slightly misled you: it wasn't Lucene that I was using at the time, and in that system the distinction between IndexReader and IndexWriter didn't exist. I'm just getting to grips with Lucene really, but it would seem to be possible to use a similar scheme, especially if you batch up your documents for indexing: as they come in, check the MD5 checksum against what's already known and what's already queued, and then when the time comes to process the queue you know what needs to be indexed.

jch
Re: Duplicate Hits
Jerry Jalenak wrote: OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexReader open on the index at the same time I have an IndexWriter open. I may have to try to deal with this issue through some sort of filter on the query side, provided it doesn't impact performance too much.

I was thinking of indexing in batches of a few documents (10? 100? 1000?), which means flipping between IndexReaders and IndexWriters wouldn't be too onerous.

jch
IndexSearcher close
Is there a way to check if an IndexSearcher is closed? Thanks in advance, Ravi.
Re: Duplicate Hits
On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote: OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexReader open on the index at the same time I have an IndexWriter open. [...]

You can use an IndexReader and IndexWriter at the same time (the caveat is that you cannot delete with the IndexReader at the same time you're writing with an IndexWriter). Is there no other identifying information on the incoming documents, though - a date stamp? An identifier? Or something unique you can go on?

Erik
How to get document count?
I've indexed a large set of documents and think that something may have gone wrong somewhere in the middle. Is there a way I can display the count of documents in the index? Thanks, Jim.
Re: How to get document count?
Not sure if the API provides a method for this, but you could use Luke: http://www.getopt.org/luke/ It gives you a count and lets you step through each Doc looking at their fields.

- Original Message - From: Jim Lynch To: Lucene Users List lucene-user@jakarta.apache.org Sent: Tuesday, February 01, 2005 11:28 AM Subject: How to get document count?

I've indexed a large set of documents and think that something may have gone wrong somewhere in the middle. Is there a way I can display the count of documents in the index? Thanks, Jim.
RE: How to get document count?
You can try this:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#docCount()

-Original Message- From: Luke Shannon [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 11:33 AM To: Lucene Users List Subject: Re: How to get document count?

Not sure if the API provides a method for this, but you could use Luke: http://www.getopt.org/luke/ It gives you a count and lets you step through each Doc looking at their fields. [...]
RE: which HTML parser is better?
I think that depends on what you want to do. The Lucene demo parser does a simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (it uses the same API and will likely become part of Xerces), and so maps an HTML document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at an API level and so don't know it as well - based on its UI, it appears to be focused primarily on HTML validation and error detection/correction.

I use CyberNeko for a range of operations on HTML documents that go beyond indexing them in Lucene, and I really like it. It has been robust for me so far.

Chuck

-Original Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better?

Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML' function?
Re: lucene docs in bulk read?
Well, all my fields are strings when I index them. They're all very short strings: dates, hashes, etc. The largest field has a cap of 256 chars, and there is only one of them; the rest are all fairly small. Can you explain what you meant by 'string or reader'?

Thanks, Chris

On Tue, 1 Feb 2005 15:11:18 +0100, Kelvin Tan wrote:

Hi Chris, are your fields string or reader? How large do your fields get?

Kelvin

On Tue, 1 Feb 2005 01:40:39 -0800, Chris Fraschetti wrote:

Hey folks, thanks in advance to any who respond. I do a good deal of post-search processing, and the file I/O to read the fields I need becomes horribly costly and is definitely a problem. Is there any way to retrieve either 1. the entire doc (all fields that can be retrieved) and/or 2. a group of docs, specified by, say, an array of doc ids? I've optimized to retrieve the entire list of fields instead of one by one, and also to retrieve only the minimal number of fields that I can, but my profilers still show me that the Lucene I/O to read the doc fields is where I spend 95% of my time. Of course this is obvious given the nature of how it all works, but can anyone think of a better way to go about retrieving docs in bulk? Are some types of fields quicker or slower than others when retrieving them from the index?
Re: Duplicate Hits
Erik Hatcher wrote:

On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote:

OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexReader open on the index at the same time I have an IndexWriter open. I may have to try to deal with this issue through some sort of filter on the query side, provided it doesn't impact performance too much.

You can use an IndexReader and IndexWriter at the same time (the caveat is that you cannot delete with the IndexReader at the same time you're writing with an IndexWriter). Is there no other identifying information, though, on the incoming documents? A date stamp? An identifier? Or something unique you can go on?

Erik

As Erik suggested earlier, I think that keeping the information in the database and identifying the new entries at the database level is a better approach. Indexing documents and optimizing an index that big will be very time consuming. Also, consider that in the future you may want to modify the structure of your index. Think how much effort it would be to split some fields into a few smaller parts, or just to change the format of a field -- say you have a date in DDMMYY format and you need to change it to MMDD. And consider how much effort is needed to rebuild a completely new index from the database. Of course, your requirements may not call for having the information stored in a database, and it is up to you whether to use a DB + a Lucene index, or just a Lucene index.

Best, Sergiu
Re: How to get document count?
That works, thanks. I can't use Luke on this system; it fails for some reason.

Jim.

Ravi wrote:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#docCount()

You can try this.
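As a complement to the docCount() link above, a small sketch (mine, not from the thread; Lucene-era API) that reports the count from an IndexReader, which also distinguishes live from deleted documents:

```java
import org.apache.lucene.index.IndexReader;

public class CountDocs {
    public static void main(String[] args) throws Exception {
        // args[0] is the path to the index directory
        IndexReader reader = IndexReader.open(args[0]);
        // numDocs() excludes deleted-but-not-yet-merged documents;
        // maxDoc() is an upper bound that includes them
        System.out.println("live docs: " + reader.numDocs());
        System.out.println("maxDoc:    " + reader.maxDoc());
        reader.close();
    }
}
```

Comparing the two numbers is a quick way to see whether deletions have actually been recorded in the index.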
competition - Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there
I wasn't sure where in this thread to reply, so I'm replying to myself :) What search appliances exist now? I only found 3:

[1] Google
[2] Thunderstone http://www.thunderstone.com/texis/site/pages/Appliance.html
[3] Index Engines (not out yet) http://www.indexengines.com/

Also, out of curiosity, do people have appliance h/w vendors they like? These guys seem like they have nice options for pretty colors:
http://www.mbx.com/oem/index.cfm
http://www.mbx.com/oem/options/

David Spencer wrote:

This reminds me, has anyone ever discussed something similar:
- rackmount server (or, for coolness factor, that Mac mini)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch

Part of the work here, I think, is having a decent web i/f to configure the thing and to customize the look and feel of the search results.

jian chen wrote:

Hi, I was searching using Google and just found that there is a new product called Google Mini. Initially I thought it was another free service for small companies. Then I realized that it costs quite some money ($4,995) for the hardware and software. (I guess the proprietary software costs a whole lot more than the actual hardware.) The "nice" feature is that you can only index up to 50,000 documents at this price. If you need to index more, sorry, send in the check... It seems to me that any small biz would be ripped off if they installed this Google Mini thing, compared to using Lucene to implement an easy-to-use search application, which could search however many documents you can imagine. I hope the Lucene project gets more exposure in the enterprise, so that people know they have not only cheaper but, more importantly, BETTER alternatives.
Jian
How do I delete?
I've been merrily cooking along, thinking I was replacing documents when I haven't. My logic is to go through a batch of documents, get a field called "reference" (which is unique), build a term from it, and delete it via the reader.delete() method. Then I close the reader, open a writer, and reprocess the batch, indexing all. Here is the delete and associated code:

reader = IndexReader.open(database);
Term t = new Term("reference", reference);
try {
    reader.delete(t);
} catch (Exception e) {
    System.out.println("Delete exception: " + e);
}

except it isn't working. I tried to do a commit and a doCommit, but those are both protected. I do a reader.close() after processing the batch the first time. What am I missing? I don't get an exception. Reference is definitely a valid field, 'cause I print out the value at search time and compare it to the doc, and they are identical.

Thanks, Jim.
Re: How do I delete?
I've had success with deletion by running IndexReader.delete(int), then getting an IndexWriter and optimizing the directory. I don't know if that's the right way to do it or not.

On Tue, 1 Feb 2005, Jim Lynch wrote:

I've been merrily cooking along, thinking I was replacing documents when I haven't. My logic is to go through a batch of documents, get a field called "reference" (which is unique), build a term from it, and delete it via the reader.delete() method. ...

---
Joseph B. Ottinger http://enigmastation.com
IT Consultant [EMAIL PROTECTED]
Re: lucene docs in bulk read?
Please see inline.

On Tue, 1 Feb 2005 09:27:26 -0800, Chris Fraschetti wrote:

Well, all my fields are strings when I index them. They're all very short strings: dates, hashes, etc. The largest field has a cap of 256 chars, and there is only one of them; the rest are all fairly small. Can you explain what you meant by 'string or reader'?

Sorry, I meant to ask whether you're using String fields (field.stringValue()) or reader fields (field.readerValue()). Can you elaborate on the post-processing you need to do? Have you thought about concatenating the fields you require into a single non-indexed field (Field.UnIndexed) for simple retrieval? It'll increase the size of your index, but it should be faster to retrieve them all in one go.

Kelvin
Re: How do I delete?
Thanks, I'd try that, but I don't think it will make any difference. If I modify the code to not reindex the documents, no files in the index directory are touched, hence there is no record of the deletions anywhere. I checked the count coming back from the delete operation, and it is zero. I even tried to delete another unique term, with similar results. How does one call the commit method anyway? Isn't it automatically called?

Jim.

Joseph Ottinger wrote:

I've had success with deletion by running IndexReader.delete(int), then getting an IndexWriter and optimizing the directory. I don't know if that's the right way to do it or not. ...
Re: How do I delete?
Well, in LuceneRAR, the delete-by-id code does exactly what I said: it gets the IndexReader, deletes the doc id, then opens a writer and optimizes. Nothing else.

On Tue, 1 Feb 2005, Jim Lynch wrote:

Thanks, I'd try that, but I don't think it will make any difference. If I modify the code to not reindex the documents, no files in the index directory are touched, hence there is no record of the deletions anywhere. I checked the count coming back from the delete operation, and it is zero. I even tried to delete another unique term, with similar results. How does one call the commit method anyway? Isn't it automatically called? ...

---
Joseph B. Ottinger http://enigmastation.com
IT Consultant [EMAIL PROTECTED]
Re: lucene docs in bulk read?
Definitely a good idea on the single-field suggestion... that could possibly save a good amount of time. I'm using .stringValue(). In reality, I hadn't ever even considered readerValue()... is there a strong performance difference between the two, or is it simply on the functionality side?

The basic post-processing is a grouping of results. Because of the time and space issues of my indexing process, I am unable to efficiently go back and reindex a document if I have found a duplicate (my search engine deals with multiple versions of documents over time), so my post-processing groups results in the top 5000 hits which are the same except for having different dates. But I need to grab some minimal data in order to do this -- the URL of the original page, the date of the doc, etc. -- so that I can use only one doc, and if I find a duplicate I can simply add the new date to the already existing doc. I am only reading a few fields, but on a large scale, over many documents, it hurts my timing quite a bit.

-Chris

On Tue, 1 Feb 2005 21:33:13 +0100, Kelvin Tan wrote:

Sorry, I meant to ask whether you're using String fields (field.stringValue()) or reader fields (field.readerValue()). Can you elaborate on the post-processing you need to do? Have you thought about concatenating the fields you require into a single non-indexed field (Field.UnIndexed) for simple retrieval? It'll increase the size of your index, but it should be faster to retrieve them all in one go.

Kelvin

---
Chris Fraschetti
Combining Documents
Hello; I have a situation where I need to combine the fields returned from one document into an existing document. Is there something in the API for this that I'm missing, or is this the best way:

// add the fields contained in the PDF document to the existing doc
Document attachedDoc = LucenePDFDocument.getDocument(attached);
Enumeration docFields = attachedDoc.fields();
while (docFields.hasMoreElements()) {
    doc.add((Field) docFields.nextElement());
}

Luke
Re: lucene docs in bulk read?
On Tue, 1 Feb 2005 14:12:54 -0800, Chris Fraschetti wrote:

Definitely a good idea on the single-field suggestion... I'm using .stringValue(). In reality, I hadn't ever even considered readerValue()... is there a strong performance difference between the two, or is it simply on the functionality side?

Not that I'm aware of (performance-wise). Reader fields are useful when reading in bulky data that doesn't make sense to load into memory as a String.

K
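Kelvin's concatenation suggestion might look like the sketch below (mine, not from the thread; the field names url/date/hash and the "meta" field are made up for illustration):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BulkFields {
    // Index time: alongside the individual indexed fields, store one
    // concatenated, unindexed copy of everything post-processing needs,
    // so search time costs a single stored-field read instead of several.
    static void addMeta(Document doc, String url, String date, String hash) {
        doc.add(Field.UnIndexed("meta", url + "\t" + date + "\t" + hash));
    }

    // Search time: one read, then split on the separator.
    static String[] readMeta(Document doc) {
        return doc.get("meta").split("\t");
    }
}
```

The separator should be a character that cannot occur in the field values; otherwise the split positions will drift.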
Query Format
Hello All, What should my query look like if I want to search for all or any of the following keywords?

Sun
Linux
Red Hat Advance Server

Replies are much appreciated. -H
Results
Another question for the day: how do I make sure that the results shown are the only ones containing the keywords specified? E.g., the query Red AND HAT AND Linux should return documents which have all three keywords, and not show documents that have only one or two of the keywords. Any hints? Thanks.
Re: Query Format
How are you indexing your documents? If you're using QueryParser with the default operator set to OR (which is the default), then you've already provided the expression you need :)

Erik

On Feb 1, 2005, at 6:29 PM, Hetan Shah wrote:

Hello All, What should my query look like if I want to search for all or any of the following keywords? Sun Linux Red Hat Advance Server
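To make that concrete, a sketch (mine; the field name "contents" is an assumption) of parsing the query with QueryParser, whose default operator is OR:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class OrQueryDemo {
    public static void main(String[] args) throws Exception {
        // With OR as the default operator, a document matching any term
        // (or the quoted phrase) is a hit; documents matching more of
        // them simply score higher.
        Query q = QueryParser.parse(
                "Sun Linux \"Red Hat Advance Server\"",
                "contents", new StandardAnalyzer());
        System.out.println(q.toString("contents"));
    }
}
```

Requiring all of the terms instead is just a matter of writing AND between them, as discussed in the "Results" thread below.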
Re: Results
On Feb 1, 2005, at 7:36 PM, Hetan Shah wrote:

Another question for the day: how do I make sure that the results shown are the only ones containing the keywords specified? ...

Huh? You would never get documents returned that had only two of those terms, given that AND'd query.

Erik
Re: Re-Indexing a moving target???
details?

Yousef Ourabi wrote:

Saad, here is what I got. I will post again and be more specific. -Y

--- Nader Henein [EMAIL PROTECTED] wrote:

We'll need a little more detail to help you: what are the sizes of your updates, and how often are they updated?

1) No, just re-open the IndexWriter every time you re-index. According to you it's a moderately changing index, so just keep a flag on the rows and batch the indexing every so often.
2) It all comes down to your needs; more detail would help us help you.

Nader Henein

Yousef Ourabi wrote:

Hey, we are using Lucene to index a moderately changing database, and I have a couple of questions on a performance strategy. 1) Should we just keep one IndexWriter open until the system comes down, or create a new IndexWriter each time we re-index our data set? 2) Does anyone have any thoughts on multi-threading and multiple segments instead of one index? Thanks for your time and help.

Best, Yousef

---
Nader S. Henein
Senior Applications Developer
Bayt.com
Re: User Rights Management in Lucene
Hi, if you're working on a CMS or similar app and want a user-rights module, you can store the rights information as metadata, add that metadata to the index, and then search on it.

With Regards, Chandrashekhar V Deshmukh

- Original Message -
From: Verma Atul (extern) [EMAIL PROTECTED]
To: lucene-user@jakarta.apache.org
Sent: Tuesday, February 01, 2005 8:31 PM
Subject: User Rights Management in Lucene

Hi, I'm new to Lucene and want to know whether Lucene has the capability of filtering search results based on a user's rights. For example, suppose there are some resources:

Resource 1
Resource 2
Resource 3
Resource 4

and there are, say, 2 users, with User 1 having access to Resources 1, 2 and 4, and User 2 having access to Resources 1 and 3. So when User 1 searches the database, he should get results from Resources 1, 2 and 4, but when User 2 searches the database, he should get results from Resources 1 and 3.

Regards, Atul Verma
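One common way to apply the metadata suggestion at query time is to AND the user's query with a term on the rights field. A sketch (mine; the "acl" field name is hypothetical, and the old three-argument BooleanQuery.add signature is assumed):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class RightsFilter {
    // At index time each document would carry something like
    // Field.Keyword("acl", resourceId) for every principal allowed to see it.
    static Query restrict(Query userQuery, String allowedResource) {
        BooleanQuery q = new BooleanQuery();
        q.add(userQuery, true, false);                  // required clause
        q.add(new TermQuery(new Term("acl", allowedResource)),
              true, false);                             // required: rights must match
        return q;
    }
}
```

For a user with access to several resources, the acl clause would itself be a BooleanQuery OR-ing one TermQuery per resource the user may see.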
when indexing, java.io.FileNotFoundException
Hi, I am getting this exception now and then when I am indexing content. It doesn't always happen, but when it happens I have to delete the index and start over again. This is a serious problem. In this email, Doug says it has something to do with Win32's lack of atomic renaming: http://java2.5341.com/msg/1348.html But how can I prevent this?

Chris Lu

java.io.FileNotFoundException: C:\data\indexes\customer\_temp\0\_1e.fnm (The system cannot find the file specified)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
    at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
    at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
    at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
    at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:53)
    at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:94)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:480)
    at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:458)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)
Re: How do I delete?
: anywhere. I checked the count coming back from the delete operation and
: it is zero. I even tried to delete another unique term with similar
: results.

First off, are you absolutely certain you are closing the reader? It's not in the code you listed. Second, I'd bet $1 that when your documents were indexed, your "reference" field was analyzed and parsed into multiple terms. Did you try searching for the Term you're trying to delete by? (I hear Luke is a pretty handy tool for checking exactly which Terms are in your index.)

: Here is the delete and associated code:
:
: reader = IndexReader.open(database);
: Term t = new Term("reference", reference);
: try {
:     reader.delete(t);
: } catch (Exception e) {
:     System.out.println("Delete exception: " + e);
: }

-Hoss
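Hoss's diagnosis points at the fix: index the delete key as an untokenized Keyword field, then delete by the exact term. A sketch against the IndexReader API of that era (method names as in Lucene 1.4; the field name "reference" follows the thread):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class DeleteByReference {
    // Index time: a Keyword field is stored and indexed as a single
    // untokenized term, so the exact value can be matched back later.
    static void addKey(Document doc, String reference) {
        doc.add(Field.Keyword("reference", reference));
    }

    // Delete time: delete(Term) returns the number of documents marked
    // deleted; closing the reader is what commits the deletions to disk.
    static int deleteByReference(String indexDir, String reference)
            throws Exception {
        IndexReader reader = IndexReader.open(indexDir);
        int n = reader.delete(new Term("reference", reference));
        reader.close();
        return n;
    }
}
```

If the field was indexed with an analyzer instead (a tokenized Text field), the stored term will not equal the original string, which is consistent with the delete count of zero reported above.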
enquiries - pls help, thanks
Hi, may I know whether Lucene currently supports indexing of XML documents? I tried building an index of all my directories in webapps via:

java org.apache.lucene.demo.IndexFiles /homedir/tomcat/webapps

Then I tried searching with the following command:

java org.apache.lucene.demo.SearchFiles

and I typed in my query. I was able to see the files, which pointed me to the path that holds my data. However, when I do

java org.apache.lucene.demo.IndexHTML -create -index /homedir/index ..

and go to my website, I realise it can't search for the data I wanted. I want to search data within XML documents. May I know if the current demo version allows indexing of XML documents? And why is it that after I do java org.apache.lucene.demo.IndexHTML -create -index /homedir/index .. the data I wanted can't be searched?

Thanks a lot! jac