which HTML parser is better?

2005-02-01 Thread Jingkang Zhang
Three HTML parsers (the Lucene web application demo, CyberNeko HTML Parser,
and JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it
filter the tags that are auto-created by MS Word's 'Save As HTML' function?




Can I sort search results by score and docID at one time?

2005-02-01 Thread Jingkang Zhang
Lucene supports sorting by score or by docID. Now I want to
sort search results by score and docID, or by two
fields at one time, like the SQL
command "order by score, docID". How can I do it?




lucene docs in bulk read?

2005-02-01 Thread Chris Fraschetti
Hey folks.. thanks in advance to any who respond...

I do a good deal of post-search processing and the file io to read the
fields I need becomes horribly costly and is definitely a problem. Is
there any way to either retrieve 1. the entire doc (all fields that
can be retrieved) and/or 2. a group of docs.. specified by say an
array of doc ids?

I've optimized to retrieve the entire list of fields instead of 1 by
1.. and also retrieve only the minimal number of fields that I can..
but still my profilers show me that the lucene io to read the doc
fields is where I spend 95% of my time. Of course this is obvious
given the nature of how it all works.. but can anyone think of a
better way to go about retrieving docs in bulk? Are the different
types of fields quicker/slower than others when retrieving them from
the index?

-- 
___
Chris Fraschetti
e [EMAIL PROTECTED]




Re: which HTML parser is better?

2005-02-01 Thread sergiu gordea
Jingkang Zhang wrote:

Three HTML parsers (the Lucene web application demo, CyberNeko HTML Parser,
and JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it
filter the tags that are auto-created by MS Word's 'Save As HTML' function?


maybe you can try this library...

http://htmlparser.sourceforge.net/

I use the following code to get the text from HTML files;
it was not intensively tested, but it works.

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.Translate;

Parser parser = new Parser(source.getAbsolutePath());
NodeIterator iter = parser.elements();
while (iter.hasMoreNodes()) {
    Node element = (Node) iter.nextNode();
    // System.out.println("1: " + element.getText());
    String text = Translate.decode(element.toPlainTextString());
    if (Utils.notEmptyString(text))
        writer.write(text);
}

Sergiu




Source code for an accent-removal filter

2005-02-01 Thread Peter Pimley
Hi.
In December I made some posts concerning a filter that could work by 
getting the unicode name of a character and trying to figure out the 
closest latin equivalent.  For example, if it encountered character 00C1 
LATIN CAPITAL LETTER A WITH ACUTE, it would be clever enough to replace 
that with regular 'A'.

I got moved onto another project for a while so I've not looked at the 
problem much since then.  I'm back on it for a few days now though :)

The following perl program generates some Java source for a filter that 
carries out the above task.

Get 'UnicodeData.txt' from www.unicode.org, and then do the following:
   perl make_accent_filter.pl make.this.java.Class < UnicodeData.txt
to generate make/this/java/Class.java.
This comes with no license and no warranty  ;)
Do not think this is the full solution to your unicode-mangling 
problems.  I'm using it as a last resort catch-all after some other 
filters that use the IBM ICU4J library to do all sorts of decomposition 
and character-category magic.  Once I get it all working I should be 
able to post some pointers and code snippets up here.

Peter
---8<---
# usage:  perl make_accent_filter.pl my.full.ClassName < UnicodeData.txt
#
# creates my/full/ClassName.java
use strict;
use warnings;
use File::Path;
use File::Basename;

# decompose the classname that they gave us.
#
# TODO: this doesn't work if the classname has no dots (i.e. it's not in a
# package)
my $full_class = shift;
my @parts = $full_class =~ '^(.*)\.(.*)$';
my $package = shift @parts;
my $class = shift @parts;

# print to the correct place
my $path = $full_class;
$path =~ s/\./\//g;
$path = "$path.java";
mkpath dirname $path;
open STDOUT, ">$path" or die "Could not redirect stdout";

print <<END_JAVA;
// THIS FILE WAS AUTOGENERATED BY make_accent_filter.pl, DO NOT EDIT BY HAND.

package $package;

import org.apache.lucene.analysis.*;
import java.io.*;
import java.util.*;

public class $class extends TokenFilter {

   public $class (TokenStream input) {
       super (input);
       createHash();
   }

   // The replacement character, indexed by unicode value.
   // (i.e. Character objects indexed by Integer objects)
   private static Hashtable values = null;

   // Creates a Hashtable from the array at the bottom of this file.
   private void createHash () {
       // only run this for the first object of this class
       if (values != null) return;
       values = new Hashtable ();
       int i = 0;
       while (true) {
           if (array[i] == null) break; // 'array' is null terminated.
           Object number = array[i++];
           Object replacement = array[i++];
           values.put (number, replacement);
       }
       // we're done with 'array', it can be garbage collected
       array = null;
   }

   public Token next () throws IOException {
       Token t = input.next ();
       if (t == null) return null; // eof
       String s = t.termText();
       s = substituteAZString (s);
       return new Token (s, t.startOffset(), t.endOffset());
   }

   private String substituteAZString (String s) {
       char [] current = s.toCharArray ();
       char [] AZ = new char [current.length];
       int AZi = 0;
       for (int i = 0; i < current.length; i++) {
           AZ[AZi++] = substituteAZChar (current[i]);
       }
       s = new String (AZ);
       return s;
   }

   private char substituteAZChar (char c) {
       Integer key = new Integer ((int) c);
       if (values.containsKey(key)) {
           c = ((Character)values.get(key)).charValue();
       }
       return c;
   }

   private static Object [] array = {
END_JAVA

# we only care about characters whose names are of the form:
my $latin_pattern = 'LATIN (.*) LETTER (.)( .*)$';

while (<STDIN>) {
   my @parts = split /;/;
   my $num  = shift @parts;
   my $name = shift @parts;
   my @matches;
   if (@matches = ($name =~ $latin_pattern)) {
       my $case = shift @matches;
       my $convert_to_lc = ($case eq "SMALL");
       my $letter = shift @matches;
       $letter = lc $letter if $convert_to_lc;
       printf "       new Integer (0x%s), new Character ('%s'), // %s\n",
           $num, $letter, $name;
   }
}

print <<END_JAVA;
       null };
}
END_JAVA


Adding Fields to Document (with same name)

2005-02-01 Thread TheRanger
Hi,

what happens when I add two fields with the same name to one Document?

Document doc = new Document();
doc.add(Field.Text("bla", "this is my first text"));
doc.add(Field.Text("bla", "this is my second text"));

Will the second text overwrite the first, because only one field can be held
with the same name in one document?

Will the first and the second text be merged when I search in the field "bla"
(e.g. with the query bla:text)?

I am working on XML indexing and did not get an error when having repeated
XML fields. Now I am wondering...

Karl




Re: Adding Fields to Document (with same name)

2005-02-01 Thread Chris Lamprecht
Hi Karl,

From _Lucene in Action_, section 2.2: when you add the same field with
different values, "internally, Lucene appends all the words together
and indexes them in a single Field ..., allowing you to use any of the
given words when searching."

See also http://www.lucenebook.com/search?query=appendable+fields
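A minimal sketch of that behavior, assuming the Lucene 1.4-era API (the index
path and field name here are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

public class AppendableFieldsDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/demo-index",
                new StandardAnalyzer(), true); // create a fresh index
        Document doc = new Document();
        doc.add(Field.Text("bla", "this is my first text"));
        doc.add(Field.Text("bla", "this is my second text"));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher("/tmp/demo-index");
        // "first" came from one value and "second" from the other,
        // yet both land in the same logical field of the same document.
        Hits hits = searcher.search(
            QueryParser.parse("first AND second", "bla", new StandardAnalyzer()));
        System.out.println("hits: " + hits.length()); // expect 1
        searcher.close();
    }
}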

-chris

On Tue, 1 Feb 2005 11:42:23 +0100 (MET), [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
 Hi,
 
 what happens when I add two fields with the same name to one Document?
 
 Document doc = new Document();
 doc.add(Field.Text("bla", "this is my first text"));
 doc.add(Field.Text("bla", "this is my second text"));
 
 Will the second text overwrite the first, because only one field can be held
 with the same name in one document?
 
 Will the first and the second text be merged when I search in the field "bla"
 (e.g. with the query bla:text)?
 
 I am working on XML indexing and did not get an error when having repeated
 XML fields. Now I am wondering...
 
 Karl
 





Re: Can I sort search results by score and docID at one time?

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 4:21 AM, Jingkang Zhang wrote:
Lucene supports sorting by score or by docID. Now I want to
sort search results by score and docID, or by two
fields at one time, like the SQL
command "order by score, docID". How can I do it?
Sorting by multiple fields (including score and document id) is 
supported.  Here's an example:

 new Sort(new SortField[]{
     new SortField("category"),
     SortField.FIELD_SCORE,
     new SortField("pubmonth", SortField.INT, true)
 })
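A sketch of how such a Sort is used, assuming an existing IndexSearcher and
Query (the variable names are made up):

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// category ascending, then relevance, then pubmonth (an int field) descending
Sort sort = new Sort(new SortField[]{
    new SortField("category"),
    SortField.FIELD_SCORE,
    new SortField("pubmonth", SortField.INT, true)
});
Hits hits = searcher.search(query, sort);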



Re: which HTML parser is better?

2005-02-01 Thread Michael Giles
When I tested parsers a year or so ago for intensive use in Furl, the
best (tolerant of bad HTML) and fastest (tested on a 1.5M HTML page)
parser by far was TagSoup ( http://www.tagsoup.info ). It is actively
maintained and improved and I have never had any problems with it.
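For plain text extraction, a minimal sketch with TagSoup's SAX interface
(only the org.ccil.cowan.tagsoup.Parser class is TagSoup's; the file name
is made up):

import java.io.FileReader;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupText {
    public static void main(String[] args) throws Exception {
        final StringBuffer text = new StringBuffer();
        // TagSoup presents malformed HTML through a standard SAX XMLReader
        XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        reader.setContentHandler(new DefaultHandler() {
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new FileReader("page.html")));
        System.out.println(text);
    }
}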

-Mike

Jingkang Zhang wrote:

Three HTML parsers (the Lucene web application demo, CyberNeko HTML Parser,
and JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it
filter the tags that are auto-created by MS Word's 'Save As HTML' function?




Duplicate Hits

2005-02-01 Thread Jerry Jalenak
Is there a way to eliminate duplicate hits being returned from the index?

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]





Re: lucene docs in bulk read?

2005-02-01 Thread Kelvin Tan
Hi Chris, are your fields string or reader? How large do your fields get?

Kelvin

On Tue, 1 Feb 2005 01:40:39 -0800, Chris Fraschetti wrote:
 Hey folks.. thanks in advance to any who respond...

 I do a good deal of post-search processing and the file io to read
 the fields I need becomes horribly costly and is definitely a
 problem. Is there any way to either retrieve 1. the entire doc (all
 fields that can be retrieved) and/or 2. a group of docs.. specified
 by say an array of doc ids?

 I've optimized to retrieve the entire list of fields instead of 1
 by 1.. and also retrieve only the minimal number of fields that I
 can.. but still my profilers show me that the lucene io to read the
 doc fields is where I spend 95% of my time. Of course this is
 obvious given the nature of how it all works.. but can anyone think
 of a better way to go about retrieving docs in bulk? Are the
 different types of fields quicker/slower than others when
 retrieving them from the index?






Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 9:01 AM, Jerry Jalenak wrote:
Is there a way to eliminate duplicate hits being returned from the 
index?
Sure, don't put duplicate documents in the index :)
Erik


RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
OK, OK.  Should have seen that response coming  8-)

The documents I'm indexing are sent from a legacy system, and can be sent
multiple times - but I only want to keep the documents if something has
changed.  If the indexed fields match exactly, I don't want to index the
second (or third, forth, etc) documents.  If the indexed fields have
changed, then I want to index the 'new' document, and keep it.

Given Erik's response of 'don't put duplicate documents in the index', how
can I accomplish this in the IndexWriter?

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 8:35 AM
To: Lucene Users List
Subject: Re: Duplicate Hits


On Feb 1, 2005, at 9:01 AM, Jerry Jalenak wrote:
 Is there a way to eliminate duplicate hits being returned from the 
 index?

Sure, don't put duplicate documents in the index :)

Erik





User Rights Management in Lucene

2005-02-01 Thread Verma Atul (extern)
Hi,

I'm new to Lucene and want to know whether Lucene has the capability of
displaying search results based on the user's rights.

For Example:

There are suppose some resources, like :

Resource 1
Resource 2
Resource 3
Resource 4

And there are say 2 users with 

User 1 having access to Resource 1, Resource 2 and Resource 4; and User
2 having access to Resource 1 and Resource 3

So when User 1 searches the database, he should get results from
Resource 1, 2 and 4, but

when User 2 searches the database, he should get results from
Resource 1 and 3.

Regards
Atul Verma


Re: User Rights Management in Lucene

2005-02-01 Thread PA
On Feb 01, 2005, at 16:01, Verma Atul (extern) wrote:
I'm new to Lucene and want to know, whether Lucene has the capability 
of
displaying the search results based the Users Rights.
Not by itself. But you can make it so.
Cheers
--
PA, Onnay Equitursay
http://alt.textdrive.com/


Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote:
Given Erik's response of 'don't put duplicate documents in the index', how
can I accomplish this in the IndexWriter?
 

I was dealing with a similar requirement recently.   I eventually 
decided on storing the MD5 checksum of the document as a keyword.   It 
means reading it twice (once to calculate the checksum, once to index 
it), but it seems to do the trick.
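A minimal sketch of the check-before-index idea, assuming the Lucene 1.4-era
API (the "md5" field name and hex encoding are just one way to do it):

import java.security.MessageDigest;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class Md5Dedup {
    // Hex-encode the MD5 of the document's raw bytes.
    public static String md5Hex(byte[] content) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(content);
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < digest.length; i++) {
            String hex = Integer.toHexString(digest[i] & 0xff);
            if (hex.length() == 1) sb.append('0');
            sb.append(hex);
        }
        return sb.toString();
    }

    // True if a document with this checksum is already in the index.
    public static boolean alreadyIndexed(String indexPath, String md5)
            throws Exception {
        IndexReader reader = IndexReader.open(indexPath);
        try {
            return reader.docFreq(new Term("md5", md5)) > 0;
        } finally {
            reader.close();
        }
    }
}

When a document is new, the checksum would be stored untokenized, e.g.
doc.add(Field.Keyword("md5", md5Hex(bytes))), so it survives as a single term.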

jch


RE: User Rights Management in Lucene

2005-02-01 Thread Verma Atul (extern)
Thanks for the help. This means that the user management has to be done
on top of Lucene.

-Original Message-
From: PA [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 01, 2005 4:06 PM
To: Lucene Users List
Subject: Re: User Rights Management in Lucene


On Feb 01, 2005, at 16:01, Verma Atul (extern) wrote:

 I'm new to Lucene and want to know, whether Lucene has the capability 
 of
 displaying the search results based the Users Rights.

Not by itself. But you can make it so.

Cheers

--
PA, Onnay Equitursay
http://alt.textdrive.com/





RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
Nice idea John - one I hadn't considered.  Once you have the checksum, do
you 'check' in the index first before storing the second document?  Or do
you filter on the query side?

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


-Original Message-
From: John Haxby [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 9:06 AM
To: Lucene Users List
Subject: Re: Duplicate Hits


Jerry Jalenak wrote:

Given Erik's response of 'don't put duplicate documents in the index', how
can I accomplish this in the IndexWriter?
  

I was dealing with a similar requirement recently.   I eventually 
decided on storing the MD5 checksum of the document as a keyword.   It 
means reading it twice (once to calculate the checksum, once to index 
it), but it seems to do the trick.

jch




Re: User Rights Management in Lucene

2005-02-01 Thread PA
On Feb 01, 2005, at 16:07, Verma Atul (extern) wrote:
Thanks for the help. This means that the User management has to be done
over Lucene.
Your choice. But in a nutshell, yes.
Cheers
--
PA, Onnay Equitursay
http://alt.textdrive.com/


Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 9:49 AM, Jerry Jalenak wrote:
Given Erik's response of 'don't put duplicate documents in the index', 
how
can I accomplish this in the IndexWriter?
As John said - you'll have to come up with some way of knowing whether 
you should index or not.  For example, when dealing with filesystem 
files, the Ant index task (in the sandbox) checks last modified date 
and only indexes new files.

Using a unique id on your data (primary key from a DB, URL from web 
pages, etc) is generally what people use for this.
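A sketch of the usual delete-then-add pattern keyed on such an id, assuming
the Lucene 1.4-era API (the "id" field name is made up, and the id must have
been indexed with Field.Keyword so it stays a single term):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateById {
    // Replace any existing copy of this document, keyed on its unique id.
    public static void update(String indexPath, String id, Document newDoc)
            throws Exception {
        IndexReader reader = IndexReader.open(indexPath);
        reader.delete(new Term("id", id)); // no-op if the id isn't there yet
        reader.close();                    // closing commits the deletions

        IndexWriter writer = new IndexWriter(indexPath,
                new StandardAnalyzer(), false); // open the existing index
        writer.addDocument(newDoc);
        writer.close();
    }
}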

Erik


Re: User Rights Management in Lucene

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 10:01 AM, Verma Atul (extern) wrote:
Hi,
I'm new to Lucene and want to know, whether Lucene has the capability 
of
displaying the search results based the Users Rights.

For Example:
There are suppose some resources, like :
Resource 1
Resource 2
Resource 3
Resource 4
And there are say 2 users with
User 1 having access to Resource 1, Resource 2 and Resource 4; and User
2 having access to Resource 1 and Resource 3
So when User 1 searches the database, then he should get results from
Resource 1, 2 and 4, but
When User 2 searches the databse, then he should get results from
Resource 1 and 3.
Lucene in Action has a SecurityFilterTest example (grab the source code 
distribution).  You can see a glimpse of this here:

http://www.lucenebook.com/search?query=security
So yes, it's possible to index a username or roles alongside each 
document and apply that criteria to any search a user makes, such that a 
user only gets the documents he is allowed to see.  How complex this gets 
depends on how you need the permissions to work - the LIA example is 
rudimentary and simply associates an owner with each document, and users 
are only allowed to see the documents they own.
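A rudimentary sketch along those lines, assuming the Lucene 1.4-era API and
that each document was indexed with Field.Keyword("owner", username) (the
field and user names are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

// assumes an existing IndexSearcher `searcher` and user Query `query`
QueryFilter securityFilter =
        new QueryFilter(new TermQuery(new Term("owner", "user1")));
Hits hits = searcher.search(query, securityFilter);
// only documents owned by user1 can come back, whatever the query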

Erik


Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote:
Nice idea John - one I hadn't considered.  Once you have the checksum, do
you 'check' in the index first before storing the second document?  Or do
you filter on the query side?
 

I do a quick search for the md5 checksum before indexing.
Although I suspect not applicable in your case, I also maintained a 
"last time something was indexed" timestamp alongside the index.  I used this 
to drastically prune the number of documents that needed to be 
considered for indexing if I restarted; anything modified before then 
wasn't a candidate.  Since the MD5 checksum provides the definitive (for 
a sufficiently loose definition of "definitive") indication of whether a 
document is indexed, I didn't need to worry about ultra-fine granularity 
in the time stamp, and I didn't need to worry about it being committed to 
disk; it generally got committed to the magnetic stuff every few seconds 
or so.

It does help a lot though if documents have nice unique identifiers that 
you can use instead, then you can use the identifier and the last 
modified time to decide whether or not to re-index.

jch


RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
Just to make sure I understand...

Do you keep an IndexReader open at the same time you are running the
IndexWriter?  From what I can see in the JavaDocs, it looks like only an
IndexReader (or IndexSearcher) can peek into the index and see if a document
exists or not...

Thanks!

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


-Original Message-
From: John Haxby [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 9:39 AM
To: Lucene Users List
Subject: Re: Duplicate Hits


Jerry Jalenak wrote:

Nice idea John - one I hadn't considered.  Once you have the checksum, do
you 'check' in the index first before storing the second document?  Or do
you filter on the query side?
  

I do a quick search for the md5 checksum before indexing.

Although I suspect not applicable in your case, I also maintained a 
"last time something was indexed" timestamp alongside the index.  I used this 
to drastically prune the number of documents that needed to be 
considered for indexing if I restarted; anything modified before then 
wasn't a candidate.  Since the MD5 checksum provides the definitive (for 
a sufficiently loose definition of "definitive") indication of whether a 
document is indexed, I didn't need to worry about ultra-fine granularity 
in the time stamp, and I didn't need to worry about it being committed to 
disk; it generally got committed to the magnetic stuff every few seconds 
or so.

It does help a lot though if documents have nice unique identifiers that 
you can use instead, then you can use the identifier and the last 
modified time to decide whether or not to re-index.

jch




RE: Duplicate Hits

2005-02-01 Thread Jerry Jalenak
OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I
really don't want to 'batch' them up if I can avoid it.  And I also don't
think I can keep an IndexReader open on the index at the same time I have an
IndexWriter open.  I may have to try and deal with this issue through some
sort of filter on the query side, provided it doesn't impact performance too
much.

Thanks.

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


-Original Message-
From: John Haxby [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 9:48 AM
To: Lucene Users List
Subject: Re: Duplicate Hits


Jerry Jalenak wrote:

Just to make sure I understand

Do you keep an IndexReader open at the same time you are running the
IndexWriter?  From what I can see in the JavaDocs, it looks like only
IndexReader (or IndexSearch) can peek into the index and see if a document
exists or not
  

I slightly misled you: it wasn't Lucene that I was using at the time and 
in that system the distinction between IndexReader and IndexWriter 
didn't exist.   I'm just getting to grips with Lucene really but it 
would seem to be possible to use a similar scheme, especially if you 
batch up your documents for indexing: as they come in, check the md5 
checksum against what's already known and what's already queued and then 
when the time comes to process the queue you know what you've got needs 
to be indexed.

jch




Re: Duplicate Hits

2005-02-01 Thread John Haxby
Jerry Jalenak wrote:
OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I
really don't want to 'batch' them up if I can avoid it.  And I also don't
think I can keep an IndexRead open to the index at the same time I have an
IndexWriter open.  I may have to try and deal with this issue through some
sort of filter on the query side, provided it doesn't impact performance to
much.
 

I was thinking of indexing in batches of a few documents (10? 100? 
1000?) which means flipping between IndexReaders and IndexWriters 
wouldn't be too onerous.

jch


IndexSearcher close

2005-02-01 Thread Ravi
Is there a way to check if an IndexSearcher is closed? 


Thanks in advance,
Ravi.






Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote:
OK - but I'm dealing with indexing between 1.5 and 2 million 
documents, so I
really don't want to 'batch' them up if I can avoid it.  And I also 
don't
think I can keep an IndexRead open to the index at the same time I 
have an
IndexWriter open.  I may have to try and deal with this issue through 
some
sort of filter on the query side, provided it doesn't impact 
performance to
much.
You can use an IndexReader and IndexWriter at the same time (the caveat 
is that you cannot delete with the IndexReader at the same time you're 
writing with an IndexWriter).  Is there no other identifying 
information, though, on the incoming documents with a date stamp?  
Identifier?  Or something unique you can go on?

Erik


How to get document count?

2005-02-01 Thread Jim Lynch
I've indexed a large set of documents and think that something may have 
gone wrong somewhere in the middle.  Is there a way I can display the 
count of documents in the index? 

Thanks,
Jim.


Re: How to get document count?

2005-02-01 Thread Luke Shannon
Not sure if the API provides a method for this, but you could use Luke:

http://www.getopt.org/luke/

It gives you a count and lets you step through each Doc looking at their
fields.

- Original Message - 
From: Jim Lynch [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Tuesday, February 01, 2005 11:28 AM
Subject: How to get document count?


 I've indexed a large set of documents and think that something may have
 gone wrong somewhere in the middle.  Is there a way I can display the
 count of documents in the index?

 Thanks,
 Jim.




RE: How to get document count?

2005-02-01 Thread Ravi
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#docCount()

You can try this.
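If the index is already written, an IndexReader gives the same information;
a quick sketch (the index path is made up), where numDocs() is net of
deletions and maxDoc() is not:

import org.apache.lucene.index.IndexReader;

public class CountDocs {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        System.out.println("documents: " + reader.numDocs()); // excludes deleted docs
        System.out.println("maxDoc:    " + reader.maxDoc());  // includes deleted slots
        reader.close();
    }
}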

-Original Message-
From: Luke Shannon [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 01, 2005 11:33 AM
To: Lucene Users List
Subject: Re: How to get document count?

Not sure if the API provides a method for this, but you could use Luke:

http://www.getopt.org/luke/

It gives you a count and lets you step through each Doc looking at their
fields.

- Original Message - 
From: Jim Lynch [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Tuesday, February 01, 2005 11:28 AM
Subject: How to get document count?


 I've indexed a large set of documents and think that something may have
 gone wrong somewhere in the middle.  Is there a way I can display the
 count of documents in the index?

 Thanks,
 Jim.




RE: which HTML parser is better?

2005-02-01 Thread Chuck Williams
I think that depends on what you want to do.  The Lucene demo parser does 
simple mapping of HTML files into Lucene Documents; it does not give you a 
parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the 
same API; will likely become part of Xerces), and so maps an HTML document into 
a full DOM that you can manipulate easily for a wide range of purposes.  I 
haven't used JTidy at an API level and so don't know it as well -- based on its 
UI, it appears to be focused primarily on HTML validation and error 
detection/correction.

I use CyberNeko for a range of operations on HTML documents that go beyond 
indexing them in Lucene, and really like it.  It has been robust for me so far.
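A minimal sketch of CyberNeko's DOM route (the DOMParser class is NekoHTML's;
the file name and the text-walking helper are made up):

import java.io.FileReader;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NekoDemo {
    // Recursively collect the text nodes under a DOM node.
    static void collectText(Node node, StringBuffer out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    public static void main(String[] args) throws Exception {
        DOMParser parser = new DOMParser(); // NekoHTML's Xerces-based HTML parser
        parser.parse(new InputSource(new FileReader("page.html")));
        Document dom = parser.getDocument();
        StringBuffer text = new StringBuffer();
        collectText(dom, text);
        System.out.println(text); // plain text, ready for a Lucene field
    }
}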

Chuck

   -Original Message-
   From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, February 01, 2005 1:15 AM
   To: lucene-user@jakarta.apache.org
   Subject: which HTML parser is better?
   
   Three HTML parsers (the Lucene web application demo, CyberNeko HTML
   Parser, and JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
   best? Can it filter the tags that are auto-created by MS Word's
   'Save As HTML' function?
   



Re: lucene docs in bulk read?

2005-02-01 Thread Chris Fraschetti
Well all my fields are strings when I index them. They're all very
short strings, dates, hashes, etc. The largest field has a cap of 256
chars and there is only one of them, the rest are all fairly small.

Can you explain what you meant by 'string or reader' ?

Thanks,
Chris


On Tue, 1 Feb 2005 15:11:18 +0100, Kelvin Tan [EMAIL PROTECTED] wrote:
 Hi Chris, are your fields string or reader? How large do your fields get?
 
 Kelvin
 
 On Tue, 1 Feb 2005 01:40:39 -0800, Chris Fraschetti wrote:
  Hey folks.. thanks in advance to any who respond...
 
  I do a good deal of post-search processing and the file io to read
  the fields I need becomes horribly costly and is definitely a
  problem. Is there any way to either retrieve 1. the entire doc (all
  fields that can be retrieved) and/or 2. a group of docs.. specified
  by say an array of doc ids?
 
  I've optimized to retrieve the entire list of fields instead of 1
  by 1.. and also retrieve only the minimal number of fields that I
  can.. but still my profilers show me that the lucene io to read the
  doc fields is where I spend 95% of my time. Of course this is
  obvious given the nature of how it all works.. but can anyone think
  of a better way to go about retrieving docs in bulk? Are the
  different types of fields quicker/slower than others when
  retrieving them from the index?
 
 
 
 


-- 
___
Chris Fraschetti
e [EMAIL PROTECTED]




Re: Duplicate Hits

2005-02-01 Thread sergiu gordea
Erik Hatcher wrote:
On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote:
OK - but I'm dealing with indexing between 1.5 and 2 million 
documents, so I
really don't want to 'batch' them up if I can avoid it.  And I also 
don't
think I can keep an IndexRead open to the index at the same time I 
have an
IndexWriter open.  I may have to try and deal with this issue through 
some
sort of filter on the query side, provided it doesn't impact 
performance to
much.

You can use an IndexReader and IndexWriter at the same time (the 
caveat is that you cannot delete with the IndexReader at the same time 
you're writing with an IndexWriter).  Is there no other identifying 
information, though, on the incoming documents with a date stamp?  
Identifier?  Or something unique you can go on?

Erik
As Erik suggested earlier, I think that keeping the information in the 
database and identifying the new entries at database level is a better 
approach.
Indexing documents and optimizing an index that big will be 
very time consuming.
Also... consider that in the future you may want to modify the 
structure of your index.

Think how much effort it will be to split some fields into a few smaller 
parts, or just to change the format of a field;
let's say you have a date in DDMMYY format and you need to change to 
MMDD.

And consider how much effort is needed to rebuild a completely new index 
from the database...

Of course, your requirements may not ask to have the information stored 
in the database, and... it is up to you to use a DB + Lucene index,
or just a Lucene index.
Best,
Sergiu



Re: How to get document count?

2005-02-01 Thread Jim Lynch
That works, thanks.  I can't use Luke on this system.   It fails for 
some reason.

Jim.
Ravi wrote:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexW
riter.html#docCount()
You can try this.
-Original Message-
From: Luke Shannon [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 01, 2005 11:33 AM
To: Lucene Users List
Subject: Re: How to get document count?

Not sure if the API provides a method for this, but you could use Luke:
http://www.getopt.org/luke/
It gives you a count and lets you step through each Doc looking at their
fields.
- Original Message - 
From: Jim Lynch [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Tuesday, February 01, 2005 11:28 AM
Subject: How to get document count?

 

I've indexed a large set of documents and think that something may have
gone wrong somewhere in the middle.  Is there a way I can display the
count of documents in the index?

Thanks,
Jim.


competition - Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-02-01 Thread David Spencer
I wasn't sure where in this thread to reply so I'm replying to myself :)
What search appliances exist now?
I only found 3:
[1] Google
[2] Thunderstone
http://www.thunderstone.com/texis/site/pages/Appliance.html
[3] IndexEngines (not out yet)
http://www.indexengines.com/

--
Also, out of curiosity, do people have appliance h/w vendors they like?
These guys seem like they have nice options for pretty colors:
http://www.mbx.com/oem/index.cfm
http://www.mbx.com/oem/options/

David Spencer wrote:
This reminds me, has anyone every discussed something similar:
- rackmount server ( or for coolness factor, that mini mac)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch
Part of the work here I think is having a decent web i/f to configure 
the thing and to customize the L&F of the search results.


jian chen wrote:
Hi,
I was searching using google and just found that there was a new
feature called google mini. Initially I thought it was another free
service for small companies. Then I realized that it costs quite some
money ($4,995) for the hardware and software. (I guess the proprietary
software costs a whole lot more than actual hardware.)
The nice feature is that, you can only index up to 50,000 documents
with this price. If you need to index more, sorry, send in the
check...
It seems to me that any small biz will be ripped off if they install
this google mini thing, compared to using Lucene to implement an easy
to use search software, which could search up to whatever number of
documents you could imagine.
I hope the lucene project could get exposed more to the enterprise so
that people know that they have not only cheaper but more importantly,
BETTER alternatives.
Jian


How do I delete?

2005-02-01 Thread Jim Lynch
I've been merrily cooking along, thinking I was replacing documents when 
I haven't.  My logic is to go through a batch of documents, get a field 
called "reference" which is unique, build a term from it, and delete it 
via the reader.delete() method.  Then I close the reader and open a 
writer and reprocess the batch, indexing all.

Here is the delete and associated code:

  reader = IndexReader.open(database);
  Term t = new Term("reference", reference);
  try {
    reader.delete(t);
  } catch (Exception e) {
    System.out.println("Delete exception: " + e);
  }

except it isn't working.  I tried to do a commit and a doCommit, but 
those are both protected.  I do a reader.close() after processing the 
batch the first time.

What am I missing?  I don't get an exception.  Reference is definitely a 
valid field, 'cause I print out the value at search time and compare to 
the doc and they are identical.

Thanks,
Jim.


Re: How do I delete?

2005-02-01 Thread Joseph Ottinger
I've had success with deletion by running IndexReader.delete(int), then
getting an IndexWriter and optimizing the directory. I don't know if
that's the right way to do it or not.

On Tue, 1 Feb 2005, Jim Lynch wrote:

 I've been merrily cooking along, thinking I was replacing documents when
 I haven't.  My logic is to go through a batch of documents, get a field
 called reference which is unique build a term from it and delete it
 via the reader.delete() method.  Then I close the reader and open a
 writer and reprocess the batch indexing all.

 Here is the delete and associated code:

   reader = IndexReader.open(database);

   Term t = new Term("reference", reference);
   try {
     reader.delete(t);
   } catch (Exception e) {
     System.out.println("Delete exception: " + e);
   }

 except it isn't working.  I tried to do a commt and a doCommit, but
 those are both protected.  I do a reader.close() after processing the
 batch the first time.

 What am I missing?  I don't get an exception.  Reference is definitely a
 valid field, 'cause I print out the value at search time and compare to
 the doc and they are identical.

 Thanks,
 Jim.



---
Joseph B. Ottinger http://enigmastation.com
IT Consultant[EMAIL PROTECTED]





Re: lucene docs in bulk read?

2005-02-01 Thread Kelvin Tan
Please see inline.

On Tue, 1 Feb 2005 09:27:26 -0800, Chris Fraschetti wrote:
 Well all my fields are strings when I index them. They're all very
 short strings, dates, hashes, etc. The largest field has a cap of
 256 chars and there is only one of them, the rest are all fairly
 small.

 Can you explain what you meant by 'string or reader' ?

Sorry, I meant to ask if you're using String fields (field.stringValue()) or 
reader fields (field.readerValue()).

Can you elaborate on the post-processing you need to do? Have you thought about 
concatenating the fields you require into a single non-indexed field 
(Field.UnIndexed) for simple retrieval? It'll increase the size of your index, 
but should be faster to retrieve them all at one go.
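A sketch of that concatenation, assuming the Lucene 1.4-era API (the field
names, the tab delimiter, and the surrounding url/date/hash/hits variables
are all made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// At index time: pack the post-processing fields into one stored,
// non-indexed field so each hit needs only a single field read.
Document doc = new Document();
doc.add(Field.Keyword("url", url));
doc.add(Field.Keyword("date", date));
doc.add(Field.UnIndexed("meta", url + "\t" + date + "\t" + hash));

// At search time: one retrieval, then split on the delimiter.
String[] meta = hits.doc(i).get("meta").split("\t");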

Kelvin





Re: How do I delete?

2005-02-01 Thread Jim Lynch
Thanks, I'd try that, but I don't think it will make any difference.  If 
I modify the code to not reindex the documents, no files in the index 
directory are touched, hence there is no record of the deletions 
anywhere.  I checked the count coming back from the delete operation and 
it is zero.  I even tried to delete another unique term with similar 
results.

How does one call the commit method anyway? Isn't it automatically called?
Jim.
Joseph Ottinger wrote:
I've had success with deletion by running IndexReader.delete(int), then
getting an IndexWriter and optimizing the directory. I don't know if
that's the right way to do it or not.
On Tue, 1 Feb 2005, Jim Lynch wrote:
 

I've been merrily cooking along, thinking I was replacing documents when
I haven't.  My logic is to go through a batch of documents, get a field
called reference which is unique build a term from it and delete it
via the reader.delete() method.  Then I close the reader and open a
writer and reprocess the batch indexing all.
Here is the delete and associated code:
  reader = IndexReader.open(database);
  Term t = new Term("reference", reference);
  try {
    reader.delete(t);
  } catch (Exception e) {
    System.out.println("Delete exception: " + e);
  }
except it isn't working.  I tried to do a commt and a doCommit, but
those are both protected.  I do a reader.close() after processing the
batch the first time.
What am I missing?  I don't get an exception.  Reference is definitely a
valid field, 'cause I print out the value at search time and compare to
the doc and they are identical.
Thanks,
Jim.

---
Joseph B. Ottinger http://enigmastation.com
IT Consultant[EMAIL PROTECTED]



Re: How do I delete?

2005-02-01 Thread Joseph Ottinger
Well, in LuceneRAR, the delete by id code does exactly what I said: gets
the indexreader, deletes the doc id, then it opens a writer and optimizes.
Nothing else.

On Tue, 1 Feb 2005, Jim Lynch wrote:

 Thanks, I'd try that, but I don't think it will make any difference.  If
 I modify the code to not reindex the documents, no files in the index
 directory are touched, hence there is no record of the deletions
 anywhere.  I checked the count coming back from the delete operation and
 it is zero.  I even tried to delete another unique term with similar
 results.

 How does one call the commit method anyway? Isn't it automatically called?

 Jim.

 Joseph Ottinger wrote:

 I've had success with deletion by running IndexReader.delete(int), then
 getting an IndexWriter and optimizing the directory. I don't know if
 that's the right way to do it or not.
 
 On Tue, 1 Feb 2005, Jim Lynch wrote:
 
 
 
 I've been merrily cooking along, thinking I was replacing documents when
 I haven't.  My logic is to go through a batch of documents, get a field
 called reference which is unique build a term from it and delete it
 via the reader.delete() method.  Then I close the reader and open a
 writer and reprocess the batch indexing all.
 
  Here is the delete and associated code:
  
    reader = IndexReader.open(database);
  
    Term t = new Term("reference", reference);
    try {
      reader.delete(t);
    } catch (Exception e) {
      System.out.println("Delete exception: " + e);
    }
 
 except it isn't working.  I tried to do a commt and a doCommit, but
 those are both protected.  I do a reader.close() after processing the
 batch the first time.
 
 What am I missing?  I don't get an exception.  Reference is definitely a
 valid field, 'cause I print out the value at search time and compare to
 the doc and they are identical.
 
 Thanks,
 Jim.
 
 
 
 
 
 ---
 Joseph B. Ottinger http://enigmastation.com
 IT Consultant[EMAIL PROTECTED]
 
 


---
Joseph B. Ottinger http://enigmastation.com
IT Consultant[EMAIL PROTECTED]





Re: lucene docs in bulk read?

2005-02-01 Thread Chris Fraschetti
Definitely a good idea on the one line idea... that could possibly
save a good amount of time. I'm using .stringValue ... in reality, I
hadn't ever even considered readerValue ... is there a strong
performance difference between the two? or is it simply on the
functionality side?

The basic post processing is a grouping of results... because of the
time and space issues of my indexing process I am unable efficiently
go back and reindex a document if I have found a duplicate (my search
engine deals with multiple documents over time) .. so my post
processing groups results in the top 5000 hits which are the same,
except over different dates... But I need to grab the minimal data in
order to do this... the URL of the original page, the date of the doc,
etc... so that I can use only 1 doc, but if I find a duplicate, I can
simple add the new date to already existing doc. I am only reading a
few fields, but on a large scale of many documents, it hurts my timing
quite a bit.

-Chris


On Tue, 1 Feb 2005 21:33:13 +0100, Kelvin Tan [EMAIL PROTECTED] wrote:
 Please see inline.
 
 On Tue, 1 Feb 2005 09:27:26 -0800, Chris Fraschetti wrote:
  Well all my fields are strings when I index them. They're all very
  short strings, dates, hashes, etc. The largest field has a cap of
  256 chars and there is only one of them, the rest are all fairly
  small.
 
  Can you explain what you meant by 'string or reader' ?
 
 Sorry, I meant to ask if you're using String fields (field.stringValue()) or 
 reader fields (field.readerValue()).
 
 Can you elaborate on the post-processing you need to do? Have you thought 
 about concatenating the fields you require into a single non-indexed field 
 (Field.UnIndexed) for simple retrieval? It'll increase the size of your 
 index, but should be faster to retrieve them all at one go.
 
 Kelvin
 
 
 


-- 
___
Chris Fraschetti
e [EMAIL PROTECTED]




Combining Documents

2005-02-01 Thread Luke Shannon
Hello;

I have a situation where I need to combine the fields returned from one
document to an existing document.

Is there something in the API for this that I'm missing or is this the best
way:

// add the fields contained in the PDF document to the existing doc
Document attachedDoc = LucenePDFDocument.getDocument(attached);
Enumeration docFields = attachedDoc.fields();
while (docFields.hasMoreElements()) {
    doc.add((Field) docFields.nextElement());
}

Luke






Re: lucene docs in bulk read?

2005-02-01 Thread Kelvin Tan


On Tue, 1 Feb 2005 14:12:54 -0800, Chris Fraschetti wrote:
 Definitely a good idea on the one line idea... that could possibly
 save a good amount of time. I'm using .stringValue ... in reality,
 I hadn't ever even considered readerValue ... is there a strong
 performance difference between the two? or is it simply on the
 functionality side?

Not that I'm aware of (performance). Reader fields are useful when reading in 
bulky data which doesn't make sense to be loaded into mem as a String.

K



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Query Format

2005-02-01 Thread Hetan Shah
Hello All,
What should my query look like if I want to search for all or any of the 
following keywords?

Sun Linux Red Hat Advance Server

Replies are much appreciated.
-H


Results

2005-02-01 Thread Hetan Shah
Another question for the day:
How do I make sure that the results shown are the only ones containing the 
keywords specified?

e.g.
the result for the query Red AND HAT AND Linux
should be documents which have all three keywords, and not 
documents that have only one or two of them.

Any hints?
Thanks.


Re: Query Format

2005-02-01 Thread Erik Hatcher
How are you indexing your document?
If you're using QueryParser with the default operator set to OR (which 
is the default), then you've already provided the expression you need 
:)
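A sketch of that default-OR parse, assuming the Lucene 1.4-era API (the
"contents" field name is made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// The default operator is OR, so this matches documents containing
// any of the words; quote phrases like "Red Hat" if you want them together.
Query query = QueryParser.parse("Sun Linux Red Hat Advance Server",
        "contents", new StandardAnalyzer());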

Erik
On Feb 1, 2005, at 6:29 PM, Hetan Shah wrote:
Hello All,
What should my query look like if I want to search all or any of the 
following key words.

Sun Linux Red Hat Advance Server
replies are much appreciated.
-H


Re: Results

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 7:36 PM, Hetan Shah wrote:
Another question for the day:
How to make sure that the results shown are the only one containing 
the keywords specified?

e.g.
the result for the query Red AND HAT AND Linux
should result in documents which has all the three key words and not 
show documents that only has one or two keywords?
Huh?  You would never get documents returned that only had two of those 
terms given that AND'd query.

Erik


Re: Re-Indexing a moving target???

2005-02-01 Thread Nader Henein
details?
Yousef Ourabi wrote:

Saad,
Here is what I got. I will post again, and be more
specific.
-Y

--- Nader Henein [EMAIL PROTECTED] wrote:

 We'll need a little more detail to help you: what
 are the sizes of your updates and how often are they updated?

 1) No, just re-open the index writer every time you
 re-index; according to you it's a moderately changing index,
 so just keep a flag on the rows and batch indexing every so often.
 2) It all comes down to your needs; more detail
 would help us help you.

 Nader Henein

 Yousef Ourabi wrote:

  Hey,
  We are using lucene to index a moderately changing
  database, and I have a couple of questions on a
  performance strategy.
  1) Should we just have one index writer open until the
  system comes down... or create a new index writer each
  time we re-index our data-set?
  2) Does anyone have any thoughts... multi-threading and
  segments instead of one index?
  Thanks for your time and help.
  Best,
  Yousef

 --
 Nader S. Henein
 Senior Applications Developer
 Bayt.com



Re: User Rights Management in Lucene

2005-02-01 Thread Chandrashekhar
Hi,
If you are working on some CMS or similar app and want to have a user-rights
module, then you can use metadata for the rights information, add this
metadata into the index, and then search on this metadata.


With Regards,
Chandrashekhar V Deshmukh
- Original Message - 
From: Verma Atul (extern) [EMAIL PROTECTED]
To: lucene-user@jakarta.apache.org
Sent: Tuesday, February 01, 2005 8:31 PM
Subject: User Rights Management in Lucene


Hi,

I'm new to Lucene and want to know, whether Lucene has the capability of
displaying the search results based the Users Rights.

For Example:

There are suppose some resources, like :

Resource 1
Resource 2
Resource 3
Resource 4

And there are say 2 users with

User 1 having access to Resource 1, Resource 2 and Resource 4; and User
2 having access to Resource 1 and Resource 3

So when User 1 searches the database, then he should get results from
Resource 1, 2 and 4, but

When User 2 searches the databse, then he should get results from
Resource 1 and 3.

Regards
Atul Verma





when indexing, java.io.FileNotFoundException

2005-02-01 Thread Chris Lu
Hi,
I am getting this exception now and then when I am indexing content.
It doesn't always happen. But when it happens, I have to delete the 
index and start over again.
This is a serious problem.

In this email, Doug said it has something to do with win32's lack of 
atomic renaming:
http://java2.5341.com/msg/1348.html

But how can I prevent this?
Chris Lu
java.io.FileNotFoundException: C:\data\indexes\customer\_temp\0\_1e.fnm 
(The system cannot find the file specified)
   at java.io.RandomAccessFile.open(Native Method)
   at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
   at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
   at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
   at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
   at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:53)
   at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
   at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:94)
   at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:480)
   at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:458)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)



Re: How do I delete?

2005-02-01 Thread Chris Hostetter

: anywhere.  I checked the count coming back from the delete operation and
: it is zero.  I even tried to delete another unique term with similar
: results.

First off, are you absolutely certain you are closing the reader?  it's
not in the code you listed.

Second, I'd bet $1 that when your documents were indexed, your reference
field was analyzed and parsed into multiple terms.  Did you try searching
for the Term you're trying to delete by?

(I hear luke is a pretty handy tool for checking exactly which Terms are
in your index)

: Here is the delete and associated code:
: 
:   reader = IndexReader.open(database);
: 
:   Term t = new Term("reference", reference);
:   try {
:     reader.delete(t);
:   } catch (Exception e) {
:     System.out.println("Delete exception: " + e);
:   }
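For reference, a sketch of a term delete that actually gets committed,
assuming the Lucene 1.4-era API (and assuming "reference" was indexed
untokenized, e.g. with Field.Keyword, so the stored value and the indexed
term match exactly; the index path and the `reference` variable are made up):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

IndexReader reader = IndexReader.open("/path/to/index");
// returns the number of documents actually deleted
int deleted = reader.delete(new Term("reference", reference));
System.out.println("deleted " + deleted + " document(s)");
reader.close(); // the deletions are committed when the reader is closed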


-Hoss





enquiries - pls help, thanks

2005-02-01 Thread jac jac

Hi

May I know whether Lucene currently supports indexing of xml documents?

I tried building an index to index all my directories in webapps:
via: 

java org.apache.lucene.demo.IndexFiles /homedir/tomcat/webapps

then I tried using the following command to search:

java org.apache.lucene.demo.SearchFiles

and I typed in my query. I was able to see the files, which directed me to the path 
which holds my data. 

However, when I do
java org.apache.lucene.demo.IndexHTML -create -index /homedir/index ..

and I went to my website, I realised it couldn't search for the data I wanted. 

I want to search data within XML documents... May I know if the current demo 
version allows indexing of XML documents? 

Why is it that after I do java org.apache.lucene.demo.IndexHTML -create -index 
/homedir/index .. the data I wanted can't be searched? Thanks a lot!

 

jac



 
