Three HTML parsers (Lucene web application
demo, CyberNeko HTML Parser, JTidy) are mentioned in
Lucene FAQ
1.3.27. Which is the best? Can it filter tags that are
auto-created by MS Word's 'Save As HTML' function?
Lucene supports sorting by score or docID. Now I want to
sort search results by score and docID, or by two
fields at one time, like the SQL
command ORDER BY score, docID. How can I do it?
Hey folks.. thanks in advance to any who respond...
I do a good deal of post-search processing and the file io to read the
fields I need becomes horribly costly and is definitely a problem. Is
there any way to either retrieve 1. the entire doc (all fields that
can be retrieved) and/or 2. a group
Jingkang Zhang wrote:
Three HTML parsers (Lucene web application
demo, CyberNeko HTML Parser, JTidy) are mentioned in
Lucene FAQ
1.3.27. Which is the best? Can it filter tags that are
auto-created by MS Word's 'Save As HTML' function?
maybe you can try this library...
Hi.
In December I made some posts concerning a filter that could work by
getting the unicode name of a character and trying to figure out the
closest latin equivalent. For example, if it encountered character 00C1
LATIN CAPITAL LETTER A WITH ACUTE, it would be clever enough to replace
that
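A rough sketch of the name-based folding idea described above, in plain Java. It is not a Lucene TokenFilter and not the poster's actual code; the class name UnicodeNameFolder is made up for illustration. It assumes the JDK's Character.getName() returns the standard Unicode character name, and it only handles names of the form "LATIN CAPITAL/SMALL LETTER X WITH ...". Anything else is passed through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: folds accented Latin letters to their base letter
// by inspecting the Unicode character name.
final class UnicodeNameFolder {
    // Matches e.g. "LATIN CAPITAL LETTER A WITH ACUTE" and captures case and base letter.
    private static final Pattern LATIN_LETTER =
        Pattern.compile("^LATIN (CAPITAL|SMALL) LETTER ([A-Z]) WITH .+$");

    // Returns the base Latin letter for a code point, or the code point itself
    // when its name does not follow the expected pattern.
    static int fold(int codePoint) {
        String name = Character.getName(codePoint);
        if (name == null) {
            return codePoint; // unassigned code point
        }
        Matcher m = LATIN_LETTER.matcher(name);
        if (!m.matches()) {
            return codePoint;
        }
        char base = m.group(2).charAt(0);
        return "SMALL".equals(m.group(1)) ? Character.toLowerCase(base) : base;
    }

    // Folds every code point of the input text.
    static String fold(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> out.appendCodePoint(fold(cp)));
        return out.toString();
    }
}
```

Letters whose names do not contain "WITH" (ligatures such as AE, for example) are left alone in this sketch, so a real filter would still need a small exception table.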
Hi,
what happens when I add two fields with the same name to one Document?
Document doc = new Document();
doc.add(Field.Text("bla", "this is my first text"));
doc.add(Field.Text("bla", "this is my second text"));
Will the second text overwrite the first, because only one field can be held
with the same
Hi Karl,
From _Lucene in Action_, section 2.2, when you add the same field with
different values, "Internally, Lucene appends all the words together
and index them in a single Field ..., allowing you to use any of the
given words when searching."
See also
On Feb 1, 2005, at 4:21 AM, Jingkang Zhang wrote:
Lucene supports sorting by score or docID. Now I want to
sort search results by score and docID, or by two
fields at one time, like the SQL
command ORDER BY score, docID. How can I do it?
Sorting by multiple fields (including score and document id) is
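To make the "ORDER BY score, docID" ordering concrete, here is a minimal plain-Java sketch of a two-key comparison: score descending, then document id ascending as the tie-breaker. This is only an illustration of the ordering, not Lucene's sorting API; the Hit class and the ScoreThenDocSort name are invented for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for search hits; not a Lucene class.
final class ScoreThenDocSort {
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Highest score first; equal scores are ordered by ascending document id.
    static final Comparator<Hit> BY_SCORE_THEN_DOC =
        Comparator.comparingDouble((Hit h) -> h.score)
                  .reversed()
                  .thenComparingInt(h -> h.docId);

    // Returns a sorted copy and leaves the input list untouched.
    static List<Hit> sorted(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(BY_SCORE_THEN_DOC);
        return copy;
    }
}
```

The same pattern extends to any pair of keys: the first comparator decides, and the second is consulted only on ties.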
When I tested parsers a year or so ago for intensive use in Furl, the
best (tolerant of bad HTML) and fastest (tested on a 1.5M HTML page)
parser by far was TagSoup ( http://www.tagsoup.info ). It is actively
maintained and improved and I have never had any problems with it.
-Mike
Jingkang Zhang
Is there a way to eliminate duplicate hits being returned from the index?
Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS 66219
(913) 577-1496
[EMAIL PROTECTED]
Hi Chris, are your fields string or reader? How large do your fields get?
Kelvin
On Tue, 1 Feb 2005 01:40:39 -0800, Chris Fraschetti wrote:
Hey folks.. thanks in advance to any who respond...
I do a good deal of post-search processing and the file io to read
the fields I need becomes
On Feb 1, 2005, at 9:01 AM, Jerry Jalenak wrote:
Is there a way to eliminate duplicate hits being returned from the
index?
Sure, don't put duplicate documents in the index :)
Erik
Ok, OK. Should have that response coming 8-)
The documents I'm indexing are sent from a legacy system, and can be sent
multiple times - but I only want to keep the documents if something has
changed. If the indexed fields match exactly, I don't want to index the
second (or third, fourth, etc)
Hi,
I'm new to Lucene and want to know whether Lucene has the capability of
displaying the search results based on the user's rights.
For example:
Suppose there are some resources, like:
Resource 1
Resource 2
Resource 3
Resource 4
And there are say 2 users with
User 1 having access to
On Feb 01, 2005, at 16:01, Verma Atul (extern) wrote:
I'm new to Lucene and want to know whether Lucene has the capability
of displaying the search results based on the user's rights.
Not by itself. But you can make it so.
Cheers
--
PA, Onnay Equitursay
http://alt.textdrive.com/
Jerry Jalenak wrote:
Given Erik's response of 'don't put duplicate documents in the index', how
can I accomplish this in the IndexWriter?
I was dealing with a similar requirement recently. I eventually
decided on storing the MD5 checksum of the document as a keyword. It
means reading it
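For reference, a minimal sketch of computing such a checksum with the JDK's MessageDigest. Which fields go into the checksum is an assumption here (the caller passes whatever values should count as "the document"), and the DocumentChecksum name is invented for the example. The resulting hex string is what would be stored as the keyword.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: MD5 over the field values that define document identity.
final class DocumentChecksum {
    static String md5Hex(String... fieldValues) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String value : fieldValues) {
                md.update(value.getBytes(StandardCharsets.UTF_8));
                // Separator byte so that ("ab", "c") and ("a", "bc") hash differently.
                md.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

Two documents whose selected field values match exactly get the same checksum, which is the property the duplicate check relies on.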
Thanks for the help. This means that the User management has to be done
over Lucene.
-Original Message-
From: PA [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 4:06 PM
To: Lucene Users List
Subject: Re: User Rights Management in Lucene
On Feb 01, 2005, at 16:01, Verma Atul
Nice idea John - one I hadn't considered. Once you have the checksum, do
you 'check' in the index first before storing the second document? Or do
you filter on the query side?
Jerry Jalenak
On Feb 01, 2005, at 16:07, Verma Atul (extern) wrote:
Thanks for the help. This means that the User management has to be done
over Lucene.
Your choice. But in a nutshell, yes.
Cheers
--
PA, Onnay Equitursay
http://alt.textdrive.com/
On Feb 1, 2005, at 9:49 AM, Jerry Jalenak wrote:
Given Erik's response of 'don't put duplicate documents in the index',
how
can I accomplish this in the IndexWriter?
As John said - you'll have to come up with some way of knowing whether
you should index or not. For example, when dealing with
On Feb 1, 2005, at 10:01 AM, Verma Atul (extern) wrote:
Hi,
I'm new to Lucene and want to know whether Lucene has the capability
of displaying the search results based on the user's rights.
For example:
Suppose there are some resources, like:
Resource 1
Resource 2
Resource 3
Resource 4
And there
Jerry Jalenak wrote:
Nice idea John - one I hadn't considered. Once you have the checksum, do
you 'check' in the index first before storing the second document? Or do
you filter on the query side?
I do a quick search for the md5 checksum before indexing.
Although I suspect not applicable in
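The check-before-indexing step can be sketched as a simple gate. In this illustration an in-memory set stands in for the search against the index, which is an assumption made to keep the example self-contained; in practice the lookup would be a search on the stored checksum keyword. DedupGate is an invented name.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical gate: decides whether a document with the given checksum
// still needs to be indexed. The set is a stand-in for an index lookup.
final class DedupGate {
    private final Set<String> seenChecksums = new HashSet<>();

    // Returns true the first time a checksum is offered, false for repeats.
    boolean shouldIndex(String checksum) {
        return seenChecksums.add(checksum);
    }
}
```

Usage would be along the lines of: compute the checksum, call shouldIndex(checksum), and add the document only when it returns true.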
Just to make sure I understand
Do you keep an IndexReader open at the same time you are running the
IndexWriter? From what I can see in the JavaDocs, it looks like only
IndexReader (or IndexSearcher) can peek into the index and see if a document
exists or not
Thanks!
Jerry Jalenak
OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I
really don't want to 'batch' them up if I can avoid it. And I also don't
think I can keep an IndexReader open to the index at the same time I have an
IndexWriter open. I may have to try and deal with this issue through
Jerry Jalenak wrote:
OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I
really don't want to 'batch' them up if I can avoid it. And I also don't
think I can keep an IndexReader open to the index at the same time I have an
IndexWriter open. I may have to try and deal with
Is there a way to check if an IndexSearcher is closed?
Thanks in advance,
Ravi.
On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote:
OK - but I'm dealing with indexing between 1.5 and 2 million
documents, so I
really don't want to 'batch' them up if I can avoid it. And I also
don't
think I can keep an IndexReader open to the index at the same time I
have an
IndexWriter open.
I've indexed a large set of documents and think that something may have
gone wrong somewhere in the middle. Is there a way I can display the
count of documents in the index?
Thanks,
Jim.
Not sure if the API provides a method for this, but you could use Luke:
http://www.getopt.org/luke/
It gives you a count and lets you step through each Doc looking at their
fields.
- Original Message -
From: Jim Lynch [EMAIL PROTECTED]
To: Lucene Users List
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#docCount()
You can try this.
-Original Message-
From: Luke Shannon [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 11:33 AM
To: Lucene Users List
Subject: Re: How to get document count?
Not
I think that depends on what you want to do. The Lucene demo parser does
simple mapping of HTML files into Lucene Documents; it does not give you a
parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the
same API; will likely become part of Xerces), and so maps an HTML
Well all my fields are strings when I index them. They're all very
short strings, dates, hashes, etc. The largest field has a cap of 256
chars and there is only one of them, the rest are all fairly small.
Can you explain what you meant by 'string or reader' ?
Thanks,
Chris
On Tue, 1 Feb 2005
Erik Hatcher wrote:
On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote:
OK - but I'm dealing with indexing between 1.5 and 2 million
documents, so I
really don't want to 'batch' them up if I can avoid it. And I also
don't
think I can keep an IndexReader open to the index at the same time I
have an
That works, thanks. I can't use Luke on this system. It fails for
some reason.
Jim.
Ravi wrote:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#docCount()
You can try this.
-Original Message-
From: Luke Shannon [mailto:[EMAIL PROTECTED]
Sent:
I wasn't sure where in this thread to reply so I'm replying to myself :)
What search appliances exist now?
I only found 3:
[1] Google
[2] Thunderstone
http://www.thunderstone.com/texis/site/pages/Appliance.html
[3] IndexEngines (not out yet)
I've been merrily cooking along, thinking I was replacing documents when
I haven't. My logic is to go through a batch of documents, get a field
called "reference" which is unique, build a term from it, and delete it
via the reader.delete() method. Then I close the reader and open a
writer and
I've had success with deletion by running IndexReader.delete(int), then
getting an IndexWriter and optimizing the directory. I don't know if
that's the right way to do it or not.
On Tue, 1 Feb 2005, Jim Lynch wrote:
I've been merrily cooking along, thinking I was replacing documents when
I
Please see inline.
On Tue, 1 Feb 2005 09:27:26 -0800, Chris Fraschetti wrote:
Well all my fields are strings when I index them. They're all very
short strings, dates, hashes, etc. The largest field has a cap of
256 chars and there is only one of them, the rest are all fairly
small.
Can you
Thanks, I'd try that, but I don't think it will make any difference. If
I modify the code to not reindex the documents, no files in the index
directory are touched, hence there is no record of the deletions
anywhere. I checked the count coming back from the delete operation and
it is zero.
Well, in LuceneRAR, the delete by id code does exactly what I said: gets
the indexreader, deletes the doc id, then it opens a writer and optimizes.
Nothing else.
On Tue, 1 Feb 2005, Jim Lynch wrote:
Thanks, I'd try that, but I don't think it will make any difference. If
I modify the code to
Definitely a good idea on the one line idea... that could possibly
save a good amount of time. I'm using .stringValue ... in reality, I
hadn't ever even considered readerValue ... is there a strong
performance difference between the two? or is it simply on the
functionality side?
The basic post
Hello;
I have a situation where I need to combine the fields returned from one
document to an existing document.
Is there something in the API for this that I'm missing or is this the best
way:
//add the fields contained in the PDF document to the existing doc Document
Document attachedDoc =
On Tue, 1 Feb 2005 14:12:54 -0800, Chris Fraschetti wrote:
Definitely a good idea on the one line idea... that could possibly
save a good amount of time. I'm using .stringValue ... in reality,
I hadn't ever even considered readerValue ... is there a strong
performance difference between the
Hello All,
What should my query look like if I want to search for all or any of the
following keywords?
Sun Linux Red Hat Advance Server
replies are much appreciated.
-H
Another question for the day:
How can I make sure that only results containing all of the specified
keywords are shown?
e.g.
the query Red AND HAT AND Linux
should return documents which have all three keywords and not
show documents that have only one or two of them.
How are you indexing your document?
If you're using QueryParser with the default operator set to OR (which
is the default), then you've already provided the expression you need
:)
Erik
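As a small illustration of the two cases asked about in this thread ("any of the keywords" versus "all of the keywords"), the sketch below builds query strings for QueryParser. It assumes the default operator is OR, so space-separated terms match any keyword, and it uses the "+" prefix to mark a term as required. KeywordQuery is an invented helper name, and the keywords are assumed to be single words with no special characters.

```java
// Hypothetical helper that builds QueryParser-style query strings.
final class KeywordQuery {
    // Space-separated terms: matches documents containing any keyword
    // when the parser's default operator is OR.
    static String anyOf(String... keywords) {
        return String.join(" ", keywords);
    }

    // Each term prefixed with '+': every keyword is required.
    static String allOf(String... keywords) {
        StringBuilder query = new StringBuilder();
        for (String keyword : keywords) {
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append('+').append(keyword);
        }
        return query.toString();
    }
}
```

A multi-word name such as Red Hat would need to be quoted as a phrase to be matched as a unit; this sketch treats every keyword as an independent term.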
On Feb 1, 2005, at 6:29 PM, Hetan Shah wrote:
Hello All,
What should my query look like if I want to
On Feb 1, 2005, at 7:36 PM, Hetan Shah wrote:
Another question for the day:
How can I make sure that only results containing all of the specified
keywords are shown?
e.g.
the query Red AND HAT AND Linux
should return documents which have all three keywords and not
show
details?
Yousef Ourabi wrote:
Saad,
Here is what I got. I will post again, and be more
specific.
-Y
--- Nader Henein [EMAIL PROTECTED] wrote:
We'll need a little more detail to help you, what
are the sizes of your
updates and how often are they updated.
1) No, just re-open the index writer
Hi,
If you are working on a CMS or similar app and want a user rights
module, you can store the rights information as metadata in the index
and then search on that metadata.
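One way to picture the metadata approach: wrap the user's query so that a match on the rights field is also required. The sketch below only builds the query string. The field name allowedUser is purely hypothetical (it assumes each document was indexed with one such value per permitted user), RightsQuery is an invented name, and the user id is assumed to be a simple token with no query-syntax characters.

```java
// Hypothetical helper: restricts a user query to documents the user may see.
final class RightsQuery {
    // Both clauses are marked required with '+': the original query
    // and a match on the (assumed) allowedUser metadata field.
    static String restrict(String userQuery, String userId) {
        return "+(" + userQuery + ") +allowedUser:" + userId;
    }
}
```

For example, restrict("lucene parser", "user1") produces "+(lucene parser) +allowedUser:user1", so a document the user cannot access never reaches the result list.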
With Regards,
Chandrashekhar V Deshmukh
- Original Message -
From:
Hi,
I am getting this exception now and then when I am indexing content.
It doesn't always happen. But when it happens, I have to delete the
index and start over again.
This is a serious problem.
In this email, Doug said it has something to do with win32's lack of
atomic renaming.
: anywhere. I checked the count coming back from the delete operation and
: it is zero. I even tried to delete another unique term with similar
: results.
First off, are you absolutely certain you are closing the reader? It's
not in the code you listed.
Second, I'd bet $1 that when your
Hi
May I know whether Lucene currently supports indexing of xml documents?
I tried building an index to index all my directories in webapps:
via:
java org.apache.lucene.demo.IndexFiles /homedir/tomcat/webapps
then I tried using the following command to search:
java