Hey
Look at the file Test.java under lucene-1.4; it strips out HTML tags and gives
you the content...
with regards
Karthik
-Original Message-
From: root [mailto:root]On Behalf Of Mahesh
Sent: Thursday, May 20, 2004 11:13 AM
To: [EMAIL PROTECTED]
Subject: How do I prevent the HTML tags being added
I am using Lucene 1.4 to index the information.
I have a lot of HTML tags in the information that I will be indexing, so
let me know if there is any way to keep the HTML tags from being
indexed..
MAHESH
-
To unsubscribe
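Beyond the Test.java example mentioned above, a minimal sketch of stripping tags before indexing can be done with plain java.util.regex (no Lucene classes involved). This is a naive approach and will stumble on malformed HTML; the demo HTMLParser is more robust, but for simple documents this is often enough:

```java
import java.util.regex.Pattern;

public class TagStripper {
    // Matches any <...> tag, including its attributes.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String strip(String html) {
        // Replace tags with spaces, then collapse the whitespace left behind.
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // prints "Hello world"
        System.out.println(strip("<html><body><b>Hello</b> world</body></html>"));
    }
}
```

Feed the stripped string to the IndexWriter instead of the raw HTML and no tags will be indexed.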
On May 20, 2004, at 04:38, Erik Hatcher wrote:
OffTopic: havoc and Struts go well together ;) Pick up Tapestry
instead!
Nah. Keep it really Simple [1] instead :o)
http://simpleweb.sourceforge.net/
PA.
On May 19, 2004, at 8:04 AM, Timothy Stone wrote:
Could you elaborate on what you mean by MVC here? A value list
handler piece has been developed and links posted to it on this list
- if this is the type of thing you're referring to.
Again, maybe I was naively associating the "SearchBean" with s
Morus Walter wrote:
Kevin Burton writes:
How much interest is there for this? I have to do this for work and
will certainly put in the extra effort to make this a standard Lucene
feature.
Sounds interesting.
How would you handle deletions?
They aren't a requirement in our scenario
Morus Walter wrote:
I don't understand that.
You get the Document object, which does not itself hold the document's field
contents; it just provides access to this data.
It's up to you which fields you access.
And remember that you don't have to store fields at all, if you don't need
to retrieve them (e
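The point above about not storing fields can be sketched with the Lucene 1.4 Field factory methods (field names and values here are hypothetical, and this fragment assumes a surrounding indexing loop):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// Stored and indexed: retrievable later via doc.get("title").
doc.add(Field.Text("title", "Spanish Resorts Guide"));
// Indexed but NOT stored: searchable, yet never read back from the index,
// which keeps the index and the retrieved Document objects small.
doc.add(Field.UnStored("body", "... full article text ..."));
```

Whether to store a field comes down to whether you ever need its original text back at search time.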
> Thanks for "highlighting" the problem with the Javadocs...
Groan. :)
Regards,
Bruce Ritchie
>>Was investigating, found some compile-time errors..
I see the code you have is taken from the example in the Javadocs. Unfortunately, that
example wasn't complete, because the class didn't
include the method defined in the Formatter interface. I have updated the Javadocs to
correct this oversight.
Here is an example method in org.apache.lucene.demo.html HTMLParser that
uses a different buffered reader for a different encoding.
public Reader getReader() throws IOException
{
    if (pipeIn == null)
    {
        pipeInStream = new MyPip
The tokenizers deal with Unicode characters (CharStream, char), so the
problem is not there. This problem must be solved at the point where the
bytes from your source files are turned into CharSequences/Strings, i.e. by
connecting an InputStreamReader (with the right charset) to your FileInputStream,
instead of using a FileReader, which always decodes with the platform default
encoding.
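A small self-contained sketch of that decoding step, independent of Lucene: wrap the byte stream in an InputStreamReader with an explicit "UTF-8" charset before any analyzer sees the text. (The helper name here is made up for illustration.)

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class Utf8Read {
    public static String readAll(InputStream in) throws IOException {
        // The charset name makes the decoding explicit; FileReader would
        // silently use the platform default instead.
        BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = r.read()) != -1) sb.append((char) c);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xBC};  // UTF-8 bytes for 'ü'
        System.out.println(readAll(new ByteArrayInputStream(utf8)));  // prints "ü"
    }
}
```

The Reader returned this way can be handed straight to `IndexWriter.addDocument` via a Reader-based field.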
Hi,
I had a quick look at the sandbox but my problem is that I don't need a
spanish stemmer. However there must be a replacement tokenizer that supports
foreign characters to go along with the foreign language snowball stemmers.
Does anyone know where I could find one?
In answer to Peter's quest
Hi Hannah, Otis
I cannot help, but I have exactly the same problems with special German
characters. I used the Snowball analyzer, but this does not help, because the
problem (tokenizing) appears before the analyzer comes into action.
I just posted the question "Problem tokenizing UTF-8 with German umlauts".
It looks like Snowball project supports Spanish:
http://www.google.com/search?q=snowball spanish
If it does, take a look at Lucene Sandbox. There is a project that
allows you to use Snowball analyzers with Lucene.
Otis
--- Hannah c <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I am indexing a numb
Hi Kevin,
There is no API for this, and I agree it would be handy.
Otis
--- Kevin Burton <[EMAIL PROTECTED]> wrote:
> Say I have a query result for the term Linux... now I just want the
> TITLE of these documents not the BODY.
>
> To further this scenario imagine the TITLE is 500 bytes but the
Hello,
I have HTML documents which are UTF-8 encoded and contain English and/or
German content. I have written my own Analyzer and Filter to replace the
German umlauts with the commonly used character pairs (ü=ue, ä=ae, ö=oe)
to avoid any problems. Still, in the HTML code the German umlauts are sh
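The umlaut substitution described above can be sketched as a plain string transform (the kind of logic one would put inside a custom TokenFilter; the class name here is made up):

```java
public class UmlautNormalizer {
    public static String normalize(String s) {
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u00E4': out.append("ae"); break; // ä
                case '\u00F6': out.append("oe"); break; // ö
                case '\u00FC': out.append("ue"); break; // ü
                case '\u00C4': out.append("Ae"); break; // Ä
                case '\u00D6': out.append("Oe"); break; // Ö
                case '\u00DC': out.append("Ue"); break; // Ü
                case '\u00DF': out.append("ss"); break; // ß
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "Mueller strasse"
        System.out.println(normalize("M\u00FCller stra\u00DFe"));
    }
}
```

Note that this only helps once the characters have been decoded correctly; as discussed above, if the UTF-8 bytes are mangled before tokenizing, the filter never sees a real 'ü' to replace.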
Hi,
I am indexing a number of English articles on Spanish resorts. As such
there are a number of Spanish characters throughout the text; most of these
are in the place names, which are the type of words I would like to use as
queries. My problem is with the StandardTokenizer class, which cuts the w
Thanks, I will look at the sorting code. Sorting results by date is
next on the list. For now, I only have a small number of documents, but the
set is to grow to over 8 million documents for the collection I am
working on. Another collection we have is 40 million documents or so.
From what you
Erik Hatcher wrote:
On May 18, 2004, at 1:43 PM, Timothy Stone wrote:
Erik Hatcher wrote:
Lucene 1.4 (now in release candidate stage) includes built-in sorting
capabilities, so I definitely recommend you have a look at that.
SearchBean is effectively deprecated based on this new much more
po
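The built-in sorting mentioned above can be sketched as follows, assuming the Lucene 1.4 Sort API (the index path and field names here are hypothetical):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;

IndexSearcher searcher = new IndexSearcher("/path/to/index");
Query query = new TermQuery(new Term("contents", "lucene"));
// Order hits by an indexed, untokenized "date" field instead of by score.
Hits hits = searcher.search(query, new Sort("date"));
```

For this to work, the sort field should be indexed as a single untokenized term per document (e.g. a Keyword field with a sortable date format such as yyyymmdd).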
There is no problem with updating and searching simultaneously. Two threads updating
simultaneously on the same index on NFS can be a problem, as the locking does not work
reliably. Have a look through the archives for NFS, there are some solutions
scattered about.
David
-Original Messag
Hey Lucene Users
My original intention for indexing was to
index certain portions of HTML [not the whole document].
If JTidy does not support this, then what are my options?
Karthik
-Original Message-
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 19, 200
I doubt it can be used as a plug-in, but it would be good to know for sure.
Regards,
Kiran.
-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 12:30
To: Lucene Users List
Subject: RE: SELECTIVE Indexing
Hi
Can I Use TIDY [as plug in ] wi
Hey Guys
Found some Highlighter package in the CVS directory.
Was investigating, found some compile-time errors..
Please somebody tell me what this is.
The code:
private IndexReader reader = null;
private Highlighter highlighter = null;
public SearchFiles() { }
public void searchIndex0(S