RE: How do i prevent the HTML tags being added to Lucene Index..

2004-05-19 Thread Karthik N S

Hey

Look at the file Test.java under Lucene 1.4; it strips out HTML tags and gives
you the content...

with regards
Karthik
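[Editor's note: the core of tag stripping is small enough to sketch. The following is an illustrative state machine only, not the demo's actual Test.java/HTMLParser code, which also handles comments, scripts, and entities:]

```java
// Minimal illustration of HTML tag stripping: drop everything between
// '<' and '>'.  A sketch only -- the real Lucene demo parser also deals
// with entities, comments and script blocks, which this does not.
public class TagStripper {
    public static String strip(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;          // entering a tag
            } else if (c == '>') {
                inTag = false;         // leaving a tag
            } else if (!inTag) {
                out.append(c);         // keep only text outside tags
            }
        }
        return out.toString();
    }
}
```

Feeding the stripped string to the IndexWriter, instead of the raw HTML, keeps the tags out of the index.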

-Original Message-
From: root [mailto:root]On Behalf Of Mahesh
Sent: Thursday, May 20, 2004 11:13 AM
To: [EMAIL PROTECTED]
Subject: How do i prevent the HTML tags being added to Lucene Index..


I am using Lucene 1.4 to index my information.
The information I will be indexing contains a lot of HTML tags, so
let me know if there is any way to keep the HTML tags from being
indexed..


MAHESH




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





How do i prevent the HTML tags being added to Lucene Index..

2004-05-19 Thread Mahesh
I am using Lucene 1.4 to index my information.
The information I will be indexing contains a lot of HTML tags, so
let me know if there is any way to keep the HTML tags from being
indexed..


MAHESH







Re: Lucene and MVC (was Re: Bad file descriptor (IOException) using SearchBean contribution)

2004-05-19 Thread petite_abeille
On May 20, 2004, at 04:38, Erik Hatcher wrote:
OffTopic: havoc and Struts go well together ;)  Pick up Tapestry 
instead!
Nah. Keep it really Simple [1] instead :o)
http://simpleweb.sourceforge.net/
PA.


Lucene and MVC (was Re: Bad file descriptor (IOException) using SearchBean contribution)

2004-05-19 Thread Erik Hatcher
On May 19, 2004, at 8:04 AM, Timothy Stone wrote:
Could you elaborate on what you mean by MVC here?  A value list 
handler piece has been developed and links posted to it on this list 
- if this is the type of thing you're referring to.
Again, maybe I was naively associating the "SearchBean" with something 
that it was not supposed to be doing. To elaborate, I would like to
take the demo, which has been working with some success for two years 
on my site, and follow the suggestions of Andrew C. Oliver and go 
"Model 2 on the demo."
I've never seen a user story (or use case) that said "this feature must 
use MVC" :)  What is the purpose of going MVC here?  Is it just for 
architectural purity?

So the SearchBean's purpose, as I understood it, was to provide a 
Model 2 component for use in JSPs.
Consider a query that generates a million hits.  How should the JSP 
iterate over them?  In a pure MVC world, the JSP would be pushed the 
hits and allowed to display them however it likes.  With Lucene Hits, 
you get this capability already.  I'm just not convinced a wrapper is 
needed, especially now that sorting is built-in.

Again, I'm open to being convinced otherwise.
A value list handler piece has been developed and links posted to it
on this list - if this is the type of thing you're referring to.
I tried looking for references to such, but no luck.
http://www.nitwit.de/vlh2/
Also, for JSP use, there is the taglib contribution in the sandbox that 
might be of interest to you.  I've not gotten it to work, yet, and it's 
not quite my cup of tea (being an anti-JSP kinda guy that is).

I must admit that I get the feeling that "newbies" to Lucene seem to 
get less attention on the list. I'm one that tries real hard to 
research my question first in the archives (marc.theaimsgroup.com) 
then on the web. Even I get frustrated on some lists where the most 
obvious question is being asked and the asker misses hints and 
outright help.
When I was first learning Ant, I lurked on the ant-user list, and when a 
question came up that I knew, I'd answer it.  When one came up that I 
didn't know, I'd research it by experimenting and cross-referencing in 
the source code to try to figure it out.  We really get out of this 
community what we put into it, in my opinion.  Newbies need to be savvy 
and do some homework and not expect everything to be spelled out 
beautifully - none of us have time to flesh out full-fledged example 
applications to answer every question.  Sometimes a question comes 
along that I could reply to, but I let it go because I'm crushed for 
time as it is.  Sometimes I'll answer - especially if the question 
piques my curiosity or has some aspect of a challenge for me to learn 
something new.

I personally try to answer professionally and thoroughly, but sometimes 
I might answer off-the-cuff or quickly and it comes out a bit tersely 
or perhaps intimidating.  My contributions as a whole, though, are 
hopefully taken positively by the community.

 The Lucene User list can be intimidating even for the advanced novice 
who may be on the right track but not phrasing or wording or 
describing the problem or task in front of him/her.
You are not the only one that gets blown away by things on this list.  
There are many times I've been baffled and completely mind-blown by 
things here - what underlies Lucene and what folks can build around it 
is simply astonishing.  This is no typical open source project we're 
dealing with here.  Thankfully the API is so straightforward to use, 
though, that Lucene usage is clear - it's the bigger picture that is 
daunting (to me).

I'm personally reading Managing Gigabytes at the moment, and my head is 
spinning.  But it is helping me get a clearer picture of the underlying 
concepts that Lucene is built upon.

a new desire to tackle Struts, and well, havoc ensues.
OffTopic: havoc and Struts go well together ;)  Pick up Tapestry 
instead!

Erik


Re: Internal full content store within Lucene

2004-05-19 Thread Kevin Burton
Morus Walter wrote:
Kevin Burton writes:
 

How much interest is there for this?  I have to do this for work and 
will certainly put in the extra effort to make this a standard Lucene 
feature. 

   

Sounds interesting.
How would you handle deletions?
 

They aren't a requirement in our scenario... It would probably be more 
efficient to just leave the content on disk.

If you want to GC over time, the arc files can be grouped together by 
time, so you can eventually just delete a whole arc file...

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



Re: Possible to fetch a document without all fields for performance?

2004-05-19 Thread Kevin Burton
Morus Walter wrote:
I don't understand that.
You get the document object which does not contain the documents field
contents. It just provides access to this data.
It's up to you which fields you access.
And remember that you don't have to store fields at all, if you don't need 
to retrieve them (e.g. because the original documents are somewhere else). 
 

Nope... when you get the Document, the fields are already pre-parsed from 
disk. Even if you don't call ANY methods to get fields, it still has to read 
all the fields off disk.

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



RE: org.apache.lucene.search.highlight.Highlighter

2004-05-19 Thread Bruce Ritchie

> Thanks for "highlighting" the problem with the Javadocs...

Groan. :)


Regards,

Bruce Ritchie




Re: org.apache.lucene.search.highlight.Highlighter

2004-05-19 Thread markharw00d
>> Was investigating, found some compile time errors..
 
I see the code you have is taken from the example in the Javadocs. Unfortunately that
example wasn't complete, because the class didn't include the method defined in the
Formatter interface. I have updated the Javadocs to correct this oversight.

To correct your problem, either make your class implement the Formatter interface to
perform your choice of custom formatting, or remove the "this" parameter from your
call to create a new Highlighter with the default Formatter implementation.

Thanks for "highlighting" the problem with the Javadocs...

Cheers
Mark





RE: AW: Problem indexing Spanish Characters

2004-05-19 Thread wallen
Here is an example method from org.apache.lucene.demo.html.HTMLParser that
uses a different buffered reader for a different encoding.

public Reader getReader() throws IOException
{
    if (pipeIn == null)
    {
        pipeInStream = new MyPipedInputStream();
        pipeOutStream = new PipedOutputStream(pipeInStream);
        pipeIn = new InputStreamReader(pipeInStream);
        pipeOut = new OutputStreamWriter(pipeOutStream);
        // check the first 4 bytes for the FFFE marker; if it's there,
        // we know it's UTF-16 encoding
        if (useUTF16)
        {
            try
            {
                pipeIn = new BufferedReader(
                        new InputStreamReader(pipeInStream, "UTF-16"));
            }
            catch (Exception e)
            {
            }
        }
        Thread thread = new ParserThread(this);
        thread.start(); // start parsing
    }
    return pipeIn;
}
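[Editor's note: the "check the first 4 bytes for FFFE marker" comment refers to sniffing a byte-order mark before picking the reader's encoding. A hedged sketch of such a check, not the demo's code; the PushbackInputStream must be constructed with a pushback buffer of at least two bytes:]

```java
import java.io.IOException;
import java.io.PushbackInputStream;

// Peeks at the first two bytes for a UTF-16 byte-order mark
// (FE FF big-endian, FF FE little-endian) and pushes them back so the
// stream can still be read from the beginning afterwards.
public class BomSniffer {
    public static boolean looksLikeUtf16(PushbackInputStream in) throws IOException {
        byte[] bom = new byte[2];
        int n = in.read(bom);
        if (n > 0) {
            in.unread(bom, 0, n); // restore the bytes for the real reader
        }
        return n == 2
            && ((bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF)
             || (bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE));
    }
}
```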

-Original Message-
From: Martin Remy [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 19, 2004 2:09 PM
To: 'Lucene Users List'
Subject: RE: AW: Problem indexing Spanish Characters


The tokenizers deal with unicode characters (CharStream, char), so the
problem is not there.  This problem must be solved at the point where the
bytes from your source files are turned into CharSequences/Strings, i.e. by
connecting an InputStreamReader to your FileReader (or whatever you're
using) and specifying "UTF-8" (or whatever encoding is appropriate) in the
InputStreamReader constructor.  

You must either detect the encoding from HTTP headers or XML declarations
or, if you know that it's the same for all of your source files, then just
hardcode UTF-8, for example.  
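[Editor's note: concretely, FileReader always uses the platform default encoding, so the fix is to build the Reader yourself. A minimal sketch, assuming the sources are UTF-8; the class and method names here are illustrative, not a Lucene API:]

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

public class EncodedFileOpener {
    // Opens a file with an explicit character encoding instead of the
    // platform default that "new FileReader(path)" would silently use.
    public static Reader open(String path, String encoding) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), encoding));
    }
}
```

The Reader returned here can be handed to the indexing code anywhere a FileReader would otherwise be used.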

Martin

RE: AW: Problem indexing Spanish Characters

2004-05-19 Thread Martin Remy
The tokenizers deal with unicode characters (CharStream, char), so the
problem is not there.  This problem must be solved at the point where the
bytes from your source files are turned into CharSequences/Strings, i.e. by
connecting an InputStreamReader to your FileReader (or whatever you're
using) and specifying "UTF-8" (or whatever encoding is appropriate) in the
InputStreamReader constructor.  

You must either detect the encoding from HTTP headers or XML declarations
or, if you know that it's the same for all of your source files, then just
hardcode UTF-8, for example.  

Martin



RE: AW: Problem indexing Spanish Characters

2004-05-19 Thread Hannah c
Hi,
I had a quick look at the sandbox, but my problem is that I don't need a 
Spanish stemmer. However, there must be a replacement tokenizer that supports 
foreign characters to go along with the foreign-language Snowball stemmers. 
Does anyone know where I could find one?

In answer to Peter's question: yes, I'm also using "UTF-8" encoded XML 
documents as the source.
I have also put below an example of what happens when I tokenize the text 
using the StandardTokenizer.

Thanks Hannah

--text I'm trying to index
century palace known as la “Fundación Hospital de Na. Señora del Pilar”
-tokens output from StandardTokenizer
century
palace
known
as
la
â
FundaciÃ*
n   *
Hospital
de
Na
Seà *
ora   *
del
Pilar
â
---

From: "Peter M Cipollone" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Subject: Re: Problem indexing Spanish Characters
Date: Wed, 19 May 2004 11:41:28 -0400
could you send some sample text that causes this to happen?
- Original Message -
From: "Hannah c" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 11:30 AM
Subject: Problem indexing Spanish Characters
>
> Hi,
>
> I  am indexing a number of English articles on Spanish resorts. As such
> there are a number of spanish characters throught the text, most of 
these
> are in the place names which are the type of words I would like to use 
as
> queries. My problem is with the StandardTokenizer class which cuts the
word
> into two when it comes across any of the spanish characters. I had a 
look
at
> the source but the code was generated by JavaCC and so is not very
readable.
> I was wondering if there was a way around this problem or which area of
the
> code I would need to change to avoid this.
>
> Thanks
> Hannah Cumming
>
>
>
>
>
>




Hannah 
Cumming
[EMAIL PROTECTED]




AW: Problem indexing Spanish Characters

2004-05-19 Thread PEP AD Server Administrator
Hi Hannah, Otis
I cannot help but I have excatly the same problems with special german
charcters. I used snowball analyser but this does not help because the
problem (tokenizing) appears before the analyser comes into action.
I just posted the question "Problem tokenizing UTF-8 with geman umlauts"
some minutes ago which describes my problem and Hannahs seem to be similar.
Do you have also UTF-8 encoded pages?

Peter MH





Re: Problem indexing Spanish Characters

2004-05-19 Thread Otis Gospodnetic
It looks like the Snowball project supports Spanish:
http://www.google.com/search?q=snowball spanish

If it does, take a look at the Lucene Sandbox.  There is a project that
allows you to use Snowball analyzers with Lucene.

Otis






Re: Possible to fetch a document without all fields for performance?

2004-05-19 Thread Otis Gospodnetic
Hi Kevin,

There is no API for this, and I agree it would be handy.

Otis

--- Kevin Burton <[EMAIL PROTECTED]> wrote:
> Say I have a query result for the term Linux... now I just want the 
> TITLE of these documents not the BODY.
> 
> To further this scenario, imagine the TITLE is 500 bytes but the BODY
> is 50M.
> 
> The current impl of fetching a document will pull in ALL 50,000,500 
> bytes not just the 500 that I need. 
> 
> Obviously if I could just get the TITLE field this would be a HUGE
> speedup.
> 
> Is there a somewhat simple and efficient way to get a document with a
> 
> restricted set of fields?  Digging through the API it didn't seem
> obvious.
> 
> Kevin
> 
> -- 
> 
> Please reply using PGP.
> 
> http://peerfear.org/pubkey.asc
> 
> NewsMonster - http://www.newsmonster.org/
> 
> Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
>AIM/YIM - sfburtonator,  Web - http://peerfear.org/
> GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
>   IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
> 






Problem tokenizing UTF-8 with geman umlauts

2004-05-19 Thread PEP AD Server Administrator
Hello,
I have HTML documents which are UTF-8 encoded and contain English and/or
German content. I have written my own analyser and filter to replace the
German umlauts with the commonly used pairs of characters (ü=ue, ä=ae, ö=oe)
to avoid any problems. Still, in the HTML code the German umlauts are shown
as a pair of characters representing the UTF-8 encoding (I think). As a
result, the StandardTokenizer is misinterpreting the string and splitting a
word with an umlaut into 2 tokens, which is of no use anymore.
Does anyone have experience with this case and can help me back onto the road?
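[Editor's note: for reference, the umlaut substitution itself reduces to a character mapping like the sketch below. Illustrative only, and note it can only work after the bytes have been decoded with the correct charset; applied to mis-decoded UTF-8 it has nothing to match, which is exactly the problem described above:]

```java
public class UmlautFolder {
    // Replaces German umlauts with their common ASCII digraphs
    // (ue, ae, oe, ss) so that "Müller" and "Mueller" index identically.
    public static String fold(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u00fc': out.append("ue"); break; // ü
                case '\u00e4': out.append("ae"); break; // ä
                case '\u00f6': out.append("oe"); break; // ö
                case '\u00dc': out.append("Ue"); break; // Ü
                case '\u00c4': out.append("Ae"); break; // Ä
                case '\u00d6': out.append("Oe"); break; // Ö
                case '\u00df': out.append("ss"); break; // ß
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```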

Peter MH




Problem indexing Spanish Characters

2004-05-19 Thread Hannah c
Hi,
I am indexing a number of English articles on Spanish resorts. As such, 
there are a number of Spanish characters throughout the text; most of these 
are in the place names, which are the type of words I would like to use as 
queries. My problem is with the StandardTokenizer class, which cuts a word 
in two when it comes across any of the Spanish characters. I had a look at 
the source, but the code was generated by JavaCC and so is not very readable. 
I was wondering if there is a way around this problem, or which area of the 
code I would need to change to avoid it.

Thanks
Hannah Cumming



Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

2004-05-19 Thread Claude Devarenne
Thanks, I will look at the sorting code.  Sorting results by date is 
next on my list.  For now, I only have a small number of documents, but the 
set is to grow to over 8 million documents for the collection I am 
working on.  Another collection we have is 40 million documents or so.  
From what you say, it seems that sorting will not scale when 
I get to larger numbers of documents.  I am considering using an SQL 
back end to implement sorting: bring back the unique IDs from Lucene 
and then sort in SQL.

Claude
On May 18, 2004, at 11:23 PM, Morus Walter wrote:
Claude Devarenne writes:
Hi,
I have over 60,000 documents in my index, which is slightly over 1 GB
in size.  The documents range from the late seventies up to now.  I
have indexed dates as a keyword field using a string, because the dates
are in YYYYMMDD format.  When I do range queries, things are OK as long
as I don't exceed the built-in number of boolean clauses, so that's a
range of 3 years, e.g. 1979 to 1981.  The users are not only doing
complex queries but also want to query over long ranges, e.g. 
[19790101
TO 19991231].

Given these requirements, I am thinking of doing a query without the
date range, bring the unique ids back from the hits and then do a date
query in the SQL database I have that contains the same data.  Another
alternative is to do the query without the date range in Lucene and
then sort the results within the range.  I still have to learn how to
use the new sorting code and confessed I did not have time to look at
it yet.
Is there a simpler, easier way to do this?
I think it would be worth taking a look at the sorting code.
The idea of the sorting code is to have an in-memory array of the dates for 
each doc and to access this array for sorting.
Now sorting isn't the only thing one might use this array for.
Doing a range check is another.
So you might extend the sorting code with a range selection.

There is no code for this in Lucene, and you have to create your own 
searcher, but it gives you a fast way to search and sort by date.

I did this independently of the new sorting code (I just started a little 
too early) and it works quite well.
The only drawback of this (and of the new sorting code) is that it requires 
an array of field values that must be rebuilt each time the index changes.
Shouldn't be a problem for 60,000 documents.

Morus
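[Editor's note: the reason this works for dates in YYYYMMDD form is that fixed-width digit strings compare lexicographically in chronological order, so the in-memory range check is two String.compareTo calls per document. A self-contained sketch of the idea; the dates array stands in for the field values loaded once from the index, and all names are illustrative:]

```java
import java.util.ArrayList;
import java.util.List;

public class DateRangeCheck {
    // dates[docId] holds the document's date as a YYYYMMDD string, as it
    // would be read once from the index into memory (the "sorting array").
    // Lexicographic order on fixed-width YYYYMMDD equals chronological order,
    // so no parsing is needed per document.
    public static List<Integer> docsInRange(String[] dates, String from, String to) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int doc = 0; doc < dates.length; doc++) {
            String d = dates[doc];
            if (d != null && d.compareTo(from) >= 0 && d.compareTo(to) <= 0) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

Because this is a linear scan over an in-memory array, it avoids expanding the range into boolean clauses entirely, at the cost of rebuilding the array when the index changes.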



Re: Bad file descriptor (IOException) using SearchBean contribution

2004-05-19 Thread Timothy Stone
Erik Hatcher wrote:
On May 18, 2004, at 1:43 PM, Timothy Stone wrote:
Erik Hatcher wrote:
Lucene 1.4 (now in release candidate stage) includes built-in sorting 
 capabilities, so I definitely recommend you have a look at that.   
SearchBean is effectively deprecated based on this new much more  
powerful feature.
Erik

Forgive my naivety, but isn't the purpose of the SearchBean more than 
just sorting? Without the SearchBean, creating an MVC demo becomes a 
larger exercise to undertake.

Could you elaborate on what you mean by MVC here?  A value list handler 
piece has been developed and links posted to it on this list - if this 
is the type of thing you're referring to.
Again, maybe I was naively associating the "SearchBean" with something 
that it was not supposed to be doing. To elaborate, I would like to take 
the demo, which has been working with some success for two years on my 
site, and follow the suggestions of Andrew C. Oliver and go "Model 2 on 
the demo."

You and I have moved away in this thread from my original question, why 
I am getting the IOException: Bad File Descriptor, *and that is okay*, 
I'm learning a lot. However, I hope that we can come back to it later, 
if necessary off-list.

So the SearchBean's purpose, as I understood it, was to provide a Model 
2 component for use in JSPs.

A value list handler piece has been developed and links posted to it
on this list - if this is the type of thing you're referring to.
I tried looking for references to such, but no luck.
[snip]
I'd love to hear how folks are using SearchBean though, and why they 
feel it is beneficial.
See above as to how I think it could to be used. :)
I agree that Lucene offers a tremendous amount of power! Kudos to all of 
the developers working so hard on this. It is a testament to the 
flexibility of Java.

I must admit that I get the feeling that "newbies" to Lucene seem to get 
less attention on the list. I'm one who tries hard to research my 
question first in the archives (marc.theaimsgroup.com) and then on the web. 
Even I get frustrated on some lists where the most obvious question is 
being asked and the asker misses hints and outright help. The Lucene 
User list can be intimidating even for the advanced novice who may be on 
the right track but not phrasing, wording, or describing the problem or 
task in front of him/her well. So forgive me; Lucene is a very powerful 
API/library (see, I understand what Lucene is ;) ) and I get lost in the 
new search terminology confronting me. Couple this with a new desire to 
tackle Struts, and, well, havoc ensues.

Many thanks for your help and answers.
Tim


RE: about search and update one index simultaneously

2004-05-19 Thread David Townsend
There is no problem with updating and searching simultaneously.  Two threads updating 
simultaneously on the same index on NFS can be a problem, as the locking does not work 
reliably.  Have a look through the archives for NFS, there are some solutions 
scattered about.

David

-Original Message-
From: xuemei li [mailto:[EMAIL PROTECTED]
Sent: 18 May 2004 23:01
To: [EMAIL PROTECTED]
Subject: about search and update one index simultaneously


Hi, all,

Can we do search and update one index simultaneously? Does anyone know
anything about it? I have done some experiments, and currently the
search is blocked while the index is being updated. The error on the
search node looks like this:

caught a class java.io.IOException
with message: Stale NFS file handle

Thanks

Xuemei Li




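David's point is that Lucene copes fine with one updater plus concurrent searchers; it is concurrent updaters over NFS that break, because Lucene's lock files are unreliable there. One pragmatic approach, assuming all index access can be funneled through a single JVM, is to do the mutual exclusion in the application instead. A minimal sketch with a plain `java.util.concurrent` read-write lock follows; the class and an integer counter are hypothetical stand-ins, not Lucene API:

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical wrapper: all index access in one JVM goes through this class,
// so Lucene's on-disk lock files (unreliable on NFS) are never contended.
public class GuardedIndex {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private int version = 0; // stand-in for the on-disk index state

    // Every update goes through this single method.
    public void update() {
        lock.writeLock().lock();
        try {
            version++; // stand-in for IndexWriter work
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Searches run concurrently with each other, but never overlap an update.
    public int search() {
        lock.readLock().lock();
        try {
            return version; // stand-in for IndexSearcher work
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        GuardedIndex index = new GuardedIndex();
        index.update();
        index.update();
        System.out.println(index.search()); // prints 2
    }
}
```

This only helps when a single process owns the index; updaters in separate processes or on separate machines still need external coordination, as discussed in the archives.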



RE: SELECTIVE Indexing

2004-05-19 Thread Karthik N S
Hey Lucene Users

My original intention for indexing was to
index certain portions of the HTML [not the whole document];
if JTidy does not support this, what are my options?

Karthik

-Original Message-
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 19, 2004 1:43 PM
To: 'Lucene Users List'
Subject: RE: SELECTIVE Indexing


I doubt if it can be used as a plug in.
Would be good to know if it can be used as a plug in.

Regards,
Kiran.

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 12:30
To: Lucene Users List
Subject: RE: SELECTIVE Indexing


Hi

Can I Use TIDY [as plug in ] with Lucene ...


with regards
Karthik

-Original Message-
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Monday, May 17, 2004 3:27 PM
To: 'Lucene Users List'
Subject: RE: SELECTIVE Indexing



Try using Tidy.
Creates a Document of the html and allows you to apply xpath. Hope this
helps.

Kiran.

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 11:59
To: Lucene Users List
Subject: SELECTIVE Indexing



Hi all

   Can somebody tell me how to index a CERTAIN PORTION OF THE HTML FILE only?

   ex:-

   

 


with regards
Karthik




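For the "index only certain portions of the HTML" question in the thread above, one alternative to JTidy is the JDK's own Swing HTML parser, which can collect just the text inside a chosen tag before handing it to Lucene. This is a sketch, not JTidy or Lucene API; the class and method names are mine, and whitespace handling is deliberately naive:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Collects only the text that appears inside a chosen tag.
public class PortionExtractor extends HTMLEditorKit.ParserCallback {
    private final HTML.Tag target;
    private int depth = 0;
    private final StringBuilder out = new StringBuilder();

    PortionExtractor(HTML.Tag target) {
        this.target = target;
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == target) depth++;
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == target && depth > 0) depth--;
    }

    public void handleText(char[] data, int pos) {
        if (depth > 0) out.append(data); // keep only text inside the target tag
    }

    static String extract(String html, HTML.Tag tag) throws IOException {
        PortionExtractor extractor = new PortionExtractor(tag);
        new ParserDelegator().parse(new StringReader(html), extractor, true);
        return extractor.out.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Index me</h1><p>Skip this</p></body></html>";
        System.out.println(extract(html, HTML.Tag.H1));
    }
}
```

The extracted string can then be stored in a Lucene `Field` like any other text, so only the selected portion ends up in the index.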



RE: SELECTIVE Indexing

2004-05-19 Thread Viparthi, Kiran (AFIS)
I doubt if it can be used as a plug in.
Would be good to know if it can be used as a plug in.

Regards,
Kiran.




org.apache.lucene.search.highlight.Highlighter

2004-05-19 Thread Karthik N S



Hey Guys

I found the Highlighter package in the CVS directory and
was investigating it, but hit a compile-time error.
Could somebody please tell me what this means?

The code:

private IndexReader reader = null;
private Highlighter highlighter = null;

public SearchFiles() {
}

public void searchIndex0(String srchkey, String pathfile) throws Exception {
    IndexSearcher searcher = new IndexSearcher(pathfile);
    Query query = QueryParser.parse(srchkey, "bookid", analyzer);
    query = query.rewrite(reader); // required to expand search terms
    Hits hits = searcher.search(query);
    highlighter = new Highlighter(this, new QueryScorer(query));
    for (int i = 0; i < hits.length(); i++) {
        String text = hits.doc(i).get("bookid");
        TokenStream tokenStream = analyzer.tokenStream("bookid", new StringReader(text));
        // Get 3 best fragments and separate them with a "..."
        String result = highlighter.getBestFragments(tokenStream, text, 3, "...");
        System.out.println(result);
    }
}

The error:

src\org\apache\lucene\search\higlight\SearchFiles.java:46: cannot resolve symbol
symbol  : constructor Highlighter (com.controlnet.higlight.SearchFiles, com.controlnet.higlight.QueryScorer)
location: class org.apache.lucene.search.highlight.Highlighter
    highlighter = new Highlighter(this, new QueryScorer(query));

Also, the URL referred to in the lucene-dev archives is not available for proper documentation:
http://home.clara.net/markharwood/lucene/highlight.htm

WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]