Lucene cuts the search results ?

2005-02-15 Thread Pierre VANNIER
 Hi all,
I'm quite a newbie with Lucene, but I bought Lucene in Action and I'm
trying to customize a few examples taken from there.

I have this sample JSP code (bad JSP, because I'm also a JSP newbie :-)).
Here's the code:

<%@ page import="java.util.*, java.io.*,
                 org.apache.lucene.analysis.*, org.apache.lucene.index.*,
                 org.apache.lucene.search.*, org.apache.lucene.search.highlight.*" %>
<%-- vIndexDir (a collection of index directory paths) and queryString
     are assumed to be set earlier in the page --%>
<html>
<head></head>
<body>
<%
long start = new Date().getTime();
Iterator myIterator = vIndexDir.iterator();
while (myIterator.hasNext())
{
    // read the index directory once per iteration and reuse it below
    String indexDir = (String) myIterator.next();
    IndexSearcher searcher = new IndexSearcher(indexDir);
    Query query = new TermQuery(new Term("introduction", queryString));
    Hits hits = searcher.search(query);
    QueryScorer scorer = new QueryScorer(query);
    Highlighter highlighter = new Highlighter(scorer);
%>
<table width="70%" cellpadding="2" cellspacing="2">
<%
    out.println("<tr><td><hr/><br/>NUMBER OF MATCHING NEWS FOR \"" +
                indexDir + "\" -- " + hits.length() + "</td></tr>");
    for (int i = 0; i < hits.length(); i++)
    {
        String introduction = hits.doc(i).get("introduction");
        TokenStream stream = new SimpleAnalyzer().tokenStream("introduction",
                new StringReader(introduction));
        String fragment = highlighter.getBestFragment(stream, introduction);
        String pubDate = hits.doc(i).get("pubDate").substring(0,
                hits.doc(i).get("pubDate").length() - 13);
        String link = hits.doc(i).get("link");
        float score = hits.score(i);
        String title = hits.doc(i).get("title");
%>
<tr>
 <td>
  Scoring : <b><%= score %></b><br/>
  <%= pubDate +
      "<a href=\"#\" onClick=\"window.open('" +
      link + "', 'news', 'width=760;height=600')\">" +
      title +
      "</a>"
  %>
  <br/>
  <%= fragment %>
  <br/><br/>
 </td>
</tr>
<% } %>
</table>
<%
}
long end = new Date().getTime();
long interval = end - start;
%>
<br/><br/><div align="right"><b>System time for query : <%= interval %>
milliseconds</b></div>

</body>
</html>
---
The output is all right, but at the end of the result page the last
hit is cut off, for example:

Scoring : 0.9210043
Fri, 28 Jan 2005
-
I'm running all this on Tomcat 5.0.28 and a fresh nightly build of
Lucene.

So, could it be a caching problem? Could this come from JSP or from Lucene?
Thanks, and I apologise for my poor English ;-)
Pierre VANNIER


Lucene on PersonalJava ?? HELP!

2005-02-15 Thread Karl Koch
Hi, 

did anybody here run Lucene 1.3 or 1.2 under PersonalJava (equivalent to JDK
1.1)? I have a friend who runs Lucene 1.3 under PersonalJava and it works.
Mine doesn't. When comparing the code I cannot find any difference. I
search the index for a Query.

I get an error saying that the method java.io.File.createNewFile() is used
in Lucene. I have checked Java 1.1.8 and indeed this method does not exist.

Besides the question of how it can work on my friend's system with the same
code, I am asking two more questions:

1) Did anybody here use Lucene on a PDA under PersonalJava and can share
some experience?

2) Is there anything else I should try, or something I have forgotten?

Thanks for your help,
Karl




Re: Lucene on PersonalJava ?? HELP!

2005-02-15 Thread Miles Barr
On Tue, 2005-02-15 at 14:05 +0100, Karl Koch wrote:
 did anybody here run Lucene 1.3 or 1.2 under PersonalJava (equivalent to JDK
 1.1) ? I have a friend who runs Lucene 1.3 under PersonalJava and it works.
 Mine doesn't. When comparing the code I cannot find any difference. I
 search the index for a Query. 
 
 I get an error saying that the method java.io.File.createNewFile() is used
 in Lucene. I have checked Java 1.1.8 and indeed this method does not exist.
 
 Besides the question of how it can work on my friend's system with the same
 code, I am asking two more questions:
 
 1) Did anybody here use Lucene on a PDA under Personal Java and can tell
 some experience?
 
 2) Is there anything else I should try or something I have forgotten?

It might be the constructor of the IndexReader or IndexSearcher that
you're using. You can pass in a string that points to the directory or a
file object instead. Lucene might be using
java.io.File.createNewFile() if you pass in a string. 

A simple grep should find out where it's being used.
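
For what it's worth, a minimal sketch of the two constructor variants
(the paths are placeholders, and whether the Directory-based constructor
really avoids createNewFile() on PersonalJava is an assumption you would
have to verify):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearcherConstructors {
    public static void main(String[] args) throws IOException {
        // String constructor: Lucene resolves the path internally.
        IndexSearcher fromPath = new IndexSearcher("/path/to/index");
        fromPath.close();

        // Directory constructor: open the directory yourself and hand it over
        // ("false" means open an existing index rather than create a new one).
        Directory dir = FSDirectory.getDirectory("/path/to/index", false);
        IndexSearcher fromDir = new IndexSearcher(dir);
        fromDir.close();
    }
}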



-- 
Miles Barr [EMAIL PROTECTED]
Runtime Collective Ltd.





Re: Lucene cuts the search results ?

2005-02-15 Thread Daniel Naber
On Tuesday 15 February 2005 09:39, Pierre VANNIER wrote:

          String fragment = highlighter.getBestFragment(stream,
 introduction);

The highlighter breaks up the text into same-size chunks (100 characters by 
default). If the matching term appears right at the end or at the start of 
such a chunk, you'll get no context and it looks as if the text was cut off.
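
If the default chunk size is the problem, the fragment size can be raised.
A minimal sketch, assuming the sandbox highlighter's SimpleFragmenter (the
value 250 is just an example, and the field name is taken from Pierre's
code):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;

public class BiggerFragments {
    // Returns the best fragment of "text" using 250-character chunks
    // instead of the highlighter's 100-character default.
    static String bestFragment(Query query, String text) throws IOException {
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        highlighter.setTextFragmenter(new SimpleFragmenter(250));
        TokenStream stream = new SimpleAnalyzer().tokenStream("introduction",
                new StringReader(text));
        return highlighter.getBestFragment(stream, text);
    }
}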

Regards
 Daniel




Re: Lucene cuts the search results ?

2005-02-15 Thread Pierre VANNIER
Thanks for the reply, Daniel.
But is there anything that can be done to avoid this?
Regards
Daniel Naber wrote:
On Tuesday 15 February 2005 09:39, Pierre VANNIER wrote:
 

String fragment = highlighter.getBestFragment(stream,
introduction);
   

The highlighter breaks up the text into same-size chunks (100 characters by 
default). If the matching term appears right at the end or at the start of 
such a chunk, you'll get no context and it looks as if the text was cut off.

Regards
Daniel
 




Re: Lucene on PersonalJava ?? HELP!

2005-02-15 Thread Karl Koch
Hello,

thank you for the tip. I have solved the problem in a different way. If
anybody else wants to run Lucene on PJava, he/she might go for the same.

I am using the cvm VM instead of the jeode VM. It then works fine with
Lucene 1.2 without any change in my code or in the Lucene code. Perhaps it
even works with a newer version (but I haven't tested that yet). :-)

Thank you anyway,
Karl

 On Tue, 2005-02-15 at 14:05 +0100, Karl Koch wrote:
  did anybody here run Lucene 1.3 or 1.2 under PersonalJava (equivalent to
 JDK
  1.1) ? I have a friend who runs Lucene 1.3 under PersonalJava and it
 works.
  Mine doesn't. When comparing the code I cannot find any difference. I
  search the index for a Query. 
  
  I get an error saying that the method java.io.File.createNewFile() is
 used
  in Lucene. I have checked Java 1.1.8 and indeed this method does not
 exist.
  
  Besides the question of how it can work on my friend's system with the same
  code, I am asking two more questions:
  
  1) Did anybody here use Lucene on a PDA under Personal Java and can tell
  some experience?
  
  2) Is there anything else I should try or something I have forgotten?
 
 It might be the constructor of the IndexReader or IndexSearcher that
 you're using. You can pass in a string that points to the directory or a
 file object instead. Lucene might be using
 java.io.File.createNewFile() if you pass in a string. 
 
 A simple grep should find out where it's being used.
 
 
 
 -- 
 Miles Barr [EMAIL PROTECTED]
 Runtime Collective Ltd.
 
 
 




Re: Lucene cuts the search results ?

2005-02-15 Thread markharw00d
Hi Pierre,
Here's the response I gave the last time this question was raised:

The highlighter uses a number of pluggable services, one of which is the
choice of Fragmenter implementation. This interface is for classes which
decide the boundaries where the original text is cut into snippets. The
default implementation simply breaks the text up into evenly sized chunks.
A more intelligent implementation could be made to detect sentence
boundaries.

What you are asking for requires that the Fragmenter know where the
upcoming query matches are and decide on fragment boundaries with this in
mind. To have this foresight would require a preliminary pass over the
TokenStream to identify the match points before calling the highlighter.

This Fragmenter implementation does not exist, but it does not sound
unachievable. I would suggest that some knowledge of sentence boundaries
would probably help here too. I don't have any plans to write such a
Fragmenter now, but this is how it could be done.
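
For the sake of illustration, a very rough skeleton of a sentence-aware
Fragmenter (assuming the sandbox highlighter's Fragmenter interface with
its start()/isNewFragment() methods; untested, and it ignores fragment
length and match positions entirely):

import org.apache.lucene.analysis.Token;
import org.apache.lucene.search.highlight.Fragmenter;

public class SentenceFragmenter implements Fragmenter {
    private String text;
    private int lastEnd = 0;

    public void start(String originalText) {
        this.text = originalText;
        this.lastEnd = 0;
    }

    public boolean isNewFragment(Token token) {
        // Start a new fragment whenever the gap between the previous token
        // and this one contains a sentence-ending character.
        String gap = text.substring(lastEnd, token.startOffset());
        lastEnd = token.endOffset();
        return gap.indexOf('.') >= 0 || gap.indexOf('!') >= 0
                || gap.indexOf('?') >= 0;
    }
}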
Hope this helps,
Cheers,
Mark



Fieldinformation from Index

2005-02-15 Thread Karl Koch
Hello,

I have two questions which might be easy to answer for a Lucene expert:

1) I need to know which fields a collection of Documents has (given that
not all documents necessarily use all fields). These Documents are all
stored in one index. Is there a way (with Lucene 1.2 or 1.3) to find this
out without going through each document and retrieving it?

2) I need to know which Analyzer was used to index a field. One important
rule, as we all know, is to use the same analyzer for indexing and searching
a field. Is this information stored in the index, or is it the full
responsibility of the application developer?

Karl




Re: Opening up one large index takes 940M of memory?

2005-02-15 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
Is there any way to reduce this footprint?  The index is fully 
optimized... I'm willing to take a performance hit if necessary.  Is 
this documented anywhere?

You can increase TermInfosWriter.indexInterval.  You'll need to 
re-write the .tii file for this to take effect.  The simplest way to 
do this is to use IndexWriter.addIndexes(), adding your index to a 
new, empty, directory.  This will of course take a while for a 60GB 
index...

(Note... when this works I'll note my findings in a wiki page for future 
developers)

Two more questions:
1. Do I have to do this with a NEW directory?  Our nightly index merger 
uses an existing target index, which I assume will re-use the same 
settings as before.  I did this last night and it still seems to use the 
same amount of memory.  Above you assert that I should use a new, empty 
directory, and I'll try that tonight.

2. This isn't destructive, is it?  I mean, I'll be able to move BACK to a 
TermInfosWriter.indexInterval of 128, right?

Thanks!
Kevin
--
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Opening up one large index takes 940M of memory?

2005-02-15 Thread Doug Cutting
Kevin A. Burton wrote:
1.  Do I have to do this with a NEW directory?  Our nightly index merger 
uses an existing target index which I assume will re-use the same 
settings as before?  I did this last night and it still seems to use the 
same amount of memory.  Above you assert that I should use a new empty 
directory and I'll try that tonight.
You need to re-write the entire index using a modified 
TermInfosWriter.java.  Optimize rewrites the entire index but is 
destructive.  Merging into a new, empty directory is a non-destructive 
way to do this.
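
As a rough sketch of that merge (this assumes a Lucene build whose
TermInfosWriter.java has already been modified with the larger
indexInterval; the paths and analyzer are placeholders):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RewriteIndex {
    public static void main(String[] args) throws IOException {
        // Source: the existing index.  Target: a brand-new, empty directory.
        Directory source = FSDirectory.getDirectory("/indexes/current", false);
        Directory target = FSDirectory.getDirectory("/indexes/rewritten", true);

        // "true" creates a new index in the target; the analyzer does not
        // matter for a pure merge since no text is analyzed here.
        IndexWriter writer = new IndexWriter(target, new StandardAnalyzer(), true);
        writer.addIndexes(new Directory[] { source });  // rewrites all files, including .tii
        writer.close();
    }
}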

2. This isn't destructive is it?  I mean I'll be able to move BACK to a 
TermInfosWriter.indexInterval of 128 right?
Yes, you can go back if you re-optimize or re-merge again.
Also, there's no need to CC my personal email address.
Doug


Re: Fieldinformation from Index

2005-02-15 Thread Erik Hatcher
On Feb 15, 2005, at 11:45 AM, Karl Koch wrote:
2) I need to know which Analyzer was used to index a field. One 
important
rule, as we all know, is to use the same analyzer for indexing and 
searching
a field. Is this information stored in the index or in full 
responsibility of
the application developer?
The analyzer is not stored in the index, nor is its name.  I believe this 
was discussed in the past, though.

It's not a rule that the same analyzer be used for both indexing and 
searching, and there are cases where it makes sense to use different 
ones.  The analyzers must be compatible, though.
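
Purely as an illustration (the analyzers and field below are my own
example, not something prescribed by Lucene): indexing and searching can
use different analyzers as long as they produce matching tokens, e.g.
both lowercase:

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class CompatibleAnalyzers {
    public static void main(String[] args) throws Exception {
        // Index with StandardAnalyzer (lowercases, drops stop words)...
        IndexWriter writer = new IndexWriter("/tmp/demo-index",
                new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("contents", "The Quick Brown Fox"));
        writer.addDocument(doc);
        writer.close();

        // ...but parse queries with SimpleAnalyzer.  Both lowercase their
        // tokens, so a query for "Fox" still matches the indexed term "fox".
        Query q = QueryParser.parse("Fox", "contents", new SimpleAnalyzer());
        System.out.println(q);
    }
}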

Erik


Re: Multiple Keywords/Keyphrases fields

2005-02-15 Thread Owen Densmore
From: Erik Hatcher [EMAIL PROTECTED]
Date: February 12, 2005 3:09:15 PM MST
To: Lucene Users List lucene-user@jakarta.apache.org
Subject: Re: Multiple Keywords/Keyphrases fields
The real question to answer is what types of queries you're planning 
on making.  Rather than look at it from indexing forward, consider it 
from searching backwards.

How will users query using those keyword phrases?
Hi Erik.  Good point.
There are two uses we are making of the keyphrases:
	- Graphical Navigation: A Flash graphical browser will allow users to 
fly around in a space of documents, choosing what to view: Authors, 
Keyphrases and Textual terms.  In each of these cases, the closeness of 
the fields will govern how close they appear graphically.  In the case 
of Authors, we will weight collaboration, i.e. how often the authors 
work together.  In the case of Keyphrases, we will want to use something 
like the distance vectors you show in the book, using the cosine 
measure.  Thus the keyphrases need to be separate entities within the 
document; it would be a bug for us if terms leaked across the separate 
keyphrases within the document.

	- Textual Search: In this case, we will have two ways to search the 
keyphrases.  The first would be like the graphical navigation above, 
where searching for "complex system" should require the terms to be in 
a single keyphrase.  The second way will be looser, where we may simply 
pool the keyphrases with the titles and abstracts, and allow them all to 
be searched together within the document.

Does this make sense?  So the question from the search standpoint is: 
do multiple instances of a field act as if there are barriers across the 
instances, or are they somehow treated as a single instance?  
In terms of the closeness calculation, for example, can we get separate 
term vectors for each instance of the keyphrase field, or will we get a 
single vector combining all the keyphrase terms within a single 
document?

I hope this is clear!  Kinda hard to articulate.
Owen
Erik
On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote:
I'm getting a bit more serious about the final form of our lucene 
index.  Each document has DocNumber, Authors, Title, Abstract, and 
Keywords.  By Keywords, I mean a comma separated list, each entry 
having possibly many terms in a phrase like:
	temporal infomax, finite state automata, Markov chains,
	conditional entropy, neural information processing

I presume I should be using a field "Keywords" which has many 
entries or instances per document (one per comma-separated 
phrase).  But I'm not sure of the right way to handle all this.  My 
assumption is that I should analyze them individually, just as we do 
for free text (the Abstract, for example), thus in the example above 
having 5 entries of the nature
	doc.add(Field.Text("Keywords", "finite state automata"));
etc., analyzing them because these are author-supplied strings with no 
canonical form.
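
A small sketch of that indexing approach (the splitting helper is my own;
the field name and phrases come from the example above):

import java.util.StringTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class KeywordFields {
    // Adds one analyzed "Keywords" field instance per comma-separated phrase.
    static void addKeywords(Document doc, String commaSeparated) {
        StringTokenizer phrases = new StringTokenizer(commaSeparated, ",");
        while (phrases.hasMoreTokens()) {
            doc.add(Field.Text("Keywords", phrases.nextToken().trim()));
        }
    }

    public static void main(String[] args) {
        Document doc = new Document();
        addKeywords(doc, "temporal infomax, finite state automata, "
                + "Markov chains, conditional entropy, neural information processing");
        System.out.println(doc);
    }
}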

For guidance, I looked in the archive and found the attached email, 
but I didn't see the answer.  (I'm not concerned about the dups; I 
presume that is equivalent to a boost of some sort.)  Does this seem 
right?

Thanks once again.
Owen
From: [EMAIL PROTECTED] [EMAIL PROTECTED]
Subject: Multiple equal Fields?
Date: Tue, 17 Feb 2004 12:47:58 +0100
Hi!
What happens if I do this:
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "blah"));
Is there a field "foo" with value "blah", or are there two "foo"s 
(actually not
possible), or is there one "foo" with the values "bar" and "blah"?

And what does happen in this case:
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
Does Lucene store this only once?
Timo

