Field information from Index

2005-02-15 Thread Karl Koch
Hello,

I have two questions which might be easy to answer from a Lucene expert:

1) I need to know which fields a collection of Documents has (given that
not all documents necessarily use all fields). These Documents are
all stored in one index. Is there a way (with Lucene 1.2 or 1.3) to find out
without going through each document and retrieving it?

2) I need to know which Analyzer was used to index a field. One important
rule, as we all know, is to use the same analyzer for indexing and searching
a field. Is this information stored in the index, or is it entirely the
responsibility of the application developer?
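For question 1, some 1.x releases expose the distinct field names directly on IndexReader, so no per-document retrieval is needed. A minimal sketch against that API (the index path is a placeholder, and getFieldNames() may not exist in every 1.2/1.3 build, so verify it against your jar's javadoc first):

```java
import java.util.Collection;
import java.util.Iterator;
import org.apache.lucene.index.IndexReader;

public class ListFields {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index"); // placeholder path
        // Asks the index itself for its distinct field names,
        // instead of walking every stored document.
        Collection fields = reader.getFieldNames();
        for (Iterator it = fields.iterator(); it.hasNext();) {
            System.out.println((String) it.next());
        }
        reader.close();
    }
}
```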

Karl

-- 
DSL Komplett von GMX +++ Supergünstig und stressfrei einsteigen!
AKTION "Kein Einrichtungspreis" nutzen: http://www.gmx.net/de/go/dsl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene on PersonalJava ?? HELP!

2005-02-15 Thread Karl Koch
Hello,

thank you for the tip. I have solved the problem in a different way; if
anybody else wants to run Lucene on PJava, they might go for the same.

I am using the cvm VM instead of the jeode VM. With it, Lucene 1.2 works
fine without any change to my code or to the Lucene code. Perhaps a newer
version works as well (but I haven't tested that yet). :-)

Thank you anyway,
Karl

> On Tue, 2005-02-15 at 14:05 +0100, Karl Koch wrote:
> > did anybody here run Lucene 1.3 or 1.2 under PersonalJava (equivalent to
> > JDK 1.1)? I have a friend who runs Lucene 1.3 under PersonalJava and it
> > works.
> > Mine doesn't. When comparing the code I cannot find any difference. I
> > search the index for a Query.
> > 
> > I get an error saying that the method java.io.File.createNewFile() is
> > used in Lucene. I have checked Java 1.1.8 and indeed this method does
> > not exist.
> > 
> > Besides the question of how it can work on my friend's system with the
> > same code, I am asking two more questions:
> > 
> > 1) Did anybody here use Lucene on a PDA under PersonalJava and can share
> > their experience?
> > 
> > 2) Is there anything else I should try or something I have forgotten?
> 
> It might be the constructor of the IndexReader or IndexSearcher that
> you're using. You can pass in a string that points to the directory or a
> file object instead. Lucene might end up using
> java.io.File.createNewFile() if you pass in a string.
> 
> A simple grep should find out where it's being used.
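The suggestion above amounts to resolving the Directory yourself rather than handing Lucene a String path. A sketch against the 1.x API (the path is a placeholder; whether this actually avoids createNewFile() depends on the Lucene version, so grep the source to confirm):

```java
import java.io.File;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class OpenWithDirectory {
    public static IndexSearcher open() throws Exception {
        // Open the directory explicitly instead of passing a String path,
        // so Lucene's String-based convenience code path is never taken.
        File indexDir = new File("/path/to/index"); // placeholder
        FSDirectory dir = FSDirectory.getDirectory(indexDir, false); // false: do not create
        return new IndexSearcher(dir);
    }
}
```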
> 
> 
> 
> -- 
> Miles Barr <[EMAIL PROTECTED]>
> Runtime Collective Ltd.
> 
> 




Lucene on PersonalJava ?? HELP!

2005-02-15 Thread Karl Koch
Hi, 

did anybody here run Lucene 1.3 or 1.2 under PersonalJava (equivalent to JDK
1.1)? I have a friend who runs Lucene 1.3 under PersonalJava and it works.
Mine doesn't. When comparing the code I cannot find any difference. I
search the index for a Query.

I get an error saying that the method java.io.File.createNewFile() is used
in Lucene. I have checked Java 1.1.8 and indeed this method does not exist.

Besides the question of how it can work on my friend's system with the same
code, I am asking two more questions:

1) Did anybody here use Lucene on a PDA under PersonalJava and can share
their experience?

2) Is there anything else I should try or something I have forgotten?

Thanks for your help,
Karl




Re: HELP! JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread Karl Koch
I have a colleague who uses Lucene 1.3 on PersonalJava (equivalent to Java
1.1.8). I can't find a significant difference from his code (still
searching), but he did not make any changes. He also did not recompile
Lucene 1.3 under 1.1.8, etc.

It must be something simple. I will look for that switch...

In the meantime, I am thankful for any other help.

Cheers,
Karl

> On Tuesday 08 February 2005 18:49, sergiu gordea wrote:
> > Karl Koch wrote:
> ...
> > >>A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has
> > >>occurred in : 
> > >>  'org/apache/lucene/store/FSDirectory.getDirectory
> > >>(Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting
> > >>method.
> > >>  Please report this error in detail to
> > >>http://java.sun.com/cgi-bin/bugreport.cgi
> 
> Iirc, Java 1.1 had a switch to turn off JIT compilation. It did slow things
> down when I was using 1.1 (1.1.8?), but it might help you now...
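On Sun's JDK 1.1.x the JIT could typically be disabled via the `java.compiler` property or the `JAVA_COMPILER` environment variable; the exact spelling varied by VM vendor, so check your VM's documentation. The class name below is a placeholder:

```shell
# Disable the JIT so the offending method is interpreted instead of compiled:
JAVA_COMPILER=NONE java MySearchApp      # via environment variable
java -Djava.compiler=NONE MySearchApp    # via system property
```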
> 
> Regards,
> Paul Elschot
> 
> 




HELP! JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread Karl Koch
When I switch to Java 1.2, I cannot run it either, and I cannot index
anything. I have no idea why...

Can somebody help me?

Karl

> Hello all,
> 
> I have heard that Lucene 1.3 Final should run under Java 1.1. (I need that
> because I want to run a search with a PDA using Java 1.1).
> 
> However, when I run my code, I get the following error:
> 
> --
> 
> A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has
> occurred in : 
>   'org/apache/lucene/store/FSDirectory.getDirectory
> (Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting
> method.
>   Please report this error in detail to
> http://java.sun.com/cgi-bin/bugreport.cgi
> 
> Exception occured in StandardSearch:search(String, String[], String)!
> java.lang.IllegalMonitorStateException: current thread not owner
>   at org.apache.lucene.store.FSDirectory.makeLock(FSDirectory.java:312)
>   at org.apache.lucene.index.IndexReader.open(IndexReader.java, Compiled
> Code)
> 
> --
> 
> The error does not occur when I run it under Java 1.4.
> 
> What am I doing wrong, and what do I need to change to make it work? It
> must be my code. Here is the code relevant to this error (the search method).
> 
> 
> public static Result search(String queryString, String[] searchFields, 
>   String indexDirectory) {
>   // create access to index
>   StandardAnalyzer analyser = new StandardAnalyzer();
>   Hits hits = null;
>   Result result = null;
>   try {
>   fsDirectory = 
> FSDirectory.getDirectory(StandardSearcher.indexDirectory, false);
>   IndexSearcher searcher = new IndexSearcher(fsDirectory);
>   ...
> }
> 
> 
> What is wrong here?
> 
> Best Regards,
> Karl
> 




JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread Karl Koch
Hello all,

I have heard that Lucene 1.3 Final should run under Java 1.1. (I need that
because I want to run a search with a PDA using Java 1.1).

However, when I run my code, I get the following error:

--

A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has
occurred in : 
  'org/apache/lucene/store/FSDirectory.getDirectory
(Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting
method.
  Please report this error in detail to
http://java.sun.com/cgi-bin/bugreport.cgi

Exception occured in StandardSearch:search(String, String[], String)!
java.lang.IllegalMonitorStateException: current thread not owner
at org.apache.lucene.store.FSDirectory.makeLock(FSDirectory.java:312)
at org.apache.lucene.index.IndexReader.open(IndexReader.java, Compiled
Code)

--

The error does not occur when I run it under Java 1.4.

What am I doing wrong, and what do I need to change to make it work? It
must be my code. Here is the code relevant to this error (the search method).


public static Result search(String queryString, String[] searchFields, 
  String indexDirectory) {
  // create access to index
  StandardAnalyzer analyser = new StandardAnalyzer();
  Hits hits = null;
  Result result = null;
  try {
  fsDirectory = 
FSDirectory.getDirectory(StandardSearcher.indexDirectory, false);
  IndexSearcher searcher = new IndexSearcher(fsDirectory);
  ...
}


What is wrong here?

Best Regards,
Karl




Retrieve all documents - possible?

2005-02-07 Thread Karl Koch
Hi,

is it possible to retrieve ALL documents from a Lucene index? Strictly
speaking, this would then not be a search...
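One query-free approach, sketched against the Lucene 1.x IndexReader API, is to iterate the document numbers from 0 up to maxDoc() and skip deleted slots (the path and the "id" field name are placeholders):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DumpAll {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index"); // placeholder
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) continue; // skip deleted document slots
            Document doc = reader.document(i);
            System.out.println(doc.get("id")); // "id" is a hypothetical stored field
        }
        reader.close();
    }
}
```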

Karl




Re: Some questions about index...

2005-02-05 Thread Karl Koch
Thank you for the fast, straight, and useful comments. Keeping in mind what
was said, did anybody actually think about implementing a kind of database
layer on top of a Lucene index? A database would be an index, columns would
be fields, and entries would be documents. At least anything that requires
only a single table could be done. A SELECT would be a search...

:-)

Karl

> 
> On Feb 5, 2005, at 10:04 AM, Karl Koch wrote:
> > 1) Can I store all the information of the text file, but also apply a
> > analyser. E.g. I use the StopAnalyzer. After finding the document, I 
> > want to
> > extract the original text also from the index. Does this require that I
> > store the information twice in two different fields (one indexed and 
> > one
> > unindexed) ?
> 
> You should use a single stored, tokenized, and indexed field for this 
> purpose.  Be cautious of how you construct the Field object to achieve 
> this.
> 
> > 2) I would like to extract information from the index which can found 
> > in a
> > boolean way. I know that Lucene is a VSM which provides Boolean 
> > operators.
> > This however does not change its functioning. For example, I have a 
> > field
> > with contains an ID number and I want to use the search like a database
> > operatation (e.g. to find the document with id=1). I can solve the 
> > problem
> > by searching with query "id:1". However, this does not ensure that I 
> > will
> > only get one result. Usually the first result is the document I want. 
> > But it
> > could happen, that this sometimes does not work.
> 
> Why wouldn't it work?  For ID-type fields, use a Field.Keyword (stored, 
> indexed, but not tokenized).  Search for a specific ID using a 
> TermQuery (don't use QueryParser for this, please).  If the ID values 
> are unique, you'll either get zero or one result.
> 
> >  What happens if I should
> > get no results? I guess if I search for id=5 and 5 did not exist I 
> > would
> > probably get 50, 51, .. just because the contain 5. Did somebody work 
> > with
> > this and can suggest a stable solution?
> 
> No, this would not be the case, unless you're analyzing the ID field 
> with some strange character-by-character analyzer or doing a wildcard 
> "*5*" type query.
> 
> > A good solution for these two questions would help me avoiding a 
> > database
> > which would need to replicate most the data which I already have in my
> > Lucene index...
> 
> You're on the right track and avoiding a database when it is overkill 
> or duplicative is commendable :)
> 
>   Erik
> 
> 




Some questions about index...

2005-02-05 Thread Karl Koch
Hi all,

for simplicity reasons I would like to use the index as my data storage
whilst using the advantage of the highly optimised Lucene index structure.

1) Can I store all the information of the text file but also apply an
analyzer? E.g., I use the StopAnalyzer. After finding the document, I want
to extract the original text from the index as well. Does this require
storing the information twice in two different fields (one indexed and one
unindexed)?
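One way to cover this with a single field in the Lucene 1.x API is the Field.Text(String, String) helper, which stores, indexes, and tokenizes the value in one go; Field.Keyword does the same without tokenizing. Field names and values below are illustrative:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BuildDoc {
    public static Document build(String originalText, String id) {
        Document doc = new Document();
        // Stored + indexed + tokenized: searchable through the analyzer,
        // and the original text can be read back from the index later.
        doc.add(Field.Text("contents", originalText));
        // Stored + indexed, NOT tokenized: suitable for exact ID lookups.
        doc.add(Field.Keyword("id", id));
        return doc;
    }
}
```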

2) I would like to extract information from the index in a boolean way. I
know that Lucene is a VSM which provides boolean operators; this, however,
does not change how it functions. For example, I have a field which contains
an ID number, and I want to use search like a database operation (e.g. to
find the document with id=1). I can solve the problem by searching with the
query "id:1". However, this does not ensure that I will get only one result.
Usually the first result is the document I want, but it could happen that
this sometimes does not work. What happens if I get no results? I guess if I
search for id=5 and 5 did not exist, I would probably get 50, 51, ... just
because they contain 5. Did somebody work with this and can suggest a stable
solution?
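For exact ID lookups of this kind, a TermQuery against an untokenized keyword field bypasses both the analyzer and QueryParser, so "5" can never match "50" or "51"; with unique IDs it returns zero or one hit. A sketch against the Lucene 1.x API (the path and field name are placeholders):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class LookupById {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index"); // placeholder
        // Exact term match on a Field.Keyword-style field -- no parsing, no analysis.
        Hits hits = searcher.search(new TermQuery(new Term("id", "5")));
        System.out.println(hits.length()); // 0 if absent, 1 if the ID is unique
        searcher.close();
    }
}
```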

A good solution for these two questions would help me avoid a database
which would need to replicate most of the data which I already have in my
Lucene index...

Kind Regards,
Karl





Re: which HTML parser is better?

2005-02-04 Thread Karl Koch
The link does not work.

> 
> One which we've been using can be found at:
> http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/
> 
> We absolutely need to be able to recover gracefully from malformed
> HTML and/or SGML.  Most of the nicer SAX/DOM/TLA parsers out there
> failed this criterion when we started our effort.  The above one is
> kind of SAX-y but doesn't fall over at the sight of a real web page
> ;-)
> 
> Ian
> 
> 




Re: which HTML parser is better? - Thread closed

2005-02-03 Thread Karl Koch
Thank you, I will do that.

> Karl Koch wrote:
> 
> >I appologise in advance, if some of my writing here has been said before.
> >The last three answers to my question have been suggesting pattern
> matching
> >solutions and Swing. Pattern matching was introduced in Java 1.4 and
> Swing
> >is something I cannot use since I work with Java 1.1 on a PDA.
> >  
> >
> I see,
> 
> In this case you can read your HTML file line by line and then write 
> something like this:
> 
> String line;
> int startPos, endPos;
> StringBuffer text = new StringBuffer();
> while ((line = reader.readLine()) != null) {
>     startPos = line.indexOf(">");
>     endPos = line.indexOf("<", startPos);
>     if (startPos >= 0 && endPos > startPos)
>         text.append(line.substring(startPos + 1, endPos));
> }
> 
> This is just sample code; it works when the text you want sits between 
> tags on a single line of the HTML file.
> It can be a starting point for you.
> 
>   Hope it helps,
> 
>  Best,
> 
>  Sergiu
> 
> >I am wondering if somebody knows a piece of simple sourcecode with low
> >requirement which is running under this tense specification.
> >
> >Thank you all,
> >Karl
> >
> >  
> >
> >>No one has yet mentioned using ParserDelegator and ParserCallback that 
> >>are part of HTMLEditorKit in Swing.  I have been successfully using 
> >>these classes to parse out the text of an HTML file.  You just need to 
> >>extend HTMLEditorKit.ParserCallback and override the various methods 
> >>that are called when different tags are encountered.
> >>
> >>
> >>On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
> >>
> >>
> >>
> >>>Three HTML parsers(Lucene web application
> >>>demo,CyberNeko HTML Parser,JTidy) are mentioned in
> >>>Lucene FAQ
> >>>1.3.27.Which is the best?Can it filter tags that are
> >>>auto-created by MS-word 'Save As HTML files' function?
> >>>  
> >>>
> >>-- 
> >>Bill Tschumy
> >>Otherwise -- Austin, TX
> >>http://www.otherwise.com
> >>
> >>
> >>-
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >>
> >
> >  
> >
> 
> 




Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
I am using Java 1.1 on a Sharp Zaurus PDA and have very tight memory
constraints. I do not think CPU performance is a big issue, though. But
other parts of my application use quite a lot of memory, and it sometimes
runs short. I am therefore not looking at solutions which build up tag
trees etc., but rather at a solution that reads a stream of HTML and
transforms it into a stream of text.

I see your point about using an external program, but I am not entirely
sure one is available. Also, it would be much simpler to have a 3-5 kB
solution in Java, perhaps encapsulated in a class which does the job
without needing advanced libraries that take 100-200 KB of my internal
storage.
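A stream-oriented stripper of the kind described here fits in one small class, needs only java.io, and avoids both regexes and Swing, so it should be Java 1.1-friendly. A sketch (it naively treats every '<'...'>' span as a tag, which is acceptable for well-formed, generated HTML; the class name is made up):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class TagStripper {
    // Copies everything outside <...> from the reader into the result, streaming.
    public static String strip(Reader in) throws IOException {
        StringBuffer out = new StringBuffer();
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') inTag = true;          // entering a tag
            else if (c == '>') inTag = false;    // leaving a tag
            else if (!inTag) out.append((char) c); // ordinary text
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(strip(new StringReader(html)));
    }
}
```

Because it reads character by character from a Reader, memory use stays flat regardless of document size.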

I hope this clarifies my situation.

Cheers,
Karl 

> Karl Koch wrote:
> 
> >Hello Sergiu,
> >
> >thank you for your help so far. I appreciate it.
> >
> >I am working with Java 1.1 which does not include regular expressions.
> >  
> >
> Why are you using Java 1.1? Are you so limited in resources?
> What operating system do you use?
> I assume that you just need to index the HTML files, so you need an 
> html2txt conversion.
> If an external converter is a solution for you, you can use
> Runtime.getRuntime().exec(...) to run the converter that will extract the 
> text from your HTMLs
> and generate a .txt file. Then you can use a reader to index the txt.
> 
> As I told you before, the best solution depends on your constraints 
> (time, effort, hardware, performance) and requirements :)
> 
>   Best,
> 
>   Sergiu
> 
> >Your turn ;-)
> >Karl 
> >
> >  
> >
> >>Karl Koch wrote:
> >>
> >>
> >>
> >>>I am in control of the html, which means it is well formated HTML. I
> use
> >>>only HTML files which I have transformed from XML. No external HTML
> (e.g.
> >>>the web).
> >>>
> >>>Are there any very-short solutions for that?
> >>> 
> >>>
> >>>  
> >>>
> >>if you are using only correctly formatted HTML pages and you are in
> >>control of these pages,
> >>you can use a regular expression to remove the tags,
> >>
> >>something like
> >>replaceAll("<[^>]*>", "");
> >>
> >>This is the idea behind the operation. If you search on Google you
> >>will find a more robust
> >>regular expression.
> >>
> >>Using a simple regular expression will be a very cheap solution, that 
> >>can cause you a lot of problems in the future.
> >> 
> >> It's up to you to use it 
> >>
> >> Best,
> >> 
> >> Sergiu
> >>
> >>
> >>
> >>>Karl
> >>>
> >>> 
> >>>
> >>>  
> >>>
> >>>>Karl Koch wrote:
> >>>>
> >>>>   
> >>>>
> >>>>
> >>>>
> >>>>>Hi,
> >>>>>
> >>>>>yes, but the library your are using is quite big. I was thinking that
> a
> >>>>> 
> >>>>>
> >>>>>  
> >>>>>
> >>>>5kB
> >>>>   
> >>>>
> >>>>
> >>>>
> >>>>>code could actually do that. That sourceforge project is doing much
> >>>>>  
> >>>>>
> >>more
> >>
> >>
> >>>>>than that but I do not need it.
> >>>>>
> >>>>>
> >>>>> 
> >>>>>
> >>>>>  
> >>>>>
> >>>>you need just the htmlparser.jar 200k.
> >>>>... you know ... the functionality is strongly correclated with the
> >>>>
> >>>>
> >>size.
> >>
> >>
> >>>> You can use 3 lines of code with a good regular expresion to
> eliminate
> >>>>the html tags,
> >>>>but this won't give you any guarantie that the text from the bad 
> >>>>fromated html files will be
> >>>>correctly extracted...
> >>>>
> >>>> Best,
> >>>>
> >>>> Sergiu
> >>>>
> >>>>   
> >>>>
> >>>>
> >>>>
> >>>>>Karl
> >>>>>
> >>>>>
>

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
I apologise in advance if some of my writing here has been said before.
The last three answers to my question have suggested pattern-matching
solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing
is something I cannot use since I work with Java 1.1 on a PDA.

I am wondering if somebody knows a piece of simple source code with low
requirements which runs under this tight specification.

Thank you all,
Karl

> No one has yet mentioned using ParserDelegator and ParserCallback that 
> are part of HTMLEditorKit in Swing.  I have been successfully using 
> these classes to parse out the text of an HTML file.  You just need to 
> extend HTMLEditorKit.ParserCallback and override the various methods 
> that are called when different tags are encountered.
> 
> 
> On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
> 
> > Three HTML parsers(Lucene web application
> > demo,CyberNeko HTML Parser,JTidy) are mentioned in
> > Lucene FAQ
> > 1.3.27.Which is the best?Can it filter tags that are
> > auto-created by MS-word 'Save As HTML files' function?
> -- 
> Bill Tschumy
> Otherwise -- Austin, TX
> http://www.otherwise.com
> 
> 




Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
Unfortunately, I am faithful ;-). For practical reasons I want to do this
in a single class, or even a single method, called by another part of my
Java application. It should also run on Java 1.1, and it should be small and
simple. As I said before, I am in control of the HTML and it will be well
formatted, because I generate it from XML using XSLT.

Karl

> If you are not married to Java:
> http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm
> 
> Otis
> 
> --- sergiu gordea <[EMAIL PROTECTED]> wrote:
> 
> > Karl Koch wrote:
> > 
> > >I am in control of the html, which means it is well formated HTML. I
> > use
> > >only HTML files which I have transformed from XML. No external HTML
> > (e.g.
> > >the web).
> > >
> > >Are there any very-short solutions for that?
> > >  
> > >
> > if you are using only correctly formatted HTML pages and you are in
> > control of these pages,
> > you can use a regular expression to remove the tags.
> > 
> > something like
> > replaceAll("<[^>]*>", "");
> > 
> > This is the idea behind the operation. If you search on Google
> > you 
> > will find a more robust
> > regular expression.
> > 
> > Using a simple regular expression will be a very cheap solution, that
> > 
> > can cause you a lot of problems in the future.
> >  
> >  It's up to you to use it 
> > 
> >  Best,
> >  
> >  Sergiu
> > 
> > >Karl
> > >
> > >  
> > >
> > >>Karl Koch wrote:
> > >>
> > >>
> > >>
> > >>>Hi,
> > >>>
> > >>>yes, but the library your are using is quite big. I was thinking
> > that a
> > >>>  
> > >>>
> > >>5kB
> > >>
> > >>
> > >>>code could actually do that. That sourceforge project is doing
> > much more
> > >>>than that but I do not need it.
> > >>> 
> > >>>
> > >>>  
> > >>>
> > >>you need just the htmlparser.jar 200k.
> > >>... you know ... the functionality is strongly correclated with the
> > size.
> > >>
> > >>  You can use 3 lines of code with a good regular expresion to
> > eliminate 
> > >>the html tags,
> > >>but this won't give you any guarantie that the text from the bad 
> > >>fromated html files will be
> > >>correctly extracted...
> > >>
> > >>  Best,
> > >>
> > >>  Sergiu
> > >>
> > >>
> > >>
> > >>>Karl
> > >>>
> > >>> 
> > >>>
> > >>>  
> > >>>
> > >>>> Hi Karl,
> > >>>>
> > >>>>I already submitted a peace of code that removes the html tags.
> > >>>>Search for my previous answer in this thread.
> > >>>>
> > >>>> Best,
> > >>>>
> > >>>>  Sergiu
> > >>>>
> > >>>>Karl Koch wrote:
> > >>>>
> > >>>>   
> > >>>>
> > >>>>
> > >>>>
> > >>>>>Hello,
> > >>>>>
> > >>>>>I have  been following this thread and have another question. 
> > >>>>>
> > >>>>>Is there a piece of sourcecode (which is preferably very short
> > and
> > >>>>>  
> > >>>>>
> > >>simple
> > >>
> > >>
> > >>>>>(KISS)) which allows to remove all HTML tags from HTML content?
> > HTML
> > >>>>>  
> > >>>>>
> > >>3.2
> > >>
> > >>
> > >>>>>would be enough...also no frames, CSS, etc. 
> > >>>>>
> > >>>>>I do not need to have the HTML strucutre tree or any other
> > structure
> > >>>>>  
> > >>>>>
> > >>but
> > >>
> > >>
> > >>>>>need a facility to clean up HTML into its normal underlying
> > content
> > >>>>> 
> > >>>>>
> > >>>>>  
> > >>>>>

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
Hello Sergiu,

thank you for your help so far. I appreciate it.

I am working with Java 1.1, which does not include regular expressions.

Your turn ;-)
Karl 

> Karl Koch wrote:
> 
> >I am in control of the html, which means it is well formated HTML. I use
> >only HTML files which I have transformed from XML. No external HTML (e.g.
> >the web).
> >
> >Are there any very-short solutions for that?
> >  
> >
> if you are using only correctly formatted HTML pages and you are in control 
> of these pages,
> you can use a regular expression to remove the tags.
> 
> something like
> replaceAll("<[^>]*>", "");
> 
> This is the idea behind the operation. If you search on Google you 
> will find a more robust
> regular expression.
> 
> Using a simple regular expression will be a very cheap solution, that 
> can cause you a lot of problems in the future.
>  
>  It's up to you to use it 
> 
>  Best,
>  
>  Sergiu
> 
> >Karl
> >
> >  
> >
> >>Karl Koch wrote:
> >>
> >>
> >>
> >>>Hi,
> >>>
> >>>yes, but the library your are using is quite big. I was thinking that a
> >>>  
> >>>
> >>5kB
> >>
> >>
> >>>code could actually do that. That sourceforge project is doing much
> more
> >>>than that but I do not need it.
> >>> 
> >>>
> >>>  
> >>>
> >>you need just the htmlparser.jar 200k.
> >>... you know ... the functionality is strongly correclated with the
> size.
> >>
> >>  You can use 3 lines of code with a good regular expresion to eliminate
> >>the html tags,
> >>but this won't give you any guarantie that the text from the bad 
> >>fromated html files will be
> >>correctly extracted...
> >>
> >>  Best,
> >>
> >>  Sergiu
> >>
> >>
> >>
> >>>Karl
> >>>
> >>> 
> >>>
> >>>  
> >>>
> >>>> Hi Karl,
> >>>>
> >>>>I already submitted a peace of code that removes the html tags.
> >>>>Search for my previous answer in this thread.
> >>>>
> >>>> Best,
> >>>>
> >>>>  Sergiu
> >>>>
> >>>>Karl Koch wrote:
> >>>>
> >>>>   
> >>>>
> >>>>
> >>>>
> >>>>>Hello,
> >>>>>
> >>>>>I have  been following this thread and have another question. 
> >>>>>
> >>>>>Is there a piece of sourcecode (which is preferably very short and
> >>>>>  
> >>>>>
> >>simple
> >>
> >>
> >>>>>(KISS)) which allows to remove all HTML tags from HTML content? HTML
> >>>>>  
> >>>>>
> >>3.2
> >>
> >>
> >>>>>would be enough...also no frames, CSS, etc. 
> >>>>>
> >>>>>I do not need to have the HTML strucutre tree or any other structure
> >>>>>  
> >>>>>
> >>but
> >>
> >>
> >>>>>need a facility to clean up HTML into its normal underlying content
> >>>>> 
> >>>>>
> >>>>>  
> >>>>>
> >>>>before
> >>>>   
> >>>>
> >>>>
> >>>>
> >>>>>indexing that content as a whole.
> >>>>>
> >>>>>Karl
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> 
> >>>>>
> >>>>>  
> >>>>>
> >>>>>>I think that depends on what you want to do.  The Lucene demo parser
> >>>>>>   
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>does
> >>>>   
> >>>>
> >>>>
> >>>>
> >>>>>>simple mapping of HTML files into Lucene Documents; it does not give
> >>>>>>
> >>>>>>
> >>you
> >>
> >>
> >>>>>>   
> >>>>>>
> >

Re: which HTML parser is better?

2005-02-02 Thread Karl Koch
I am in control of the HTML, which means it is well-formatted HTML. I use
only HTML files which I have transformed from XML. No external HTML (e.g.
from the web).

Are there any very-short solutions for that?

Karl

> Karl Koch wrote:
> 
> >Hi,
> >
> >yes, but the library your are using is quite big. I was thinking that a
> 5kB
> >code could actually do that. That sourceforge project is doing much more
> >than that but I do not need it.
> >  
> >
> you need just the htmlparser.jar, 200k.
> ... you know ... the functionality is strongly correlated with the size.
> 
>   You can use 3 lines of code with a good regular expression to eliminate 
> the HTML tags,
> but this won't give you any guarantee that the text from badly 
> formatted HTML files will be
> correctly extracted...
> 
>   Best,
> 
>   Sergiu
> 
> >Karl
> >
> >  
> >
> >>  Hi Karl,
> >>
> >> I already submitted a peace of code that removes the html tags.
> >> Search for my previous answer in this thread.
> >>
> >>  Best,
> >>
> >>   Sergiu
> >>
> >> Karl Koch wrote:
> >>
> >>> Hello,
> >>>
> >>> I have been following this thread and have another question.
> >>>
> >>> Is there a piece of source code (preferably very short and simple
> >>> (KISS)) that removes all HTML tags from HTML content? HTML 3.2
> >>> would be enough... also no frames, CSS, etc.
> >>>
> >>> I do not need the HTML structure tree or any other structure, but
> >>> need a facility to clean HTML down to its underlying text content
> >>> before indexing that content as a whole.
> >>>
> >>> Karl
> >>>
> >>>> I think that depends on what you want to do.  The Lucene demo parser
> >>>> does simple mapping of HTML files into Lucene Documents; it does not
> >>>> give you a parse tree for the HTML doc.  CyberNeko is an extension of
> >>>> Xerces (uses the same API; will likely become part of Xerces), and so
> >>>> maps an HTML document into a full DOM that you can manipulate easily
> >>>> for a wide range of purposes.  I haven't used JTidy at an API level
> >>>> and so don't know it as well -- based on its UI, it appears to be
> >>>> focused primarily on HTML validation and error detection/correction.
> >>>>
> >>>> I use CyberNeko for a range of operations on HTML documents that go
> >>>> beyond indexing them in Lucene, and really like it.  It has been
> >>>> robust for me so far.
> >>>>
> >>>> Chuck
> >>>>
> >>>> > -Original Message-
> >>>> > From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> >>>> > Sent: Tuesday, February 01, 2005 1:15 AM
> >>>> > To: lucene-user@jakarta.apache.org
> >>>> > Subject: which HTML parser is better?
> >>>> >
> >>>> > Three HTML parsers (Lucene web application demo, CyberNeko HTML
> >>>> > Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
> >>>> > best? Can it filter tags that are auto-created by MS Word's
> >>>> > 'Save As HTML' function?

-- 
GMX im TV ... Die Gedanken sind frei ... Schon gesehen?
Jetzt Spot online ansehen: http://www.gmx.net/de/go/tv-spot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: which HTML parser is better?

2005-02-02 Thread Karl Koch
Hi,

yes, but the library you are using is quite big. I was thinking that about
5 kB of code could actually do that. That SourceForge project does much more
than that, but I do not need it.

Karl
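In the KISS spirit asked for here, a tag stripper needs only a screenful of plain Java. The following is a sketch, not a real parser: it ignores `<script>`/`<style>` bodies, comments, and `<` inside attribute values, which is usually acceptable for plain HTML 3.2:

```java
// Minimal HTML tag stripper: drops everything between '<' and '>' and
// replaces each tag with a space so adjacent words stay separated.
class TagStripper {

    static String strip(String html) {
        StringBuffer out = new StringBuffer(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;           // start of a tag: stop copying
            } else if (c == '>') {
                inTag = false;          // end of a tag
                out.append(' ');        // keep words separated
            } else if (!inTag) {
                out.append(c);          // ordinary content character
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<html><body><b>Hello</b> world</body></html>"));
    }
}
```

The resulting plain text can then be indexed as a single field.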

>   Hi Karl,
> 
>  I already submitted a piece of code that removes the HTML tags.
>  Search for my previous answer in this thread.
> 
>   Best,
> 
>    Sergiu
> 
> Karl Koch wrote:
> 
> >Hello,
> >
> >I have  been following this thread and have another question. 
> >
> >Is there a piece of source code (preferably very short and simple
> >(KISS)) that removes all HTML tags from HTML content? HTML 3.2
> >would be enough... also no frames, CSS, etc. 
> >
> >I do not need the HTML structure tree or any other structure, but
> >need a facility to clean HTML down to its underlying text content
> >before indexing that content as a whole.
> >
> >Karl
> >
> >
> >  
> >
> >>I think that depends on what you want to do.  The Lucene demo parser does
> >>simple mapping of HTML files into Lucene Documents; it does not give you a
> >>parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
> >>same API; will likely become part of Xerces), and so maps an HTML document
> >>into a full DOM that you can manipulate easily for a wide range of
> >>purposes.  I haven't used JTidy at an API level and so don't know it as
> >>well -- based on its UI, it appears to be focused primarily on HTML
> >>validation and error detection/correction.
> >>
> >>I use CyberNeko for a range of operations on HTML documents that go beyond
> >>indexing them in Lucene, and really like it.  It has been robust for me so
> >>far.
> >>
> >>Chuck
> >>
> >>  > -Original Message-
> >>  > From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> >>  > Sent: Tuesday, February 01, 2005 1:15 AM
> >>  > To: lucene-user@jakarta.apache.org
> >>  > Subject: which HTML parser is better?
> >>  > 
> >>  > Three HTML parsers(Lucene web application
> >>  > demo,CyberNeko HTML Parser,JTidy) are mentioned in
> >>  > Lucene FAQ
> >>  > 1.3.27.Which is the best?Can it filter tags that are
> >>  > auto-created by MS-word 'Save As HTML files' function?
> >>  > 
> >>  > _
> >>  > Do You Yahoo!?
> >>  > http://music.yisou.com/
> >>  > http://image.yisou.com
> >>  > http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/
> >>  > 
> >>  >
> >>
> >>
> >>-
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >>
> >
> >  
> >
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-- 
10 GB Mailbox, 100 FreeSMS http://www.gmx.net/de/go/topmail
+++ GMX - die erste Adresse für Mail, Message, More +++




RE: which HTML parser is better?

2005-02-02 Thread Karl Koch
Hello,

I have been following this thread and have another question. 

Is there a piece of source code (preferably very short and simple
(KISS)) that removes all HTML tags from HTML content? HTML 3.2
would be enough... also no frames, CSS, etc. 

I do not need the HTML structure tree or any other structure, but need a
facility to clean HTML down to its underlying text content before
indexing that content as a whole.

Karl


> I think that depends on what you want to do.  The Lucene demo parser does
> simple mapping of HTML files into Lucene Documents; it does not give you a
> parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
> same API; will likely become part of Xerces), and so maps an HTML document
> into a full DOM that you can manipulate easily for a wide range of
> purposes.  I haven't used JTidy at an API level and so don't know it as
> well -- based on its UI, it appears to be focused primarily on HTML
> validation and error detection/correction.
> 
> I use CyberNeko for a range of operations on HTML documents that go beyond
> indexing them in Lucene, and really like it.  It has been robust for me so
> far.
> 
> Chuck
> 
>   > -Original Message-
>   > From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
>   > Sent: Tuesday, February 01, 2005 1:15 AM
>   > To: lucene-user@jakarta.apache.org
>   > Subject: which HTML parser is better?
>   > 
>   > Three HTML parsers(Lucene web application
>   > demo,CyberNeko HTML Parser,JTidy) are mentioned in
>   > Lucene FAQ
>   > 1.3.27.Which is the best?Can it filter tags that are
>   > auto-created by MS-word 'Save As HTML files' function?
>   > 
>   > _
>   > Do You Yahoo!?
>   > http://music.yisou.com/
>   > http://image.yisou.com
>   > http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/
>   > 
>   > -
>   > To unsubscribe, e-mail: [EMAIL PROTECTED]
>   > For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-- 
GMX im TV ... Die Gedanken sind frei ... Schon gesehen?
Jetzt Spot online ansehen: http://www.gmx.net/de/go/tv-spot
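A minimal sketch of the CyberNeko approach Chuck describes above: parse the HTML into a DOM and collect the text nodes. The class names follow NekoHTML's published API (`org.cyberneko.html.parsers.DOMParser`); verify them against the version you download:

```java
import java.io.FileReader;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Parse an HTML file with NekoHTML and print its concatenated text content.
public class NekoText {

    // Depth-first walk appending every text node to the buffer.
    static void collectText(Node n, StringBuffer out) {
        if (n.getNodeType() == Node.TEXT_NODE) {
            out.append(n.getNodeValue()).append(' ');
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }

    public static void main(String[] args) throws Exception {
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new FileReader(args[0])));
        Document doc = parser.getDocument();
        StringBuffer text = new StringBuffer();
        collectText(doc, text);
        System.out.println(text.toString());  // plain text, ready for indexing
    }
}
```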




XML index

2005-01-27 Thread Karl Koch
Hi,

I want to use kXML with Lucene to index XML files. I think it is possible to
dynamically assign node names as Document fields and node text as the field
content (after running it through an Analyzer). 

I have seen some XML indexing in the Sandbox. Has anybody here done
something with a thin pull parser (perhaps even kXML)? Does anybody know of
a project or some source code available that covers this topic?

Karl
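The basic element-name-to-field mapping can be sketched as below, assuming the XmlPull API (which kXML 2 implements) and the Lucene 1.2/1.3 `Field.Text(...)` factory; adjust the class names to the versions you actually use:

```java
import java.io.FileReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.kxml2.io.KXmlParser;
import org.xmlpull.v1.XmlPullParser;

// Map each XML element name to a Lucene field; element text becomes the
// field value (analyzed at index time by whatever Analyzer the writer uses).
public class XmlToDocument {

    public static Document parse(String file) throws Exception {
        XmlPullParser p = new KXmlParser();
        p.setInput(new FileReader(file));
        Document doc = new Document();
        String field = null;
        for (int ev = p.getEventType(); ev != XmlPullParser.END_DOCUMENT; ev = p.next()) {
            if (ev == XmlPullParser.START_TAG) {
                field = p.getName();                  // element name -> field name
            } else if (ev == XmlPullParser.TEXT && field != null) {
                String text = p.getText().trim();
                if (text.length() > 0) {
                    doc.add(Field.Text(field, text)); // stored, indexed, tokenized
                }
            }
        }
        return doc;
    }
}
```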

 

-- 
Sparen beginnt mit GMX DSL: http://www.gmx.net/de/go/dsl




Different Documents (with fields) in one index?

2005-01-27 Thread Karl Koch
Hello all,

perhaps not such a sophisticated question: 

I would like to have a very diverse set of documents in one index. Depending
on the content of the text documents, I would like to put parts of the text
into different fields. This means that when searching a particular field,
some of those documents won't be matched at all.

Is it possible to have different kinds of Documents with different index
fields in ONE index? Or do I need one index for each set?

Karl
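Lucene has no fixed schema: each Document simply carries whatever fields it needs, and documents with different field sets can live in one index. A sketch against the Lucene 1.2/1.3 API (the path `/tmp/mixed` and the field names are made up for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Two "kinds" of documents with different fields, stored in one index.
public class MixedIndex {

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/mixed", new StandardAnalyzer(), true);

        Document article = new Document();
        article.add(Field.Keyword("type", "article"));
        article.add(Field.Text("abstract", "An abstract field only articles have."));

        Document memo = new Document();
        memo.add(Field.Keyword("type", "memo"));
        memo.add(Field.Text("body", "A body field only memos have."));

        writer.addDocument(article);   // a search on "abstract:..." will simply
        writer.addDocument(memo);      // never match memo documents
        writer.optimize();
        writer.close();
    }
}
```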

-- 
10 GB Mailbox, 100 FreeSMS http://www.gmx.net/de/go/topmail
+++ GMX - die erste Adresse für Mail, Message, More +++




Lucene on JSE 1.1.8...

2005-01-25 Thread Karl Koch
Hello,

does somebody here know which Lucene version runs on Java 1.1? Of course, I
would like to run the best (latest) version possible :-)

Karl

-- 
10 GB Mailbox, 100 FreeSMS http://www.gmx.net/de/go/topmail
+++ GMX - die erste Adresse für Mail, Message, More +++




VSpace Model Index <-> Prob. Model Index - Difference?

2004-03-19 Thread Karl Koch
Hello group,

coming back to the discussion about the probabilistic and vector space
models (which occurred here some time ago), I would like to ask something
related.

I only know the index structure Lucene offers. Does an IR system based on
the probabilistic model (e.g. Okapi) have an index that looks different from
a VS model's? If yes, why? 

I hope this question is not too stupid. I am mainly interested because of
some theoretical background...

Karl

> Uh, there are lots of ways to construct an inverted index.
> Citeseer will give you more than you can read on this topic.
> 
> As for Lucene, see File Formats section on the site.
> 
> Otis
> 
> --- Karl Koch <[EMAIL PROTECTED]> wrote:
> > If I create an standard index, what does Lucene store in this index?
> > 
> > What should be stored in an index at least? Just a link to the file
> > and
> > keywords? Or also wordnumbers? What else?
> > 
> > Does somebody know a paper which discusses this problem of "what to
> > put in
> > an good universal IR index" ?
> > 
> > Cheers,
> > Karl
> > 
> > -- 
> > +++ NEU bei GMX und erstmalig in Deutschland: TÜV-geprüfter
> > Virenschutz +++
> > 100% Virenerkennung nach Wildlist. Infos:
> > http://www.gmx.net/virenschutz
> > 
> > 
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-- 
+++ NEU bei GMX und erstmalig in Deutschland: TÜV-geprüfter Virenschutz +++
100% Virenerkennung nach Wildlist. Infos: http://www.gmx.net/virenschutz





Lucene index - information

2004-03-19 Thread Karl Koch
If I create a standard index, what does Lucene store in this index?

What should be stored in an index at least? Just a link to the file and
keywords? Or also word numbers? What else?

Does somebody know a paper which discusses this problem of "what to put in
a good universal IR index"?

Cheers,
Karl

-- 
+++ NEU bei GMX und erstmalig in Deutschland: TÜV-geprüfter Virenschutz +++
100% Virenerkennung nach Wildlist. Infos: http://www.gmx.net/virenschutz





Search one keyword in two fields - How?

2004-03-02 Thread Karl Koch
Hi,

What do I need to do to search for one single keyword in two fields of one
index? Can someone provide an easy example or tell me where to find it?

Cheers,
Karl
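One way, sketched against the Lucene 1.2/1.3 `BooleanQuery.add(query, required, prohibited)` signature (the index path and field names are placeholders):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Search one keyword in two fields by OR-ing two TermQuerys.
public class TwoFieldSearch {

    public static void main(String[] args) throws Exception {
        String keyword = "lucene";
        BooleanQuery query = new BooleanQuery();
        // required=false, prohibited=false: a match in either field suffices
        query.add(new TermQuery(new Term("title", keyword)), false, false);
        query.add(new TermQuery(new Term("body", keyword)), false, false);

        IndexSearcher searcher = new IndexSearcher("/tmp/index");
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hits");
        searcher.close();
    }
}
```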

-- 
+++ NEU bei GMX und erstmalig in Deutschland: TÜV-geprüfter Virenschutz +++
100% Virenerkennung nach Wildlist. Infos: http://www.gmx.net/virenschutz





Re: setMaxClauseCount ??

2004-01-21 Thread Karl Koch
Hello Doug,

that sounds interesting to me. I am referring to a paper by NIST about
relevance feedback which ran tests with 20-200 words. This is why I thought
it might be good to use all non-stopwords of a document for that and see
what happens. Do you know good papers about strategies for selecting
keywords effectively, beyond the scope of stopword lists and stemming?

Using the term frequencies of the document is not really possible, since
Lucene does not provide access to a document vector, does it?

By the way, could you send me Dmitry's code for the vector extension?
I have asked in another thread but have not received it so far. I really
would like to have a look... It would also be nice to know the status of
integrating it into Lucene 1.3. Who is working on it, and how could I
contribute?

Cheers,
Karl


> Andrzej Bialecki wrote:
> > Karl Koch wrote:
> >> I actually wanted to add a large amount of text from an existing 
> >> document to
> >> find a close related one. Can you suggest another good way of doing 
> >> this.
> >
> > You should try to reduce the dimensionality by reducing the number of 
> > unique features. In this case, you could for example use only keywords 
> > (or key phrases) instead of the full content of documents.
> 
> Indeed, this is a good approach.  In my experience, six or eight terms 
> are usually enough, and they needn't all be required.
> 
> Doug
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Re: setMaxClauseCount ??

2004-01-21 Thread Karl Koch
Hi Doug,

thank you for the answer so far.

I actually wanted to add a large amount of text from an existing document to
find a closely related one. Can you suggest a good way of doing this? A
direct match will not occur anyway. How can I build a query that behaves as
much like the Vector Space Model (VSM) as possible (each word one dimension;
find documents close to that vector)? You know as well as I do that the
standard VSM has no Boolean logic inside... how do I need to formulate the
query to make it as similar as possible to a vector, in order to find
similar documents in the vector space of the Lucene index?

Cheers,
Karl
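One way to approximate this, sketched with the Lucene 1.2/1.3 API: analyze the document's text and add every resulting term as an optional (not required) clause, so that scoring alone ranks the results, much like cosine similarity in the VSM. The field name "contents" is an assumption:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Build a "VSM-ish" query: every term of the source text as an optional clause.
public class VsmishQuery {

    public static BooleanQuery fromText(String text) throws Exception {
        BooleanQuery query = new BooleanQuery();
        TokenStream ts = new StandardAnalyzer().tokenStream("contents",
                new StringReader(text));
        Token token;
        while ((token = ts.next()) != null) {
            // required=false, prohibited=false: the clause only contributes
            // to the score; no single term is mandatory
            query.add(new TermQuery(new Term("contents", token.termText())),
                      false, false);
        }
        return query;
    }
}
```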

> setMaxClauseCount determines the maximum number of clauses, which is not 
> your problem here.  Your problem is with required clauses.  There may 
> only be a total of 31 required (or prohibited) clauses in a single 
> BooleanQuery.  If you need more, then create more BooleanQueries and 
> combine them with another BooleanQuery.  Perhaps this could be done 
> automatically, but I've never heard anyone encounter this limit before. 
>   Do you really mean for 32 different terms to be required?  Do any 
> documents actually match this query?
> 
> Doug
> 
> Karl Koch wrote:
> > Hi group,
> > 
> > I run over a IndexOutOfBoundsException:
> > 
> > -> java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
> > clauses in query.
> > 
> > The reason: I have more then 32 BooleanCauses. From the Mailinglist I
> got
> > the info how to set the maxiumum number of clauses higher before a loop:
> > 
> > ...
> > myBooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
> > while (true){
> >   Token token = tokenStream.next();
> >   if (token == null) {
> > break;
> >   }
> >   myBooleanQuery.add(new TermQuery(new Term("bla", token.termText())),
> true,
> > false);
> > } ... 
> > 
> > However the error still remains, why?
> > 
> > Karl
> > 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
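Doug's workaround above might be sketched like this, assuming the Lucene 1.2/1.3 `add(query, required, prohibited)` signature and grouping required clauses in batches of at most 31:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Require more than 31 terms by nesting: each sub-query holds at most 31
// required clauses, and each sub-query is itself required at the top level.
public class NestedRequired {

    public static BooleanQuery requireAll(String field, String[] terms) {
        BooleanQuery top = new BooleanQuery();
        BooleanQuery group = new BooleanQuery();
        int inGroup = 0;
        for (int i = 0; i < terms.length; i++) {
            group.add(new TermQuery(new Term(field, terms[i])), true, false);
            if (++inGroup == 31) {              // stay under the 32 limit
                top.add(group, true, false);    // the whole group is required
                group = new BooleanQuery();
                inGroup = 0;
            }
        }
        if (inGroup > 0) {
            top.add(group, true, false);        // remainder group
        }
        return top;
    }
}
```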

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Re: How to limit terms in the index

2004-01-19 Thread Karl Koch
Hello,

I think you have to write your own Analyzer that filters out all other words
before passing the text to the indexer. 

karl
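Such an Analyzer might be sketched as below, written against the Lucene 1.3/1.4 TokenFilter API (older versions assign the public `input` field instead of calling `super`); `KeepWordFilter` is hand-rolled here, not a Lucene class:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Keeps only tokens whose text appears in the ontology's term set.
class KeepWordFilter extends TokenFilter {
    private final Set keep;

    KeepWordFilter(TokenStream in, Set keep) {
        super(in);          // on Lucene 1.2/1.3, set the 'input' field instead
        this.keep = keep;
    }

    public Token next() throws IOException {
        for (Token t = input.next(); t != null; t = input.next()) {
            if (keep.contains(t.termText())) return t;  // drop everything else
        }
        return null;
    }
}

public class OntologyAnalyzer extends Analyzer {
    private final Set ontologyTerms;

    public OntologyAnalyzer(Set ontologyTerms) { this.ontologyTerms = ontologyTerms; }

    public TokenStream tokenStream(String field, Reader reader) {
        return new KeepWordFilter(new LowerCaseTokenizer(reader), ontologyTerms);
    }
}
```

The ontology terms would be loaded into the Set (lowercased, to match the tokenizer) before indexing.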



> Hi,
> I'm using Lucene to index documents and I want just to limit
> the terms indexed by a list of terms provided by an Ontology.
> May someone help me to know how can I do that ?
> 
> Thanks,
> GD
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Re: difference in javadoc and faq similarity expression

2004-01-18 Thread Karl Koch
I would rely on the JavaDoc, since it is up to date. The latest version,
1.3 final, is just a few weeks old. Some entries in the FAQ, however, are
still from 2001...

Cheers,
Karl

> hy,
> i have trouble finding the correspondence between the javadoc and faq
> similarity expressions
> 
> in the Similarity Javadoc
> 
> score(q,d) =Sum [tf(t in d) * idf(t) * getBoost(t.field in d) *
> lengthNorm(t.field in d)  * coord(q,d) * queryNorm(q) ]
> 
> in the FAQ
> 
> score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t)
> *
> coord_q_d
> 
> In FAQ | In Javadoc
> 1 / norm_q = queryNorm(q)
> 1 / norm_d_t=lengthNorm(t.field in d)
> coord_q_d=coord(q,d)
> boost_t=getBoost(t.field in d)
> idf_t=idf(t)
> tf_d=tf(t in d)
> 
> but
> where is the javadoc expression for "tf_q" faq expression
> 
> nicolas
> 
> - Original Message - 
> From: "Nicolas Maisonneuve" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Sunday, January 18, 2004 9:33 PM
> Subject: Re: theorical informations
> 
> 
> > thanks Karl !
> >
> > - Original Message - 
> > From: "Karl Koch" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Sunday, January 18, 2004 9:22 PM
> > Subject: Re: theorical informations
> >
> >
> > > Actually, finding an answer to this question is not really important.
> > > More important is whether you can do what you want with it. If your
> > > result comes from a prob. model or a vector space model, who cares, if
> > > you just want to give a query and get back a hit list of results?
> > >
> > > Possibly some people here will strongly disagree... ;-) (?)
> > >
> > > Karl
> > >
> > > > Hello Nicolas,
> > > >
> > > > I am sure you mean IR (Information Retrieval) Model. Lucene
> > > > implements a Vector Space Model with an integrated Boolean Model.
> > > > This means the Boolean model is integrated with a Boolean query
> > > > language but mapped into the Vector Space. Therefore you have
> > > > ranking even though the traditional Boolean model does not support
> > > > this. Cosine similarity is used to measure similarity between
> > > > documents and the query. You can find a very long discussion here
> > > > when you search the archive...
> > > >
> > > > Karl
> > > >
> > > > > hy,
> > > > > i have 2 theoretical questions:
> > > > >
> > > > > i searched the mailing list for the R.I. model implemented in
> > > > > Lucene, but found no precise answer.
> > > > >
> > > > > 1) What is the R.I. model implemented in Lucene? (ex: Boolean
> > > > > Model, Vector Model, Probabilistic Model, etc...)
> > > > >
> > > > > 2) What is the theoretical similarity function implemented in
> > > > > Lucene (Euclidean, Cosine, Jaccard, Dice)?
> > > > >
> > > > > (why is this important information not on the Lucene web site
> > > > > or in the faq?)
> > > > -- 
> > > > +++ GMX - die erste Adresse für Mail, Message, More +++
> > > > Bis 31.1.: TopMail + Digicam für nur 29 EUR
> http://www.gmx.net/topmail
> > > >
> > > >
> > > >
> -
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > >
> > >
> > > -- 
> > > +++ GMX - die erste Adresse für Mail, Message, More +++
> > > Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





setMaxClauseCount ??

2004-01-18 Thread Karl Koch
Hi group,

I ran into an IndexOutOfBoundsException:

-> java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
clauses in query.

The reason: I have more than 32 BooleanClauses. From the mailing list I got
the info on how to raise the maximum number of clauses before a loop:

...
myBooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
while (true){
  Token token = tokenStream.next();
  if (token == null) {
break;
  }
  myBooleanQuery.add(new TermQuery(new Term("bla", token.termText())), true,
false);
} ... 

However the error still remains, why?

Karl

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Re: theorical informations

2004-01-18 Thread Karl Koch
Actually, finding an answer to this question is not really important. More
important is whether you can do what you want with it. If your result comes
from a prob. model or a vector space model, who cares, if you just want to
give a query and get back a hit list of results?

Possibly some people here will strongly disagree... ;-) (?)

Karl

> Hello Nicolas,
> 
> I am sure you mean IR (Information Retrieval) Model. Lucene implements a
> Vector Space Model with integrated Boolean Model. This means the Boolean
> model
> is integrated with a Boolean query language but mapped into the Vector
> Space.
> Therefore you have ranking even though the traditional Boolean model does
> not
> support this. Cosine similarity is used to measure similarity between
> documents and the query. You can find this in a very long discussion here
> when you
> search the archive...
> 
> Karl
> 
> > hy , 
> > i have 2  theorycal questions :
> > 
> > i searched in the mailing list the R.I. model implemented in Lucene , 
> > but no precise answer.
> > 
> > 1) What is the R.I model implemented in Lucene ? (ex: Boolean Model, 
> > Vector Model,Probabilist Model, etc... ) 
> > 
> > 2) What is the theory Similarity function  implemented in Lucene 
> > (Euclidian, Cosine, Jaccard, Dice)
> > 
> > (why this important informations is not in the Lucene Web site or in the
> 
> > faq ? )
> > 
> 
> -- 
> +++ GMX - die erste Adresse für Mail, Message, More +++
> Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Re: theorical informations

2004-01-18 Thread Karl Koch
Hello Nicolas,

I am sure you mean IR (Information Retrieval) Model. Lucene implements a
Vector Space Model with an integrated Boolean Model. This means the Boolean
model is integrated with a Boolean query language but mapped into the Vector
Space. Therefore you have ranking even though the traditional Boolean model
does not support this. Cosine similarity is used to measure similarity
between documents and the query. You can find a very long discussion here
when you search the archive...

Karl

> hy, 
> i have 2 theoretical questions:
> 
> i searched the mailing list for the R.I. model implemented in Lucene, 
> but found no precise answer.
> 
> 1) What is the R.I. model implemented in Lucene? (ex: Boolean Model, 
> Vector Model, Probabilistic Model, etc...) 
> 
> 2) What is the theoretical similarity function implemented in Lucene 
> (Euclidean, Cosine, Jaccard, Dice)?
> 
> (why is this important information not on the Lucene web site or in the 
> faq?)
> 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Re: Extracting particular document from index

2004-01-18 Thread Karl Koch
Hi all,

I have only the file's location and name. There is also an index, built from
such files, which contains the file I am looking for. I want to know two
things:
1) first of all, all fields which can be searched within this index;
2) secondly, the content as it is represented in each field. 

As you can see, I am asking for the way back. From the file itself I cannot
infer how it was indexed. The indexer has parsed and divided it somehow, in
a way I do not know. This information should be in the index. I am basically
looking for functionality like this (pseudo-code):

Index index = new Index("X:/myindex");
Field[] fields = index.getAllFields();
String[] fieldNames = new String[fields.length];
for (int i = 0; i < fields.length; i++){
  fieldNames[i] = fields[i].name();
}

Using the field names I then know, I could perform a search and take the
first entry of the Hits list, which would be my Document. :-)

Does something like that exist in Lucene?

Cheers mates,
Karl
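Something close to this can be sketched with the real API: newer IndexReader releases provide `getFieldNames()` (check whether your version has it; otherwise walk `reader.terms()`), and a TermQuery on the filename field fetches the stored document. The path and the filename "report.txt" are made up:

```java
import java.util.Iterator;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// List the index's field names, then fetch one document by filename.
public class IndexInspector {

    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("X:/myindex");
        for (Iterator it = reader.getFieldNames().iterator(); it.hasNext(); ) {
            System.out.println("field: " + it.next());
        }

        IndexSearcher searcher = new IndexSearcher(reader);
        Hits hits = searcher.search(new TermQuery(new Term("filename", "report.txt")));
        if (hits.length() > 0) {
            Document doc = hits.doc(0);   // only *stored* fields come back
            System.out.println(doc.toString());
        }
        reader.close();
    }
}
```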


> On Jan 18, 2004, at 11:15 AM, Karl Koch wrote:
> > lets say I have an index with documents encoded in two fields 
> > "filename" and
> > "data". Is it possible to extract a file from which I know the filename
> > directly from this index without performing any search. Like a random 
> > access like
> > in a filesystem?
> 
> It is still technically a "search", but a TermQuery will be basically 
> direct access to the document(s) matching that term.
> 
>   Erik
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Gettting all index fields of an index

2004-01-18 Thread Karl Koch
How can I get a list of all fields in an index from which I know only the
directory string?

Karl

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Extracting particular document from index

2004-01-18 Thread Karl Koch
Hi all,

let's say I have an index with documents encoded in two fields, "filename"
and "data". Is it possible to extract a file, whose filename I know,
directly from this index without performing any search, like random access
in a filesystem? 

Karl

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Searching for similar documents in Lucene

2004-01-17 Thread Karl Koch
Hello all,

how can I find the document most similar to one given document? Do I need
to perform one search for each field in the Document and merge the resulting
Hits lists? 

Maybe somebody has already done that and can give me hints / maybe an
example?

Cheers and good night,
Karl

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Closing the IndexSearcher object

2004-01-17 Thread Karl Koch
Hi all,

I have a search method that is used by many programs with different queries.
I therefore do not want to close the IndexSearcher object, to allow other
programs to reuse it. Does this have any side effects (e.g. does the
IndexSearcher object contain state information)? Would it be better to
always instantiate a new IndexSearcher object and close it after use? 

Cheers,
Karl


-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Re: IndexReader.document(int i)

2004-01-17 Thread Karl Koch
Hello back,

according to the JavaDoc it means:

"Returns the stored fields of the nth Document in this index."

In your case that would mean: the greater n, the younger the document.
However, I am not sure how you can create such an index. I think you should
have a look at the Luke project, which lets you open and inspect Lucene
indices.

If I did not answer your question, please explain a little bit further...

Karl
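A sketch of iterating by document number with the Lucene 1.2/1.3 API; note that numbers follow indexing order but can be reassigned when deletions are merged away (e.g. on optimize), so they are not a reliable date ordering. The path and the stored field "filename" are made up:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

// Dump every live document by its document number.
public class DumpDocs {

    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/tmp/index");
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) continue;     // skip deleted slots
            Document doc = reader.document(i);     // stored fields only
            System.out.println(i + ": " + doc.get("filename"));
        }
        reader.close();
    }
}
```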




> hy,
> i would like to know  
> in the IndexReader.document(int i)
> what is this number  i ? 
> if the the first document is the oldest document indexed 
> and the last the youngest ? (so we can sort by date  easyly) ?
> 
> thank in advance
> 
> nico 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Re: Relevance Feedback (2)

2004-01-17 Thread Karl Koch
Hello all,

oh, I just found a mail from Doug where he wrote that Dmitry Serebrennikov
has developed something that provides document vector access:

> Dmitry Serebrennikov [EMAIL PROTECTED] has implemented a substantial
> extension to Lucene which should help folks doing this sort of research. It
> provides an explicit vector representation for documents. This way you can,
> e.g., retrieve a number of documents, efficiently sum their vectors, then
> derive a new query from the sum. This code was posted to the list a long
> while back, but is now out of date. As soon as the 1.2 release is final,
> and Dmitry has time, he intends to merge it into Lucene.

Who has this code? Could somebody email it to me? I would highly appreciate
it.

Is there any attempt, by Dmitry or somebody else, to adapt it to Lucene 1.3?


I wish you all a nice weekend,
Karl

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail





Relevance Feedback (2)

2004-01-17 Thread Karl Koch
Hello group,

I would like to implement Relevance Feedback functionality for my system.
>From the privious discussion in this group I know that this is not implemented
in Lucene. 

We all know that Relevance Feedback has two fields, which are 
1) Term Reweighting
2) Query Expansion

I am interesting in doing both of it. 

My first thought was that Term Reweighting can be solved with term boosing
and expansion, well, with basically generation a new query. Looking close to
one of the classic term reweighting formula's (Rocchio) however reveals that I
need access to the term vector of the relevant as well as the term vector of
the non-relevant documents. Bringing this to Lucenen it would mean, that I
need to have the score of each term in the relevant and non-relevant documents
to process the reweigthing formula.

Coming back to Lucene, this would mean that I need to extract Documents from
the Hits object after the search. From this Documents I would need to get
all terms and its scores.

However, Lucene does not provide this. Only Documents and their scores can be
retrieved; there is no access to a Document's terms and therefore no access to
term scores.

Does somebody have ideas for a workaround for Term Reweighting and Query
Expansion that does not go through Hits? Has somebody already produced such
workarounds and could provide them to me?
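For orientation, the Rocchio update itself is simple arithmetic once per-term
weights are available from somewhere; the hard part in Lucene 1.3 is obtaining
those weights (later Lucene versions added term-vector support that makes this
easier). Below is a minimal, self-contained sketch of the formula
q' = alpha*q + beta*avg(relevant) - gamma*avg(nonRelevant), independent of
Lucene; the term names, weight maps, and parameter values are all made up for
illustration:

```java
import java.util.*;

/**
 * Self-contained sketch of Rocchio term reweighting.
 * Term weights are plain maps here; in a real system they would come
 * from wherever per-document term weights are available.
 */
public class Rocchio {

    public static Map<String, Double> reweight(Map<String, Double> query,
                                               List<Map<String, Double>> relevant,
                                               List<Map<String, Double>> nonRelevant,
                                               double alpha, double beta, double gamma) {
        Map<String, Double> result = new HashMap<>();
        // alpha * original query weights
        for (Map.Entry<String, Double> e : query.entrySet())
            result.merge(e.getKey(), alpha * e.getValue(), Double::sum);
        // + beta * centroid of the relevant documents
        for (Map<String, Double> doc : relevant)
            for (Map.Entry<String, Double> e : doc.entrySet())
                result.merge(e.getKey(), beta * e.getValue() / relevant.size(), Double::sum);
        // - gamma * centroid of the non-relevant documents
        for (Map<String, Double> doc : nonRelevant)
            for (Map.Entry<String, Double> e : doc.entrySet())
                result.merge(e.getKey(), -gamma * e.getValue() / nonRelevant.size(), Double::sum);
        // Terms pushed to zero or below are usually dropped from the new query.
        result.values().removeIf(w -> w <= 0);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>();
        query.put("house", 1.0);
        Map<String, Double> rel = new HashMap<>();
        rel.put("house", 0.5);
        rel.put("tree", 0.8);   // "tree" enters the query via expansion
        Map<String, Double> nonRel = new HashMap<>();
        nonRel.put("car", 1.0); // "car" is pushed out entirely
        System.out.println(reweight(query, Arrays.asList(rel),
                Arrays.asList(nonRel), 1.0, 0.75, 0.15));
    }
}
```

The resulting map can then be turned into a new BooleanQuery of boosted
TermQuery objects, which covers both reweighting and expansion in one step.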

Thank you very much in advance,
Karl







Re: Term weighting and Term boost

2004-01-16 Thread Karl Koch
Hello Andrzej,

sorry. I mistakenly ran it under Java 1.2.2, which cannot work :-) Then you
get thread exceptions...

Anyway, solved now. Thank you,
Karl

> Karl Koch wrote:
> 
> > Hello and thank you for this link. I think this is a very useful tool to
> > analyse Lucene internals.
> > 
> > 
> >>I realize this is not exactly the answer, but you may want to try one of
> 
> >>the new features of Luke (http://www.getopt.org/luke), namely the query 
> >>result explanation.
> > 
> > 
> > When I start it according to the description on your web site and select
> > the index directory, I get an error message "current thread not owner"...
> > 
> 
> I.e. Java WebStart, or by getting the jars and starting it from 
> command-line?
> 
> > What does it mean and what am I doing wrong?
> 
> Beats me... I've never seen something like that. Could you please turn 
> on the Java console, and see what kind of exception and where is thrown?
> 
> -- 
> Best regards,
> Andrzej Bialecki
> 
> -
> Software Architect, System Integration Specialist
> CEN/ISSS EC Workshop, ECIMF project chair
> EU FP6 E-Commerce Expert/Evaluator
> -
> FreeBSD developer (http://www.freebsd.org)
> 
> 
> 






Re: Term weighting and Term boost

2004-01-16 Thread Karl Koch

Hello and thank you for this link. I think this is a very useful tool to
analyse Lucene internals.

> I realize this is not exactly the answer, but you may want to try one of 
> the new features of Luke (http://www.getopt.org/luke), namely the query 
> result explanation.

When I start it according to the description on your web site and select the
index directory, I get an error message "current thread not owner"...

What does it mean and what am I doing wrong?

Kind Regards,
Karl


> 
> Currently the best way to start Luke is to use Java WebStart. Then open 
> an already existing index, go to the Search tab, enter a query (use 
> "Update" button to see exactly what it is parsed into), press Search, 
> and then highlight one of the results and press "Explain".
> 
> It was revealing for me to see how weights, boosts, normalizations etc. 
> are applied "under the hood" so to speak, especially for  Fuzzy or 
> Phrase queries.
> 
> After experimenting a little, you may want to consult the classes in 
> org.apache.lucene.search (e.g. Scorer and Similarity) to see the gory 
> details.
> 
> -- 
> Best regards,
> Andrzej Bialecki
> 
> -
> Software Architect, System Integration Specialist
> CEN/ISSS EC Workshop, ECIMF project chair
> EU FP6 E-Commerce Expert/Evaluator
> -
> FreeBSD developer (http://www.freebsd.org)
> 
> 
> 
> 







BooleanQuery question

2004-01-16 Thread Karl Koch
Hi all,

why does the boolean query have a "required" and a "prohibited" field (boolean
value)? If something is required it cannot be prohibited, and vice versa. How
does this match the Boolean model we know from theory?

Are there differences between Lucene and the theoretical Boolean model?
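For reference, the two flags in the 1.x API, BooleanQuery.add(query, required,
prohibited), encode three clause types rather than two independent properties:
(true, false) = required, (false, true) = prohibited, (false, false) = optional,
and (true, true) is rejected, so a clause is never both at once. A
self-contained sketch of these semantics, evaluated over a plain set of terms
instead of an index (the clause evaluation logic here is my reading of the
documented behaviour, not Lucene's actual code):

```java
import java.util.*;

/** Sketch of BooleanQuery's (required, prohibited) clause semantics. */
public class BooleanClauses {

    public static final class Clause {
        final String term;
        final boolean required, prohibited;
        public Clause(String term, boolean required, boolean prohibited) {
            // Lucene likewise rejects this combination.
            if (required && prohibited)
                throw new IllegalArgumentException(
                        "clause cannot be both required and prohibited");
            this.term = term;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    public static boolean matches(Set<String> docTerms, List<Clause> clauses) {
        boolean hasRequired = false, hasOptional = false, optionalHit = false;
        for (Clause c : clauses) {
            boolean hit = docTerms.contains(c.term);
            if (c.prohibited && hit) return false;                // prohibited term present
            if (c.required) { hasRequired = true; if (!hit) return false; }
            if (!c.required && !c.prohibited) { hasOptional = true; optionalHit |= hit; }
        }
        if (hasRequired) return true;        // all required matched, none prohibited
        if (hasOptional) return optionalHit; // purely optional: need at least one hit
        return false;                        // only prohibited clauses match nothing
    }
}
```

So the model is the usual Boolean AND / OR / NOT, just flattened into per-clause
flags instead of nested operators.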

Kind Regards,
Karl







Term weighting and Term boost

2004-01-16 Thread Karl Koch
Hello all,

I am new to the Lucene scene and have a few questions regarding the term
boost philosophy:

Is the term boost equal to a term weight? Example: if I boost a term with
0.2, does this mean the term then has a weight of 0.2?

If this is not the case, how is the term weight of the query calculated? Is
there a formula? Are there parts of it which I cannot influence? Does this
formula depend on the type of Query, or is it independent? Maybe somebody can
provide a small code example?

Given the following code:

TermQuery termQuery1 = new TermQuery(new Term("contents", "house"));
TermQuery termQuery2 = new TermQuery(new Term("contents", "tree"));
termQuery2.setBoost( ? );
BooleanQuery finalQuery = new BooleanQuery();
finalQuery.add(termQuery1, true, false);
finalQuery.add(termQuery2, true, false);

How can I make the term "tree" twice as important for the search
than "house"?

Many questions, I know, but I am sure that the experts here can answer them
easily.

Cheers,
Karl



