Lucene on PersonalJava ?? HELP!

2005-02-15 Thread Karl Koch
Hi, 

did anybody here run Lucene 1.3 or 1.2 under PersonalJava (equivalent to JDK
1.1)? I have a friend who runs Lucene 1.3 under PersonalJava and it works.
Mine doesn't. When comparing the code I cannot find any difference. I
search the index for a Query. 

I get an error saying that the method java.io.File.createNewFile() is used
in Lucene. I have checked Java 1.1.8 and indeed this method does not exist.

Besides the question of how it can work on my friend's system with the same
code, I am asking two more questions:

1) Has anybody here used Lucene on a PDA under PersonalJava and can share
some experience?

2) Is there anything else I should try or something I have forgotten?

Thanks for your help,
Karl

-- 
Let your thoughts run free... e.g. via FreeSMS
GMX offers up to 100 free SMS per month: http://www.gmx.net/de/go/mail

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene on PersonalJava ?? HELP!

2005-02-15 Thread Karl Koch
Hello,

thank you for the tip. I have solved the problem in a different way. If
anybody else wants to run Lucene on PJava, they might go for the same.

I am using the cvm VM instead of the jeode VM. Then it works fine with
Lucene 1.2 without any change in my code or in the Lucene code. Perhaps it
even works with a newer version (but I haven't tested that yet). :-)

Thank you anyway,
Karl

 On Tue, 2005-02-15 at 14:05 +0100, Karl Koch wrote:
  did anybody here run Lucene 1.3 or 1.2 under PersonalJava (equivalent to
  JDK 1.1)? I have a friend who runs Lucene 1.3 under PersonalJava and it
  works. Mine doesn't. When comparing the code I cannot find any
  difference. I search the index for a Query.
  
  I get an error saying that the method java.io.File.createNewFile() is
  used in Lucene. I have checked Java 1.1.8 and indeed this method does
  not exist.
  
  Besides the question of how it can work on my friend's system with the
  same code, I am asking two more questions:
  
  1) Has anybody here used Lucene on a PDA under PersonalJava and can
  share some experience?
  
  2) Is there anything else I should try or something I have forgotten?
 
 It might be the constructor of the IndexReader or IndexSearcher that
 you're using. You can pass in a string that points to the directory, or a
 File object instead. Lucene might be using
 java.io.File.createNewFile() if you pass in a string. 
 
 A simple grep should find out where it's being used.
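The grep suggestion can be sketched as follows; the directory layout below is a stand-in for wherever the Lucene sources are actually unpacked:

```shell
# Stand-in for an unpacked Lucene source tree (the real path will differ).
mkdir -p lucene-src/org/apache/lucene/store
printf 'file.createNewFile();\n' > lucene-src/org/apache/lucene/store/Lock.java

# Miles's suggestion: a recursive grep finds where the JDK 1.2+ method
# java.io.File.createNewFile() is called.
grep -rn "createNewFile" lucene-src
```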
 
 
 
 -- 
 Miles Barr [EMAIL PROTECTED]
 Runtime Collective Ltd.
 
 
 





Fieldinformation from Index

2005-02-15 Thread Karl Koch
Hello,

I have two questions which might be easy to answer from a Lucene expert:

1) I need to know which fields a collection of Documents has (given the
fact that not all documents necessarily use all fields). These Documents
are all stored in one index. Is there a way (with Lucene 1.2 or 1.3) to
find this out without going through each document and retrieving it?

2) I need to know which Analyzer was used to index a field. One important
rule, as we all know, is to use the same analyzer for indexing and searching
a field. Is this information stored in the index, or is it fully the
responsibility of the application developer?

Karl

-- 
DSL Komplett from GMX +++ Get started super-cheap and stress-free!
Promotion: no setup fee: http://www.gmx.net/de/go/dsl




JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread Karl Koch
Hello all,

I have heard that Lucene 1.3 Final should run under Java 1.1 (I need that
because I want to run a search on a PDA using Java 1.1).

However, when I run my code, I get the following error:

--

A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has
occurred in : 
  'org/apache/lucene/store/FSDirectory.getDirectory
(Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting
method.
  Please report this error in detail to
http://java.sun.com/cgi-bin/bugreport.cgi

Exception occurred in StandardSearch:search(String, String[], String)!
java.lang.IllegalMonitorStateException: current thread not owner
at org.apache.lucene.store.FSDirectory.makeLock(FSDirectory.java:312)
at org.apache.lucene.index.IndexReader.open(IndexReader.java, Compiled
Code)

--

The error does not occur when I run it under Java 1.4.

What am I doing wrong, and what do I need to change in order to make it
work? It must be my code. Here is the code relevant to this error (the
search method).


public static Result search(String queryString, String[] searchFields, 
  String indexDirectory) {
  // create access to index
  StandardAnalyzer analyser = new StandardAnalyzer();
  Hits hits = null;
  Result result = null;
  try {
  fsDirectory = 
FSDirectory.getDirectory(StandardSearcher.indexDirectory, false);
  IndexSearcher searcher = new IndexSearcher(fsDirectory);
  ...
}


What is wrong here?

Best Regards,
Karl





HELP! JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread Karl Koch
When I switch to Java 1.2, I cannot run it either. I also cannot index
anything. I have no idea why...

Can somebody help me?

Karl

 Hello all,
 
 I have heard that Lucene 1.3 Final should run under Java 1.1. (I need that
 because I want to run a search with a PDA using Java 1.1).
 
 However, when I run my code, I get the following error:
 
 --
 
 A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has
 occurred in : 
   'org/apache/lucene/store/FSDirectory.getDirectory
 (Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting
 method.
   Please report this error in detail to
 http://java.sun.com/cgi-bin/bugreport.cgi
 
 Exception occurred in StandardSearch:search(String, String[], String)!
 java.lang.IllegalMonitorStateException: current thread not owner
   at org.apache.lucene.store.FSDirectory.makeLock(FSDirectory.java:312)
   at org.apache.lucene.index.IndexReader.open(IndexReader.java, Compiled
 Code)
 
 --
 
 The error does not occur when I run it under Java 1.4.
 
 What am I doing wrong, and what do I need to change in order to make it
 work? It must be my code. Here is the code relevant to this error (the
 search method).
 
 
 public static Result search(String queryString, String[] searchFields, 
   String indexDirectory) {
   // create access to index
   StandardAnalyzer analyser = new StandardAnalyzer();
   Hits hits = null;
   Result result = null;
   try {
   fsDirectory = 
 FSDirectory.getDirectory(StandardSearcher.indexDirectory, false);
   IndexSearcher searcher = new IndexSearcher(fsDirectory);
   ...
 }
 
 
 What is wrong here?
 
 Best Regards,
 Karl
 
 
 





Re: HELP! JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread Karl Koch
I have a colleague who uses Lucene 1.3 on PersonalJava (equivalent to Java
1.1.8). I can't find a significant difference from his code (still
searching), but he did not make many changes. He also did not recompile
Lucene 1.3 on 1.1.8, etc.

It must be something simple. I will look for that switch...

In the meantime, I am thankful for any other help.

Cheers,
Karl

 On Tuesday 08 February 2005 18:49, sergiu gordea wrote:
  Karl Koch wrote:
 ...
  A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has
  occurred in : 
'org/apache/lucene/store/FSDirectory.getDirectory
  (Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting
  method.
Please report this error in detail to
  http://java.sun.com/cgi-bin/bugreport.cgi
 
 IIRC Java 1.1 had a switch to turn off JIT compilation. It did slow things
 down when I was using 1.1 (1.1.8?), but it might help you now...
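The switch Paul refers to is, on Sun's classic VMs, usually the java.compiler system property or the JAVA_COMPILER environment variable; whether a given VM honors it is VM-specific, and MySearchApp is a placeholder name:

```shell
# Disable the JIT on a classic Sun VM (VM-dependent; check your VM's docs).
# Either set the system property on the command line:
java -Djava.compiler=NONE MySearchApp
# ...or set the environment variable before launching:
JAVA_COMPILER=NONE java MySearchApp
```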
 
 Regards,
 Paul Elschot
 
 
 





Retrieve all documents - possible?

2005-02-07 Thread Karl Koch
Hi,

is it possible to retrieve ALL documents from a Lucene index? This should
then actually not be a search...

Karl





Some questions about index...

2005-02-05 Thread Karl Koch
Hi all,

for simplicity reasons I would like to use the index as my data storage
whilst using the advantage of the highly optimised Lucene index structure.

1) Can I store all the information of the text file, but also apply an
analyzer? E.g. I use the StopAnalyzer. After finding the document, I want to
extract the original text from the index as well. Does this require that I
store the information twice in two different fields (one indexed and one
unindexed)?

2) I would like to extract information from the index which can be found in
a Boolean way. I know that Lucene is a VSM which provides Boolean operators.
This however does not change its functioning. For example, I have a field
which contains an ID number and I want to use the search like a database
operation (e.g. to find the document with id=1). I can solve the problem
by searching with the query id:1. However, this does not ensure that I will
only get one result. Usually the first result is the document I want. But it
could happen that this sometimes does not work. What happens if I should
get no results? I guess if I search for id=5 and 5 did not exist, I would
probably get 50, 51, ... just because they contain 5. Did somebody work with
this and can suggest a stable solution?

A good solution for these two questions would help me avoid a database
which would need to replicate most of the data which I already have in my
Lucene index...

Kind Regards,
Karl






Re: Some questions about index...

2005-02-05 Thread Karl Koch
Thank you for your fast, straight and useful comments. Keeping in mind what
was said, did anybody actually think about implementing a kind of database
layer on top of a Lucene index? A database would be an index, columns would
be fields, and entries would be documents. At least everything which would
only require a single table could be done. A SELECT would be a search...

:-)

Karl

 
 On Feb 5, 2005, at 10:04 AM, Karl Koch wrote:
  1) Can I store all the information of the text file, but also apply an
  analyzer. E.g. I use the StopAnalyzer. After finding the document, I
  want to extract the original text also from the index. Does this require
  that I store the information twice in two different fields (one indexed
  and one unindexed)?
 
 You should use a single stored, tokenized, and indexed field for this 
 purpose.  Be cautious of how you construct the Field object to achieve 
 this.
 
  2) I would like to extract information from the index which can be found
  in a Boolean way. I know that Lucene is a VSM which provides Boolean
  operators. This however does not change its functioning. For example, I
  have a field which contains an ID number and I want to use the search
  like a database operation (e.g. to find the document with id=1). I can
  solve the problem by searching with the query id:1. However, this does
  not ensure that I will only get one result. Usually the first result is
  the document I want. But it could happen, that this sometimes does not
  work.
 
 Why wouldn't it work?  For ID-type fields, use a Field.Keyword (stored, 
 indexed, but not tokenized).  Search for a specific ID using a 
 TermQuery (don't use QueryParser for this, please).  If the ID values 
 are unique, you'll either get zero or one result.
 
  What happens if I should get no results? I guess if I search for id=5
  and 5 did not exist I would probably get 50, 51, ... just because they
  contain 5. Did somebody work with this and can suggest a stable
  solution?
 
 No, this would not be the case, unless you're analyzing the ID field 
 with some strange character-by-character analyzer or doing a wildcard 
 *5* type query.
 
  A good solution for these two questions would help me avoiding a
  database which would need to replicate most of the data which I already
  have in my Lucene index...
 
 You're on the right track and avoiding a database when it is overkill 
 or duplicative is commendable :)
 
   Erik
 
 
 





Re: which HTML parser is better?

2005-02-04 Thread Karl Koch
The link does not work.

 
 One which we've been using can be found at:
 http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/
 
 We absolutely need to be able to recover gracefully from malformed
 HTML and/or SGML.  Most of the nicer SAX/DOM/TLA parsers out there
 failed this criterion when we started our effort.  The above one is
 kind of SAX-y but doesn't fall over at the sight of a real web page
 ;-)
 
 Ian
 
 
 





Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
Hello Sergiu,

thank you for your help so far. I appreciate it.

I am working with Java 1.1 which does not include regular expressions.

Your turn ;-)
Karl 

 Karl Koch wrote:
 
  I am in control of the html, which means it is well formatted HTML. I use
  only HTML files which I have transformed from XML. No external HTML (e.g.
  the web).
  
  Are there any very-short solutions for that?
 
 if you are using only correctly formatted HTML pages and you are in control 
 of these pages,
 you can use a regular expression to remove the tags,
 
 something like
 replaceAll("<[^>]*>", "");
 
 This is the idea behind the operation. If you search on google you 
 will find a more robust regular expression.
 
 Using a simple regular expression will be a very cheap solution, that 
 can cause you a lot of problems in the future.
 
 It's up to you to use it.
 
 Best,
 
 Sergiu
 
  Karl Koch wrote:
  
   Hi,
   
   yes, but the library you are using is quite big. I was thinking that a
   5kB code could actually do that. That sourceforge project is doing much
   more than that but I do not need it.
  
  you need just the htmlparser.jar 200k.
  ... you know ... the functionality is strongly correlated with the size.
  
  You can use 3 lines of code with a good regular expression to eliminate
  the html tags, but this won't give you any guarantee that the text from
  the badly formatted html files will be correctly extracted...
  
  Best,
  
  Sergiu
  
   Karl
   
    Hi Karl,
    
    I already submitted a piece of code that removes the html tags.
    Search for my previous answer in this thread.
    
    Best,
    
    Sergiu
    
    Karl Koch wrote:
    
     Hello,
     
     I have been following this thread and have another question. 
     
     Is there a piece of sourcecode (which is preferably very short and
     simple (KISS)) which allows to remove all HTML tags from HTML
     content? HTML 3.2 would be enough... also no frames, CSS, etc. 
     
     I do not need to have the HTML structure tree or any other structure,
     but need a facility to clean up HTML into its normal underlying
     content before indexing that content as a whole.
     
     Karl
     
      I think that depends on what you want to do.  The Lucene demo parser
      does simple mapping of HTML files into Lucene Documents; it does not
      give you a parse tree for the HTML doc.  CyberNeko is an extension
      of Xerces (uses the same API; will likely become part of Xerces),
      and so maps an HTML document into a full DOM that you can manipulate
      easily for a wide range of purposes.  I haven't used JTidy at an API
      level and so don't know it as well -- based on its UI, it appears to
      be focused primarily on HTML validation and error
      detection/correction.
      
      I use CyberNeko for a range of operations on HTML documents that go
      beyond indexing them in Lucene, and really like it.  It has been
      robust for me so far.
      
      Chuck
      
       -----Original Message-----
       From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
       Sent: Tuesday, February 01, 2005 1:15 AM
       To: lucene-user@jakarta.apache.org
       Subject: which HTML parser is better?
       
       Three HTML parsers (Lucene web application demo, CyberNeko HTML
       Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
       best? Can it filter tags that are auto-created by MS-Word's 'Save
       As HTML files' function?

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
Unfortunately I am faithful ;-). Just for practical reasons I want to do
that in a single class, or even a method, called by another part of my Java
application. It should also run on Java 1.1, and it should be small and
simple. As I said before, I am in control of the HTML and it will be well
formatted, because I generate it from XML using XSLT.

Karl

 If you are not married to Java:
 http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm
 
 Otis
 
 --- sergiu gordea [EMAIL PROTECTED] wrote:
 
  Karl Koch wrote:
  
   I am in control of the html, which means it is well formatted HTML. I
   use only HTML files which I have transformed from XML. No external
   HTML (e.g. the web).
   
   Are there any very-short solutions for that?
  
  if you are using only correctly formatted HTML pages and you are in
  control of these pages,
  you can use a regular expression to remove the tags,
  
  something like
  replaceAll("<[^>]*>", "");
  
  This is the idea behind the operation. If you search on google you
  will find a more robust regular expression.
  
  Using a simple regular expression will be a very cheap solution, that
  can cause you a lot of problems in the future.
  
  It's up to you to use it.
  
  Best,
  
  Sergiu

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
I apologise in advance if some of my writing here has been said before.
The last three answers to my question have been suggesting pattern matching
solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing
is something I cannot use since I work with Java 1.1 on a PDA.

I am wondering if somebody knows a piece of simple sourcecode with low
requirements which runs under this tight specification.

Thank you all,
Karl

 No one has yet mentioned using ParserDelegator and ParserCallback that 
 are part of HTMLEditorKit in Swing.  I have been successfully using 
 these classes to parse out the text of an HTML file.  You just need to 
 extend HTMLEditorKit.ParserCallback and override the various methods 
 that are called when different tags are encountered.
 
 
 On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
 
  Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser,
  JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it
  filter tags that are auto-created by MS-Word's 'Save As HTML files'
  function?
 -- 
 Bill Tschumy
 Otherwise -- Austin, TX
 http://www.otherwise.com
 
 
 

-- 
Saving starts with GMX DSL: http://www.gmx.net/de/go/dsl




Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
I am using Java 1.1 with a Sharp Zaurus PDA. I have very limited memory
constraints. I do not think CPU performance is a big issue though. But I
have other parts in my application which use quite a lot of memory, and
something runs short. I therefore do not look into solutions which build up
tag trees etc., but rather a solution that reads a stream of HTML and
transforms it into a stream of text.

I see your point of using an external program. I am however not entirely
sure that one is available. Also, it would be much simpler to have a 3-5 kB
solution in Java, perhaps encapsulated in a class which does the job without
the need for advanced libraries which need 100-200 KB of my internal
storage. 

I hope I could clarify my situation now.

Cheers,
Karl 
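A sketch of the stream-in/stream-out approach described above, using only the Java 1.1 class library (no regex, no Swing). The class name and the naive single-state tag handling are my own illustration; they assume the well-formed, XSLT-generated HTML described earlier:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Reads HTML from a Reader and keeps only the character data outside of
// tags. Uses nothing beyond the Java 1.1 class library.
public class TagStripper {
    public static String strip(Reader in) throws IOException {
        StringBuffer text = new StringBuffer(); // StringBuilder is post-1.1
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                inTag = true;   // naive: assumes '<' only ever opens a tag
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        Reader html = new StringReader(
            "<html><body><p>Hello <b>PDA</b> world</p></body></html>");
        System.out.println(strip(html)); // prints "Hello PDA world"
    }
}
```

Because it processes one character at a time from a Reader, memory use stays constant regardless of file size, which fits the Zaurus constraints; the trade-off is that a literal '<' in the text (or in an attribute value) would be mis-handled.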

 Karl Koch wrote:
 
 Hello Sergiu,
 
 thank you for your help so far. I appreciate it.
 
 I am working with Java 1.1 which does not include regular expressions.
   
 
 Why are you using Java 1.1? Are you so limited in resources?
 What operating system do you use?
 I assume that you just need to index the html files, and you need an 
 html2txt conversion.
 If an external converter is a solution for you, you can use
 Runtime.getRuntime().exec(...) to run the converter that will extract the 
 information from your HTMLs and generate a .txt file. Then you can use a
 reader to index the txt.
 
 As I told you before, the best solution depends on your constraints 
 (time, effort, hardware, performance) and requirements :)
 
   Best,
 
   Sergiu
 
 Your turn ;-)
 Karl 
 
   
 

Re: which HTML parser is better? - Thread closed

2005-02-03 Thread Karl Koch
Thank you, I will do that.

 Karl Koch wrote:
 
 I apologise in advance if some of my writing here has been said before.
 The last three answers to my question have been suggesting pattern
 matching solutions and Swing. Pattern matching was introduced in Java
 1.4, and Swing is something I cannot use since I work with Java 1.1 on a
 PDA.
   
 
 I see,
 
 In this case you can read your HTML file line by line and then write
 something like this:
 
 String line;
 int startPos, endPos;
 StringBuffer text = new StringBuffer();
 while ((line = reader.readLine()) != null) {
     startPos = line.indexOf(">");
     endPos = line.indexOf("<");
     if (startPos > 0 && endPos > startPos)
         text.append(line.substring(startPos, endPos));
 }
 
 This is just a sample code that should work if you have just one tag per 
 line in the HTML file.
 This can be a start point for you.
 
   Hope it helps,
 
  Best,
 
  Sergiu
 
 I am wondering if somebody knows a piece of simple sourcecode with low
 requirements which runs under this tight specification.
 
 Thank you all,
 Karl
 
   
 
 No one has yet mentioned using ParserDelegator and ParserCallback that 
 are part of HTMLEditorKit in Swing.  I have been successfully using 
 these classes to parse out the text of an HTML file.  You just need to 
 extend HTMLEditorKit.ParserCallback and override the various methods 
 that are called when different tags are encountered.
 
 
 On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
 
 
 
  Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser,
  JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it
  filter tags that are auto-created by MS-Word's 'Save As HTML files'
  function?
   
 
 -- 
 Bill Tschumy
 Otherwise -- Austin, TX
 http://www.otherwise.com
 
 
 
 
 
 
   
 
 
 
 

-- 
10 GB mailbox, 100 free SMS: http://www.gmx.net/de/go/topmail
+++ GMX - the first address for Mail, Message, More +++




RE: which HTML parser is better?

2005-02-02 Thread Karl Koch
Hello,

I have been following this thread and have another question. 

Is there a piece of sourcecode (which is preferably very short and simple
(KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2
would be enough... also no frames, CSS, etc. 

I do not need to have the HTML structure tree or any other structure, but
need a facility to clean up HTML into its normal underlying content before
indexing that content as a whole.

Karl


 I think that depends on what you want to do.  The Lucene demo parser does
 simple mapping of HTML files into Lucene Documents; it does not give you a
 parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses
the
 same API; will likely become part of Xerces), and so maps an HTML document
 into a full DOM that you can manipulate easily for a wide range of
 purposes.  I haven't used JTidy at an API level and so don't know it as
well --
 based on its UI, it appears to be focused primarily on HTML validation and
 error detection/correction.
 
 I use CyberNeko for a range of operations on HTML documents that go beyond
 indexing them in Lucene, and really like it.  It has been robust for me so
 far.
 
 Chuck
 
-Original Message-
From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 1:15 AM
To: lucene-user@jakarta.apache.org
Subject: which HTML parser is better?

Three HTML parsers (Lucene web application
demo, CyberNeko HTML Parser, JTidy) are mentioned in
Lucene FAQ
1.3.27. Which is the best? Can it filter tags that are
auto-created by MS Word's 'Save As HTML' function?

_
Do You Yahoo!?
http://music.yisou.com/
http://image.yisou.com
http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
GMX im TV ... Die Gedanken sind frei ... Schon gesehen?
Jetzt Spot online ansehen: http://www.gmx.net/de/go/tv-spot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: which HTML parser is better?

2005-02-02 Thread Karl Koch
Hi,

yes, but the library you are using is quite big. I was thinking that about
5 kB of code could actually do that. That SourceForge project does much more
than that, but I do not need it.

Karl

   Hi Karl,
 
  I already submitted a piece of code that removes the HTML tags.
  Search for my previous answer in this thread.
 
   Best,
 
Sergiu
 
 Karl Koch wrote:
 
 Hello,
 
 I have been following this thread and have another question.
 
 Is there a piece of source code (preferably very short and simple (KISS))
 which removes all HTML tags from HTML content? HTML 3.2 would be
 enough... also no frames, CSS, etc.
 
 I do not need the HTML structure tree or any other structure, but need a
 facility to clean HTML down to its plain underlying content before
 indexing that content as a whole.
 
 Karl
 
 
   
 
 I think that depends on what you want to do.  The Lucene demo parser
 does
 simple mapping of HTML files into Lucene Documents; it does not give you
 a
 parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses
 
 
 the
   
 
 same API; will likely become part of Xerces), and so maps an HTML
 document
 into a full DOM that you can manipulate easily for a wide range of
 purposes.  I haven't used JTidy at an API level and so don't know it as
 
 
 well --
   
 
 based on its UI, it appears to be focused primarily on HTML validation
 and
 error detection/correction.
 
 I use CyberNeko for a range of operations on HTML documents that go
 beyond
 indexing them in Lucene, and really like it.  It has been robust for me
 so
 far.
 
 Chuck
 
-Original Message-
From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 1:15 AM
To: lucene-user@jakarta.apache.org
Subject: which HTML parser is better?

Three HTML parsers (Lucene web application
demo, CyberNeko HTML Parser, JTidy) are mentioned in
Lucene FAQ
1.3.27. Which is the best? Can it filter tags that are
auto-created by MS Word's 'Save As HTML' function?

_
Do You Yahoo!?
http://music.yisou.com/
http://image.yisou.com
http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/

   
 -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
   
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
10 GB Mailbox, 100 FreeSMS http://www.gmx.net/de/go/topmail
+++ GMX - die erste Adresse für Mail, Message, More +++

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: which HTML parser is better?

2005-02-02 Thread Karl Koch
I am in control of the HTML, which means it is well-formatted HTML. I use
only HTML files which I have transformed from XML. No external HTML (e.g.
from the web).

Are there any very-short solutions for that?

Karl

 Karl Koch wrote:
 
 Hi,
 
 yes, but the library you are using is quite big. I was thinking that about
 5 kB of code could actually do that. That SourceForge project does much more
 than that, but I do not need it.
   
 
 you need just the htmlparser.jar, about 200 kB.
 ... you know ... the functionality is strongly correlated with the size.
 
 You can use three lines of code with a good regular expression to eliminate
 the HTML tags, but this won't give you any guarantee that the text from
 badly formatted HTML files will be correctly extracted...
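A sketch of the regex route Sergiu mentions, assuming reasonably well-formed HTML (it will mishandle '<' inside scripts or attribute values; the class name is mine):

```java
// Strip tags with one regex, then collapse the leftover whitespace.
public class RegexTagStripper {
    public static String strip(String html) {
        return html.replaceAll("(?s)<[^>]*>", " ")   // remove anything between < and >
                   .replaceAll("\\s+", " ")          // collapse runs of whitespace
                   .trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>one <b>two</b>\nthree</p>"));
    }
}
```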
 
   Best,
 
   Sergiu
 
 Karl
 
   
 
   Hi Karl,
 
  I already submitted a piece of code that removes the HTML tags.
  Search for my previous answer in this thread.
 
   Best,
 
Sergiu
 
 Karl Koch wrote:
 
 
 
 Hello,
 
 I have  been following this thread and have another question. 
 
 Is there a piece of sourcecode (which is preferably very short and
 simple
 (KISS)) which allows to remove all HTML tags from HTML content? HTML
 3.2
 would be enough...also no frames, CSS, etc. 
 
 I do not need to have the HTML structure tree or any other structure
 but
 need a facility to clean up HTML into its normal underlying content
   
 
 before
 
 
 indexing that content as a whole.
 
 Karl
 
 
  
 
   
 
 I think that depends on what you want to do.  The Lucene demo parser
 
 
 does
 
 
 simple mapping of HTML files into Lucene Documents; it does not give
 you
 
 
 a
 
 
 parse tree for the HTML doc.  CyberNeko is an extension of Xerces
 (uses

 
 
 
 the
  
 
   
 
 same API; will likely become part of Xerces), and so maps an HTML
 
 
 document
 
 
 into a full DOM that you can manipulate easily for a wide range of
 purposes.  I haven't used JTidy at an API level and so don't know it
 as

 
 
 
 well --
  
 
   
 
 based on its UI, it appears to be focused primarily on HTML validation
 
 
 and
 
 
 error detection/correction.
 
 I use CyberNeko for a range of operations on HTML documents that go
 
 
 beyond
 
 
 indexing them in Lucene, and really like it.  It has been robust for
 me
 
 
 so
 
 
 far.
 
 Chuck
 
   -Original Message-
   From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, February 01, 2005 1:15 AM
   To: lucene-user@jakarta.apache.org
   Subject: which HTML parser is better?
   
   Three HTML parsers (Lucene web application
   demo, CyberNeko HTML Parser, JTidy) are mentioned in
   Lucene FAQ
   1.3.27. Which is the best? Can it filter tags that are
   auto-created by MS Word's 'Save As HTML' function?
   
   _
   Do You Yahoo!?
   http://music.yisou.com/
   http://image.yisou.com
   http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/
   
  
 
 
 -
 
 
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

 
 
 
  
 
   
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
   
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
GMX im TV ... Die Gedanken sind frei ... Schon gesehen?
Jetzt Spot online ansehen: http://www.gmx.net/de/go/tv-spot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Different Documents (with fields) in one index?

2005-01-27 Thread Karl Koch
Hello all,

perhaps not such a sophisticated question: 

I would like to have a very diverse set of documents in one index. Depending
on the content of the text documents, I would like to put parts of the text in
different fields. This means that in searches, when searching a particular
field, some of those documents won't be matched at all.

Is it possible to have different kinds of Documents with different index
fields in ONE index? Or do I need one index for each set?

Karl
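(An untested sketch of what this looks like against the Lucene 1.x API: two kinds of documents with disjoint fields added through the same IndexWriter; the field names are made up for illustration.)

```java
// "Article" documents carry title/body fields...
Document article = new Document();
article.add(Field.Text("title", "Lucene on PersonalJava"));
article.add(Field.Text("body", "full article text ..."));

// ...while "image" documents carry only a caption field.
Document image = new Document();
image.add(Field.Text("caption", "a screenshot"));

writer.addDocument(article);
writer.addDocument(image);
// A search on the "caption" field simply never matches the article documents.
```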

-- 
10 GB Mailbox, 100 FreeSMS http://www.gmx.net/de/go/topmail
+++ GMX - die erste Adresse für Mail, Message, More +++

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



XML index

2005-01-27 Thread Karl Koch
Hi,

I want to use kXML with Lucene to index XML files. I think it is possible to
dynamically assign node names as Document fields and node text as Text
(after using an Analyzer).

I have seen some XML indexing in the Sandbox. Is anybody here who has done
something with a thin pull parser (perhaps even kXML)? Does anybody know of
a project or some source code available which covers this topic?

Karl

 

-- 
Sparen beginnt mit GMX DSL: http://www.gmx.net/de/go/dsl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene on JSE 1.1.8...

2005-01-25 Thread Karl Koch
Hello,

does somebody here know which Lucene version runs on Java 1.1? Of course I
would like to run the best (latest) version possible :-)

Karl

-- 
10 GB Mailbox, 100 FreeSMS http://www.gmx.net/de/go/topmail
+++ GMX - die erste Adresse für Mail, Message, More +++

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene index - information

2004-03-19 Thread Karl Koch
If I create a standard index, what does Lucene store in this index?

What should be stored in an index at least? Just a link to the file and
keywords? Or also word numbers? What else?

Does somebody know a paper which discusses this problem of what to put in
a good universal IR index?

Cheers,
Karl

-- 
+++ NEU bei GMX und erstmalig in Deutschland: TÜV-geprüfter Virenschutz +++
100% Virenerkennung nach Wildlist. Infos: http://www.gmx.net/virenschutz


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



VSpace Model Index - Prob. Model Index - Difference?

2004-03-19 Thread Karl Koch
Hello group,

coming back to the discussion about the probabilistic and vector space models
(which occurred here some time ago), I would like to ask something related.

I only know the index structure Lucene offers. Does the index of an IR system
based on the probabilistic model (e.g. Okapi) look different from that of a VS
model? If yes, why?

I hope this question is not too stupid. I am mainly interested because of
some theoretical background...

Karl

 Uh, there are lots of ways to construct an inverted index.
 Citeseer will give you more than you can read on this topic.
 
 As for Lucene, see File Formats section on the site.
 
 Otis
 
 --- Karl Koch [EMAIL PROTECTED] wrote:
  If I create a standard index, what does Lucene store in this index?
  
  What should be stored in an index at least? Just a link to the file
  and
  keywords? Or also word numbers? What else?
  
  Does somebody know a paper which discusses this problem of what to
  put in
  a good universal IR index?
  
  Cheers,
  Karl
  
  -- 
  +++ NEU bei GMX und erstmalig in Deutschland: TÜV-geprüfter
  Virenschutz +++
  100% Virenerkennung nach Wildlist. Infos:
  http://www.gmx.net/virenschutz
  
  
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
  
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
+++ NEU bei GMX und erstmalig in Deutschland: TÜV-geprüfter Virenschutz +++
100% Virenerkennung nach Wildlist. Infos: http://www.gmx.net/virenschutz


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Search one keyword in two fields - How?

2004-03-02 Thread Karl Koch
Hi,

What do I need to search for one single keyword in two fields of one index?
Can someone provide an easy example or tell me where to find it?

Cheers,
Karl
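(One way to do it, sketched untested against the Lucene 1.x API: wrap one TermQuery per field in a BooleanQuery with both clauses optional, so a hit in either field qualifies; the field names are illustrative.)

```java
BooleanQuery query = new BooleanQuery();
// add(Query, required, prohibited): both false makes the clause optional
query.add(new TermQuery(new Term("title", "lucene")), false, false);
query.add(new TermQuery(new Term("body",  "lucene")), false, false);
Hits hits = searcher.search(query);
```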

-- 
+++ NEU bei GMX und erstmalig in Deutschland: TÜV-geprüfter Virenschutz +++
100% Virenerkennung nach Wildlist. Infos: http://www.gmx.net/virenschutz


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: setMaxClauseCount ??

2004-01-21 Thread Karl Koch
Hi Doug,

thank you for the answer so far.

I actually wanted to add a large amount of text from an existing document to
find a closely related one. Can you suggest another good way of doing this?
A direct match will not occur anyway. How can I make a query that behaves most
like the Vector Space Model (VSM), with each word a dimension value, to find
documents close to it? You know as well as I that the standard VSM does not
have any Boolean logic inside... how do I need to formulate the query to make
it as similar as possible to a vector, in order to find similar documents in
the vector space of the Lucene index?

Cheers,
Karl

 setMaxClauseCount determines the maximum number of clauses, which is not 
 your problem here.  Your problem is with required clauses.  There may 
 only be a total of 31 required (or prohibited) clauses in a single 
 BooleanQuery.  If you need more, then create more BooleanQueries and 
 combine them with another BooleanQuery.  Perhaps this could be done 
 automatically, but I've never heard anyone encounter this limit before. 
   Do you really mean for 32 different terms to be required?  Do any 
 documents actually match this query?
 
 Doug
 
 Karl Koch wrote:
  Hi group,
  
  I ran into an IndexOutOfBoundsException:
  
  - java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
  clauses in query.
  
  The reason: I have more than 32 BooleanClauses. From the mailing list I
 got
  the info on how to set the maximum number of clauses higher before the loop:
  
  ...
  myBooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
  while (true) {
    Token token = tokenStream.next();
    if (token == null) {
      break;
    }
    myBooleanQuery.add(new TermQuery(new Term("bla", token.termText())), true,
  false);
  } ... 
  
  However the error still remains, why?
  
  Karl
  
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: setMaxClauseCount ??

2004-01-21 Thread Karl Koch
Hello Doug,

that sounds interesting to me. I refer to a paper written by NIST about
relevance feedback which ran tests with 20-200 words. This is why I thought it
might be good to be able to use all non-stopwords of a document for that and
see what happens. Do you know good papers about strategies for selecting
keywords effectively, beyond stopword lists and stemming?

Using the term frequencies of the document is not really possible, since
Lucene does not provide access to a document vector, does it?

By the way, could you send me Dmitry's code for the vector extension?
I have been asking in another thread but did not get it so far. I really
would like to have a look... Also it would be nice to know the status of
integrating it into Lucene 1.3. Who is working on it and how could I
contribute?

Cheers,
Karl


 Andrzej Bialecki wrote:
  Karl Koch wrote:
  I actually wanted to add a large amount of text from an existing 
  document to
  find a close related one. Can you suggest another good way of doing 
  this.
 
  You should try to reduce the dimensionality by reducing the number of 
  unique features. In this case, you could for example use only keywords 
  (or key phrases) instead of the full content of documents.
 
 Indeed, this is a good approach.  In my experience, six or eight terms 
 are usually enough, and they needn't all be required.
 
 Doug
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to limit terms in the index

2004-01-19 Thread Karl Koch
Hello,

I think you have to write your own Analyzer which filters out all other words
before providing the text to the indexer.

karl
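(An untested sketch of such a filter against the Lucene 1.x analysis API; the class name and the idea of passing the ontology terms as a Set are mine.)

```java
// Keeps only tokens whose text appears in the allowed term set.
public class OntologyTokenFilter extends TokenFilter {
    private final Set allowed;

    public OntologyTokenFilter(TokenStream in, Set allowedTerms) {
        super(in);
        this.allowed = allowedTerms;
    }

    public Token next() throws IOException {
        Token token;
        while ((token = input.next()) != null) {
            if (allowed.contains(token.termText()))
                return token;   // keep ontology terms
        }
        return null;            // drop everything else
    }
}
```

Your Analyzer would then wrap its usual tokenizer chain with this filter.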



 Hi,
 I'm using Lucene to index documents and I want to limit
 the terms indexed to a list of terms provided by an ontology.
 Can someone help me with how to do that?
 
 Thanks,
 GD
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Extracting particular document from index

2004-01-18 Thread Karl Koch
Hi all,

let's say I have an index with documents encoded in two fields, filename and
data. Is it possible to extract a file whose filename I know directly from
this index, without performing any search, like random access in a
filesystem?

Karl

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Getting all index fields of an index

2004-01-18 Thread Karl Koch
How can I get a list of all fields in an index of which I know only the
directory string?

Karl

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Extracting particular document from index

2004-01-18 Thread Karl Koch
Hi all,

I have only the file's location and its name. There is also an index made
from such files which contains the file I am looking for. I want to know two
things:
1) first of all, all fields which can be searched within this index;
2) secondly, the content as it is represented in each field.

As you can see, I am asking for the way back. From the file itself I cannot
infer how it was indexed. The indexer has parsed and divided it somehow, in a
way I do not know. This information should be in the index. I am basically
looking for functionality like this:

Index index = new Index("X:/myindex");
Field[] fields = index.getAllFields();
String[] fieldNames = new String[fields.length];
for (int i = 0; i < fields.length; i++) {
  fieldNames[i] = fields[i].name();
}

Using the field name I know now, I could perform a search and get the first
entry of the Hits list, which would be my Document. :-)

Does something like that exist in Lucene?

Cheers mates,
Karl
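(If I recall the 1.3 API correctly, IndexReader gets close to item 1; an untested sketch, assuming IndexReader.getFieldNames() exists in your version.)

```java
IndexReader reader = IndexReader.open("X:/myindex");
// collection of String names of all fields that occur in the index
Collection fieldNames = reader.getFieldNames();
for (Iterator it = fieldNames.iterator(); it.hasNext(); ) {
    System.out.println((String) it.next());
}
reader.close();
```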


 On Jan 18, 2004, at 11:15 AM, Karl Koch wrote:
  lets say I have an index with documents encoded in two fields 
  filename and
  data. Is it possible to extract a file from which I know the filename
  directly from this index without performing any search. Like a random 
  access like
  in a filesystem?
 
 It is still technically a search, but a TermQuery will be basically 
 direct access to the document(s) matching that term.
 
   Erik
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: theorical informations

2004-01-18 Thread Karl Koch
Hello Nicolas,

I am sure you mean IR (Information Retrieval) model. Lucene implements a
vector space model with an integrated Boolean model. This means the Boolean
model is integrated via a Boolean query language but mapped into the vector
space. Therefore you have ranking, even though the traditional Boolean model
does not support this. Cosine similarity is used to measure similarity between
documents and the query. You can find a very long discussion of this here when
you search the archive...

Karl
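(For reference, the cosine similarity mentioned above, as a small self-contained sketch over plain term-weight vectors; nothing Lucene-specific, and the example vectors are mine.)

```java
public class Cosine {
    // cosine(a, b) = (a . b) / (|a| * |b|)
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // vectors pointing the same way score near 1.0, orthogonal ones 0.0
        System.out.println(cosine(new double[] {1, 2}, new double[] {2, 4}));
        System.out.println(cosine(new double[] {1, 0}, new double[] {0, 1}));
    }
}
```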

 hi,
 I have 2 theoretical questions:
 
 I searched the mailing list for the I.R. model implemented in Lucene,
 but found no precise answer.
 
 1) What is the I.R. model implemented in Lucene? (e.g. Boolean model,
 vector model, probabilistic model, etc.)
 
 2) What is the theoretical similarity function implemented in Lucene
 (Euclidean, cosine, Jaccard, Dice)?
 
 (Why is this important information not on the Lucene web site or in the
 FAQ?)
 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: theorical informations

2004-01-18 Thread Karl Koch
Actually, finding an answer to this question is not really important. More
important is whether you can do what you want with it. If your result comes
from a probabilistic model or a vector space model, who cares, if you just
want to give a query and get back a hit list of results?

Possibly some people here will strongly disagree... ;-) (?)

Karl

 Hello Nicolas,
 
 I am sure you mean IR (Information Retrieval) Model. Lucene implements a
 Vector Space Model with integrated Boolean Model. This means the Boolean
 model
 is integrated with a Boolean query language but mapped into the Vector
 Space.
 Therefore you have ranking even though the traditional Boolean model does
 not
 support this. Cosine similarity is used to measure similarity between
 documents and the query. You can find this in a very long dicussion here
 when you
 search the archive...
 
 Karl
 
  hy , 
  i have 2  theorycal questions :
  
  i searched in the mailing list the R.I. model implemented in Lucene , 
  but no precise answer.
  
  1) What is the R.I model implemented in Lucene ? (ex: Boolean Model, 
  Vector Model,Probabilist Model, etc... ) 
  
  2) What is the theory Similarity function  implemented in Lucene 
  (Euclidian, Cosine, Jaccard, Dice)
  
  (why this important informations is not in the Lucene Web site or in the
 
  faq ? )
  
 
 -- 
 +++ GMX - die erste Adresse für Mail, Message, More +++
 Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



setMaxClauseCount ??

2004-01-18 Thread Karl Koch
Hi group,

I ran into an IndexOutOfBoundsException:

- java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
clauses in query.

The reason: I have more than 32 BooleanClauses. From the mailing list I got
the info on how to set the maximum number of clauses higher before the loop:

...
myBooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
while (true) {
  Token token = tokenStream.next();
  if (token == null) {
    break;
  }
  myBooleanQuery.add(new TermQuery(new Term("bla", token.termText())), true,
false);
} ... 

However the error still remains, why?

Karl

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: difference in javadoc and faq similarity expression

2004-01-18 Thread Karl Koch
I would rely on the JavaDoc since this one is up to date. The latest version
1.3 final is just a few weeks old. Some entries in the FAQ however are still
from 2001...

Cheers,
Karl

 hi,
 I have trouble finding the correspondence between the Javadoc and FAQ
 similarity expressions
 
 in the Similarity Javadoc
 
 score(q,d) =Sum [tf(t in d) * idf(t) * getBoost(t.field in d) *
 lengthNorm(t.field in d)  * coord(q,d) * queryNorm(q) ]
 
 in the FAQ
 
 score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t)
 *
 coord_q_d
 
 In FAQ | In Javadoc
 1 / norm_q = queryNorm(q)
 1 / norm_d_t=lengthNorm(t.field in d)
 coord_q_d=coord(q,d)
 boost_t=getBoost(t.field in d)
 idf_t=idf(t)
 tf_d=tf(t in d)
 
 but
 where is the Javadoc counterpart of the FAQ's tf_q expression?
 
 nicolas
 
 - Original Message - 
 From: Nicolas Maisonneuve [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Sunday, January 18, 2004 9:33 PM
 Subject: Re: theorical informations
 
 
  thanks Karl !
 
  - Original Message - 
  From: Karl Koch [EMAIL PROTECTED]
  To: Lucene Users List [EMAIL PROTECTED]
  Sent: Sunday, January 18, 2004 9:22 PM
  Subject: Re: theorical informations
 
 
   Actually, finding an answer to this question is not really important.
 More
   important is if you can do what you want with it. If you result comes
 from
  a
   prob. model or a vector space model, who cares if you just want to
 give
 a
   query and back a hit list of results?
  
   Possibliy some people here will strongly disagree... ;-) (?)
  
   Karl
  
Hello Nicolas,
   
I am sure you mean IR (Information Retrieval) Model. Lucene
 implements
 a
Vector Space Model with integrated Boolean Model. This means the
 Boolean
model
is integrated with a Boolean query language but mapped into the
 Vector
Space.
Therefore you have ranking even though the traditional Boolean model
  does
not
support this. Cosine similarity is used to measure similarity
 between
documents and the query. You can find this in a very long dicussion
 here
when you
search the archive...
   
Karl
   
 hy ,
 i have 2  theorycal questions :

 i searched in the mailing list the R.I. model implemented in
 Lucene
 ,
 but no precise answer.

 1) What is the R.I model implemented in Lucene ? (ex: Boolean
 Model,
 Vector Model,Probabilist Model, etc... )

 2) What is the theory Similarity function  implemented in Lucene
 (Euclidian, Cosine, Jaccard, Dice)

 (why this important informations is not in the Lucene Web site or
 in
  the
   
 faq ? )

   
-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR
 http://www.gmx.net/topmail
   
   
   
 -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   
  
   -- 
   +++ GMX - die erste Adresse für Mail, Message, More +++
   Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail
  
  
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]
  
  
 
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Relevance Feedback (2)

2004-01-17 Thread Karl Koch
Hello group,

I would like to implement relevance feedback functionality for my system.
From the previous discussion in this group I know that this is not implemented
in Lucene.

We all know that relevance feedback has two parts, which are
1) term reweighting
2) query expansion

I am interested in doing both.

My first thought was that term reweighting can be solved with term boosting
and expansion, basically by generating a new query. Looking closely at one of
the classic term reweighting formulas (Rocchio), however, reveals that I need
access to the term vector of the relevant as well as of the non-relevant
documents. Bringing this to Lucene, it would mean that I need the weight of
each term in the relevant and non-relevant documents to evaluate the
reweighting formula.
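(For reference, the classic Rocchio update referred to here is q' = alpha*q + beta*mean(relevant) - gamma*mean(non-relevant). A small self-contained sketch over term-to-weight maps, independent of Lucene; alpha, beta, gamma are the usual tuning constants, and all names are mine.)

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Rocchio {
    // q' = alpha*q + beta*mean(relevant) - gamma*mean(nonRelevant),
    // with the query and documents represented as term -> weight maps
    public static Map<String, Double> reweight(Map<String, Double> query,
                                               List<Map<String, Double>> relevant,
                                               List<Map<String, Double>> nonRelevant,
                                               double alpha, double beta, double gamma) {
        Map<String, Double> result = new HashMap<>();
        addScaled(result, query, alpha);
        for (Map<String, Double> d : relevant)
            addScaled(result, d, beta / relevant.size());
        for (Map<String, Double> d : nonRelevant)
            addScaled(result, d, -gamma / nonRelevant.size());
        return result;
    }

    // into += factor * vec, accumulating per term
    private static void addScaled(Map<String, Double> into, Map<String, Double> vec, double factor) {
        for (Map.Entry<String, Double> e : vec.entrySet())
            into.merge(e.getKey(), factor * e.getValue(), Double::sum);
    }
}
```

Query expansion then falls out naturally: terms that end up with a positive weight but were not in the original query are the expansion candidates.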

Coming back to Lucene, this would mean that I need to extract Documents from
the Hits object after the search. From these Documents I would need to get
all terms and their weights.

However, Lucene does not provide this. Only Documents and their scores can be
retrieved. It does not provide access to their terms, and therefore no access
to term weights.

Does somebody have ideas for a workaround for term reweighting and query
expansion without going through Hits? Has somebody produced such workarounds
and could provide them to me?

Thank you very much in advance,
Karl


-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Relevance Feedback (2)

2004-01-17 Thread Karl Koch
Hello all,

oh, I just found a mail from Doug where he wrote that Dmitry Serebrennikov
developed something which provides document vector access:
 Dmitry Serebrennikov [dmitrys?X0040;earthlink.net] has implemented a
substantial
 extension to Lucene which should help folks doing this sort of research. 
It
  provides an explicit vector representation for documents.  This way you
can,
 e.g., retrieve a number of documents, efficiently sum their vectors, then
 derive a new query from the sum.  This code was posted to the list a long
 while back, but is now out of date.  As soon as the 1.2 release is final,
 and Dmitry has time, he intends to merge it into Lucene.

Who has this code? Could somebody email it to me? I would highly appreciate
it.

Is there any attempt from Dmitry or somebody else to adapt it to Lucene 1.3?


I wish you all a nice weekend,
Karl

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexReader.document(int i)

2004-01-17 Thread Karl Koch
Hello back,

according to the Javadoc it means:

Returns the stored fields of the nth Document in this index.

In your case that would mean: the greater n, the younger the document. However,
I am not sure how you can create such an index. I think you should have a look
at the Luke project, which allows you to access and look into Lucene indices.

If I did not answer your question, please explain a little bit further...

Karl




 hi,
 I would like to know, in IndexReader.document(int i), what this number i is.
 Is the first document the oldest document indexed
 and the last the youngest? (So we can sort by date easily?)
 
 Thanks in advance,
 
 nico

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Searching for similar documents in Lucene

2004-01-17 Thread Karl Koch
Hello all,

how can I find the document most similar to a given document? Do I need to
perform one search for each field in the document and merge the resulting
hit lists?

Maybe somebody has already done that and can give me hints / maybe an
example?
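A common approach — a hypothetical sketch, not a built-in Lucene call — is to treat each document as a term-frequency vector and rank candidates by cosine similarity; in Lucene terms, the vectors would come from re-tokenizing the documents' stored fields:

```java
import java.util.*;

public class Cosine {
    // Cosine similarity between two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        return dot == 0 ? 0 : dot / (norm(a) * norm(b));
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double s = 0;
        for (int f : v.values()) s += (double) f * f;
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        Map<String, Integer> source = Map.of("tree", 2, "house", 1);
        Map<String, Integer> cand   = Map.of("tree", 1, "garden", 1);
        // Rank all candidate documents by this score; the highest wins.
        System.out.printf("%.3f%n", cosine(source, cand));
    }
}
```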

Cheers and good night,
Karl






Term weighting and Term boost

2004-01-16 Thread Karl Koch
Hello all,

I am new to the Lucene scene and have a few questions regarding the term
boost philosophy:

Is the term boost equal to a term weight? Example: if I boost a term with
0.2, does this mean the term then has a weight of 0.2?

If this is not the case, how is the term weight of the query calculated
then? Is there a formula? Are there parts of it which I cannot influence? Does
the formula depend on the type of Query, or is it independent? Maybe somebody
can provide a small code example?

Given the following code:

TermQuery termQuery1 = new TermQuery(new Term("contents", "house"));
TermQuery termQuery2 = new TermQuery(new Term("contents", "tree"));
termQuery2.setBoost( ? );
BooleanQuery finalQuery = new BooleanQuery();
finalQuery.add(termQuery1, true, false);
finalQuery.add(termQuery2, true, false);

How can I make the term "tree" twice as important for the search as
"house"?

Many questions, I know, but I am sure the experts here can answer them
easily.
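As far as I understand Lucene's scoring, the boost is a multiplicative factor on a term's score contribution (default 1.0), not the weight itself, so setBoost(2.0f) on termQuery2 should make "tree" count double, everything else being equal. A toy illustration of that multiplication (deliberately ignoring idf, norms, and query normalization, which Lucene also applies):

```java
public class BoostDemo {
    // Toy scoring: contribution = boost * tf. Real Lucene multiplies in
    // further factors (idf, length norm, query norm) on top of the boost.
    static double contribution(float boost, int tf) {
        return boost * tf;
    }

    public static void main(String[] args) {
        double house = contribution(1.0f, 3); // default boost
        double tree  = contribution(2.0f, 3); // boosted term
        System.out.println(tree / house);     // prints 2.0 — twice the weight
    }
}
```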

Cheers,
Karl







Re: Term weighting and Term boost

2004-01-16 Thread Karl Koch

Hello and thank you for this link. I think this is a very useful tool to
analyse Lucene internals.

 I realize this is not exactly the answer, but you may want to try one of 
 the new features of Luke (http://www.getopt.org/luke), namely the query 
 result explanation.

When I start it according to the description on your web site and select the
index directory, I get an error message: "current thread not owner"...

What does it mean, and what am I doing wrong?

Kind Regards,
Karl


 
 Currently the best way to start Luke is to use Java WebStart. Then open 
 an already existing index, go to the Search tab, enter a query (use 
 Update button to see exactly what it is parsed into), press Search, 
 and then highlight one of the results and press Explain.
 
 It was revealing for me to see how weights, boosts, normalizations etc. 
 are applied under the hood so to speak, especially for  Fuzzy or 
 Phrase queries.
 
 After experimenting a little, you may want to consult the classes in 
 org.apache.lucene.search (e.g. Scorer and Similarity) to see the gory 
 details.
 
 -- 
 Best regards,
 Andrzej Bialecki
 
 -
 Software Architect, System Integration Specialist
 CEN/ISSS EC Workshop, ECIMF project chair
 EU FP6 E-Commerce Expert/Evaluator
 -
 FreeBSD developer (http://www.freebsd.org)
 
 
 
 







Re: Term weighting and Term boost

2004-01-16 Thread Karl Koch
Hello Andrzej,

sorry. I mistakenly ran it under Java 1.2.2, which cannot work :-) Then you
get thread exceptions...

Anyway, solved now. Thank you,
Karl

 Karl Koch wrote:
 
  Hello and thank you for this link. I think this is a very useful tool to
  analyse Lucene internals.
  
  
 I realize this is not exactly the answer, but you may want to try one of
 
 the new features of Luke (http://www.getopt.org/luke), namely the query 
 result explanation.
  
  
  When I start it according to the description on your web site and select
  the index directory I get an error message "current thread not owner"...
  
 
 I.e. Java WebStart, or by getting the jars and starting it from 
 command-line?
 
  What does it mean and what do I wrong?
 
 Beats me... I've never seen something like that. Could you please turn 
 on the Java console, and see what kind of exception and where is thrown?
 
 -- 
 Best regards,
 Andrzej Bialecki
 
 -
 Software Architect, System Integration Specialist
 CEN/ISSS EC Workshop, ECIMF project chair
 EU FP6 E-Commerce Expert/Evaluator
 -
 FreeBSD developer (http://www.freebsd.org)
 
 
 


