Re: Filters for Openoffice File Indexing available (Java)

2004-11-27 Thread Peter Becker
Joachim Arrasz wrote: Hello List. we have written an application which includes OpenOffice Integration into an OpenSource CMS (OpenCms). For this CMS there is a Lucene Integration available under sourceforge. So now we are looking for search and index Filters for Lucene, that were able to

Re: Bridge with OpenOffice

2004-04-19 Thread Peter Becker
We did a simple one a while ago. Could probably be a bit more sophisticated, but it seems to do its job on the little bit of testing we did. See http://cvs.sourceforge.net/viewcvs.py/toscanaj/docco/source/org/tockit/docco/documenthandler/OpenOfficeDocumentHandler.java?rev=1.4&view=auto HTH,

ANN: Docco 0.3

2004-04-13 Thread Peter Becker
Hello, we released Docco 0.3 along with two updates for its plugins. Docco is a personal document retrieval tool based on Apache's Lucene indexing engine and Formal Concept Analysis. It allows you to create an index for files on your file system which you can then search for keywords. It can

Re: ANN: Docco 0.3

2004-04-13 Thread Peter Becker
/... cheers, sv On Wed, 14 Apr 2004, Peter Becker wrote: Hello, we released Docco 0.3 along with two updates for its plugins. Docco is a personal document retrieval tool based on Apache's Lucene indexing engine and Formal Concept Analysis. It allows you to create an index for files on your

Re: use Lucene LOCAL (looking for a frontend)

2004-01-29 Thread Peter Becker
Hi Sebastian, there are not too many Lucene features used, and some rather orthogonal mixin of Formal Concept Analysis, but let me still advertise our little Docco tool: http://tockit.sourceforge.net/docco/index.html It is based on Lucene, comes with a couple of indexing tools (including

Re: Lucene demo ideas?

2003-09-18 Thread Peter Becker
Erik Hatcher wrote: [...] - Index text and HTML files. Any others? I don't want to get into putting too many dependencies in though - let's keep it relatively simple, although still demonstrative. Allow search filtering by last modified date range and document type (extension). If I may

Re: HTML Parsing problems...

2003-09-18 Thread Peter Becker
Tatu Saloranta wrote: On Thursday 18 September 2003 14:50, Michael Giles wrote: I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but I also know that it is updated from time to time and performs much better than the other ones that I have tested. Frustratingly, the very

Re: Similar Document Search

2003-08-25 Thread Peter Becker
Brian Mila wrote: amounts). I failed to find a way to get Lucene to give me this information without hacking this or that. Considering the attention IR Excuse me if this is off-topic, but isn't hacking the code what open source software is all about? Not always, but quite often :-) I

Re: Similar Document Search

2003-08-21 Thread Peter Becker
- From: Peter Becker [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 21, 2003 1:37 AM Subject: Re: Similar Document Search Hi all, it seems there are quite a few people looking for similar features, i.e. (a) document identity and (b) forward indexing. So far we

Re: Similar Document Search

2003-08-20 Thread Peter Becker
... Best regards, Gregor -Original Message- From: Peter Becker [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 19, 2003 3:06 AM To: Lucene Users List Subject: Re: Similar Document Search Hi Terry, we have been thinking about the same problem and in the end we decided that most likely

Re: Similar Document Search

2003-08-18 Thread Peter Becker
Hi Terry, we have been thinking about the same problem and in the end we decided that most likely the only good solution to this is to keep a non-inverted index, i.e. a map from the documents to the terms. Then you can query the most terms for the documents and query other documents matching
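The non-inverted index described above (a map from documents to their terms) could be sketched roughly as follows. This is only an illustration in plain Java, not the code from the thread: the class name `ForwardIndex`, its methods, and the simple tokenizer are all assumptions, and a real setup would presumably reuse the same analyzer that feeds the Lucene index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical forward (non-inverted) index: document id -> term -> count. */
public class ForwardIndex {
    private final Map<String, Map<String, Integer>> termsByDoc = new HashMap<>();

    /** Splits the text on non-letter/non-digit characters and records term counts. */
    public void add(String docId, String text) {
        Map<String, Integer> counts = termsByDoc.computeIfAbsent(docId, k -> new HashMap<>());
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    /** Returns up to n terms of the document, most frequent first, ties in alphabetical order. */
    public List<String> topTerms(String docId, int n) {
        Map<String, Integer> counts = termsByDoc.getOrDefault(docId, new HashMap<>());
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }
}
```

The terms returned by `topTerms` could then be used to build a follow-up query against the regular inverted index. Ranking by raw frequency is a simplification; some weighting against common words would likely be needed in practice.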

Re: parallel index building searching multiple indexes

2003-08-14 Thread Peter Becker
Kevin A. Burton wrote: Killeen, Tom wrote: I am attempting to create approx 10 different Lucene indexes. I'm trying to create them at the same time by running multiple processes and each index is written to a new directory. Once I create more than one process - the performance is very, very

Re: parallel index building searching multiple indexes

2003-08-14 Thread Peter Becker
Hi Tom, Killeen, Tom wrote: I am attempting to create approx 10 different Lucene indexes. I'm trying to create them at the same time by running multiple processes and each index is written to a new directory. Once I create more than one process - the performance is very, very slow. As Otis

Re: Luke v 0.2 - Lucene Index Browser

2003-08-12 Thread Peter Becker
Andrzej Bialecki wrote: [...Luke feature requests...] open the original Documents with the platform-dependent mimetype viewer Someone else already explained the problems with this... What is a document in Lucene? It's a set of fields and their String values (or their terms), so it's not

Re: Free, medium size, downloadable corpus of newspaper articles?

2003-07-30 Thread Peter Becker
[redirected to lucene-user] Me, too! :-) We are currently playing with the small Reuters collection (about 21,500 news items from the 80s), but I don't know if I am allowed to distribute it and it is too small anyway -- many of the implications we find are based on 1 to 3 documents. I still

Term frequency on result sets

2003-07-29 Thread Peter Becker
Hi all, I am interested in comparing different query result sets in terms of term frequency. Questions I'd like to answer are: - what are the N most common terms in a result set? - how often does term X occur in a certain result set? The second one is of course easy to do with a boolean query,
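To make the first question concrete, here is a minimal sketch that counts terms over a result set outside of Lucene. It assumes the text of each hit is available as a plain string (for example from a stored field), and it reads "how often" as the number of documents in the set that contain the term. The class `ResultSetTerms` and its methods are hypothetical names, not part of any Lucene API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper that compares result sets by the terms they contain. */
public class ResultSetTerms {

    /** For each term, counts how many texts of the result set contain it at least once. */
    public static Map<String, Integer> documentFrequencies(List<String> resultTexts) {
        Map<String, Integer> df = new TreeMap<>();
        for (String text : resultTexts) {
            Set<String> seen = new HashSet<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                // seen.add returns false for repeats, so each text counts a term once
                if (!token.isEmpty() && seen.add(token)) {
                    df.merge(token, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    /** Returns up to n terms, the ones found in the most texts first. */
    public static List<String> mostCommon(List<String> resultTexts, int n) {
        Map<String, Integer> df = documentFrequencies(resultTexts);
        // TreeMap keys are alphabetical; the stable sort keeps that order for ties
        List<String> terms = new ArrayList<>(df.keySet());
        terms.sort(Comparator.comparing((String t) -> df.get(t)).reversed());
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }
}
```

This re-tokenizes every hit, so it only suits small result sets; it says nothing about how to obtain the same numbers from the index itself.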

Re: Indexing very large sets (10 million docs)

2003-07-28 Thread Peter Becker
Roger Ford wrote: [...index size troubles...] Believe it or not, this 10 million documents was meant to be a single partition of a much larger dataset. I'm not sure I'm at liberty to discuss in detail the data I'm indexing - but it's a massive genealogical database. Roger, maybe your data type

Re: Spanish analyzer and Indexing StarOffice docs

2003-07-22 Thread Peter Becker
and StarOffice 6, please correct me if this isn't true), so I'd like to know if there is already any way to index files of version 5.2 and below, or any clue on how to do this, Thank you in advance, Oscar Herrera Bogotá, Colombia, SA. - Original Message - From: Peter Becker [EMAIL PROTECTED

Re: Spanish analyzer and Indexing StarOffice docs

2003-07-21 Thread Peter Becker
Hi Oscar, we have been looking into the StarOffice/OpenOffice problem, although we haven't done it and probably won't anytime soon as we have to move on to other things. I see two approaches, both with variants: (1) use the fact that it is just zipped XML: use a ZipInputStream to open the
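Approach (1) might look roughly like the sketch below. It relies on the fact that OpenOffice documents are zip archives whose main text lives in an entry named `content.xml`. Everything else is an assumption made for illustration: the class `OpenOfficeText`, its method names, the UTF-8 decoding, and the crude tag stripping (which does not decode XML entities and would be better replaced by a real XML parser).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper that pulls indexable text out of an OpenOffice file. */
public class OpenOfficeText {

    /** Returns the raw XML of the content.xml entry, or null if the archive has none. */
    public static String readContentXml(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int read;
                    // read() returns -1 at the end of the current entry
                    while ((read = zip.read(chunk)) != -1) {
                        buffer.write(chunk, 0, read);
                    }
                    // Assumption: the entry is UTF-8 encoded
                    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    /** Crude text extraction: replaces tags with spaces and collapses whitespace. */
    public static String stripTags(String xml) {
        return xml.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}
```

The resulting string could be handed to whatever code builds the Lucene document for the file.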

Doing it all backwards

2003-07-15 Thread Peter Becker
Hi, is there any way to get the keywords for certain fields in a document easily? The situation is that I have small sets of documents coming back from queries and I want to compare those in terms of similarity. The questions are: what are the common terms within each set and what are the terms

Re: about PDF / HTML index

2003-07-15 Thread Peter Becker
Hi Alvaro, there are some examples in our code here -- working with a slightly similar interface to the Ant task in the Lucene contributions. http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/indexer/documenthandler/ The actual step of turning it into a