Re: Filters for Openoffice File Indexing available (Java)

2004-11-27 Thread Peter Becker
Joachim Arrasz wrote: Hello List. we have written an application which includes OpenOffice Integration into an OpenSource CMS (OpenCms). For this CMS there is a Lucene Integration available under sourceforge. So now we are looking for search and index Filters for Lucene, that we're able to inte

Re: Zilverline release candidate 1.0-rc3 available

2004-06-08 Thread Peter Becker
Hi Michael, I wonder if you would be interested in cooperating on the extracting/index management bit. We use Lucene and our own extractor plugins for a Swing-application: http://tockit.sf.net/docco Code can be found here: http://cvs.sourceforge.net/viewcvs.py/toscanaj/docco/ It is BSD-Style l

Re: Bridge with OpenOffice

2004-04-19 Thread Peter Becker
We did a simple one a while ago. Could probably be a bit more sophisticated, but it seems to do its job on the little bit of testing we did. See http://cvs.sourceforge.net/viewcvs.py/toscanaj/docco/source/org/tockit/docco/documenthandler/OpenOfficeDocumentHandler.java?rev=1.4&view=auto HTH, Pe

Re: ANN: Docco 0.3

2004-04-13 Thread Peter Becker
rano-gnome/... cheers, sv On Wed, 14 Apr 2004, Peter Becker wrote: Hello, we released Docco 0.3 along with two updates for its plugins. Docco is a personal document retrieval tool based on Apache's Lucene indexing engine and Formal Concept Analysis. It allows you to create an index fo

ANN: Docco 0.3

2004-04-13 Thread Peter Becker
Hello, we released Docco 0.3 along with two updates for its plugins. Docco is a personal document retrieval tool based on Apache's Lucene indexing engine and Formal Concept Analysis. It allows you to create an index for files on your file system which you can then search for keywords. It can i

Re: use Lucene LOCAL (looking for a frontend)

2004-01-29 Thread Peter Becker
Hi Sebastian, there are not too many Lucene features used, and some rather orthogonal mixin of Formal Concept Analysis, but let me still advertise our little Docco tool: http://tockit.sourceforge.net/docco/index.html It is based on Lucene, comes with a couple of indexing tools (including HTM

Re: HTML Parsing problems...

2003-09-18 Thread Peter Becker
Tatu Saloranta wrote: On Thursday 18 September 2003 14:50, Michael Giles wrote: I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but I also know that it is updated from time to time and performs much better than the other ones that I have tested. Frustratingly, the very

Re: Lucene demo ideas?

2003-09-18 Thread Peter Becker
Erik Hatcher wrote: [...] - Index text and HTML files. Any others? I don't want to get into putting too many dependencies in though - let's keep it relatively simple, although still demonstrative. Allow search filtering by last modified date range and document type (extension). If I may pl

Re: Docco 0.2 / contribution offer

2003-09-02 Thread Peter Becker
ve you a reasonable estimate of the effort involved. Cheers, Peter Best, Gregor -----Original Message- From: Peter Becker [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 02, 2003 1:52 PM To: Lucene Users List Subject: ANN: Docco 0.2 / contribution offer Hi all, we finally finished the

ANN: Docco 0.2 / contribution offer

2003-09-02 Thread Peter Becker
Hi all, we finally finished the 0.2 release of our little personal document management tool based on Lucene: http://tockit.sourceforge.net/docco/index.html This might be interesting for some readers of this list since its source contains some infrastructure for document handlers and index man

Re: Similar Document Search

2003-08-25 Thread Peter Becker
Brian Mila wrote: amounts). I failed to find a way to get Lucene to give me this information without hacking this or that. Considering the attention IR Excuse me if this is off-topic, but isn't hacking the code what open source software is all about? Not always, but quite often :-) I mean

Re: Similar Document Search

2003-08-21 Thread Peter Becker
hat end? Regards, Terry - Original Message - From: "Peter Becker" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Thursday, August 21, 2003 1:37 AM Subject: Re: Similar Document Search Hi all, it seems there are quite a few people l

Re: Similar Document Search

2003-08-20 Thread Peter Becker
yet, therefore no code at this time... Best regards, Gregor -----Original Message- From: Peter Becker [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 19, 2003 3:06 AM To: Lucene Users List Subject: Re: Similar Document Search Hi Terry, we have been thinking about the same problem and in the

Re: Similar Document Search

2003-08-19 Thread Peter Becker
le code if anyone is interested. /magnus Peter Becker wrote: Hi Terry, we have been thinking about the same problem and in the end we decided that most likely the only good solution to this is to keep a non-inverted index, i.e. a map from the documents to the terms. Then you can query the most terms f

Re: Similar Document Search

2003-08-18 Thread Peter Becker
Hi Terry, we have been thinking about the same problem and in the end we decided that most likely the only good solution to this is to keep a non-inverted index, i.e. a map from the documents to the terms. Then you can query the most terms for the documents and query other documents matching p
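The non-inverted index described above (a map from documents to their terms) could be sketched roughly as follows. This is a minimal illustration in plain Java, not the code the poster used: the class name `ForwardIndex`, its methods, and the naive whitespace/punctuation tokenizer are all assumptions, and a real setup would presumably reuse the same analyzer as the Lucene index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a non-inverted ("forward") index: document id -> term counts.
final class ForwardIndex {
    private final Map<String, Map<String, Integer>> termsByDoc = new HashMap<>();

    // Record the terms of one document. Tokenization here is a naive split on
    // non-word characters, which is an assumption made for the example.
    void add(String docId, String text) {
        Map<String, Integer> counts = termsByDoc.computeIfAbsent(docId, k -> new HashMap<>());
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    // Return up to n terms of the document, most frequent first (ties broken alphabetically).
    List<String> topTerms(String docId, int n) {
        Map<String, Integer> counts =
                termsByDoc.getOrDefault(docId, Collections.<String, Integer>emptyMap());
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }
}
```

The terms returned by `topTerms` could then be combined into an ordinary query against the inverted index to find other documents that share them.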

Re: parallel index building & searching multiple indexes

2003-08-14 Thread Peter Becker
Hi Tom, Killeen, Tom wrote: I am attempting to create approx 10 different Lucene indexes. I'm trying to create them at the same time by running multiple processes and each index is written to a new directory. Once I create more than one process - the performance is very, very slow. As Otis s

Re: parallel index building & searching multiple indexes

2003-08-14 Thread Peter Becker
Kevin A. Burton wrote: Killeen, Tom wrote: I am attempting to create approx 10 different Lucene indexes. I'm trying to create them at the same time by running multiple processes and each index is written to a new directory. Once I create more than one process - the performance is very, very s

Re: Luke v 0.2 - Lucene Index Browser

2003-08-12 Thread Peter Becker
Andrzej Bialecki wrote: [...Luke feature requests...] open the original documents with the platform-dependent mimetype viewer Someone else already explained the problems with this... What is a document in Lucene? It's a set of fields and their String values (or their terms), so it's not possi

Re: Free, medium size, downloadable corpus of newspaper articles?

2003-07-30 Thread Peter Becker
[redirected to lucene-user] Me, too! :-) We are currently playing with the small Reuters collection (about 21.500 news items from the 80s), but I don't know if I am allowed to distribute it and it is too small anyway -- many of the implications we find are based on 1 to 3 documents. I still ha

Term frequency on result sets

2003-07-29 Thread Peter Becker
Hi all, I am interested in comparing different query result sets in terms of term frequency. Questions I'd like to answer are: - what are the N most common terms in a result set? - how often does term X occur in a certain result set? The second one is of course easy to do with a boolean query, bu
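One way to picture the two questions is a brute-force count over the texts of the documents in a result set. The sketch below is plain Java and does not use any Lucene API; the class `ResultSetTerms`, its method names, and the simple tokenizer are assumptions for illustration, and it presumes the text of each hit is available to the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: term statistics over the texts of one result set.
final class ResultSetTerms {

    // For each term, count how many documents of the result set contain it.
    static Map<String, Integer> documentFrequencies(List<String> resultTexts) {
        Map<String, Integer> df = new HashMap<>();
        for (String text : resultTexts) {
            // A set, so each term is counted at most once per document.
            Set<String> seen = new HashSet<>(
                    Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
            seen.remove("");
            for (String term : seen) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // The n most common terms, most frequent first (ties broken alphabetically).
    static List<String> mostCommon(Map<String, Integer> df, int n) {
        List<String> terms = new ArrayList<>(df.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(df.get(b), df.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }
}
```

Looking up a single term in the returned map answers "how often does term X occur", counted here as the number of matching documents rather than total occurrences.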

Re: Indexing very large sets (10 million docs)

2003-07-28 Thread Peter Becker
Roger Ford wrote: [...index size troubles...] Believe it or not, this 10 million documents was meant to be a single partition of a much larger dataset. I'm not sure I'm at liberty to discuss in detail the data I'm indexing - but it's a massive genealogical database. Roger, maybe your data type is

Re: Spanish analyzer and Indexing StarOffice docs

2003-07-22 Thread Peter Becker
StarOffice (OpenOffice and StarOffice 6, please correct me if this isn't true), so I'd like to know if there is already any way to index files of version 5.2 and below, or any clue on how to do this, Thank you in advance, Oscar Herrera Bogotá, Colombia, SA. - Original Message - Fr

Re: Spanish analyzer and Indexing StarOffice docs

2003-07-21 Thread Peter Becker
Hi Oscar, we have been looking into the StarOffice/OpenOffice problem, although we haven't done it and probably won't anytime soon as we have to move on to other things. I see two approaches, both with variants: (1) use the fact that it is just zipped XML: use a ZipInputStream to open the file
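The first approach (treating the file as zipped XML and opening it with a `ZipInputStream`) might look roughly like the sketch below. It is an assumption-laden illustration, not the handler from the Docco sources: the class name `OpenOfficeText` is made up, it reads only the `content.xml` entry, and it simply concatenates all character data without regard to document structure.

```java
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: pull the plain text out of a zipped-XML office document.
final class OpenOfficeText {

    // Scan the zip for content.xml and return its character data as one string.
    static String extract(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    return parseText(zip);
                }
            }
        }
        return ""; // no content.xml found
    }

    private static String parseText(InputStream xml) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(xml, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                text.append(' '); // keep words of adjacent elements apart
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Skip external DTDs, which are not shipped inside the zip file.
                return new InputSource(new StringReader(""));
            }
        });
        return text.toString().trim().replaceAll("\\s+", " ");
    }
}
```

The returned string could then be added to a Lucene document as a text field; metadata such as title or author would need separate handling.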

Re: about PDF / HTML index

2003-07-15 Thread Peter Becker
Hi Alvaro, there are some examples in our code here -- working with a slightly similar interface to the Ant task in the Lucene contributions. http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/indexer/documenthandler/ The actual step of turning it into a Luc

Doing it all backwards

2003-07-15 Thread Peter Becker
Hi, is there any way to get the keywords for certain fields in document easily? The situation is that I have small sets of documents coming back from queries and I want to compare those in terms of similarity. The questions are: what are the common terms within each set and what are the terms