Re: Parsing .ppt

2004-11-15 Thread Magnus Johansson
There's some code using POI at http://www.mail-archive.com/poi-user@jakarta.apache.org/msg04809.html /magnus Luke Shannon wrote: Hey All; Anyone know a good API for parsing MS PowerPoint files? Luke

Re: Power Point Processing

2004-09-24 Thread Magnus Johansson
I've had some success with the code found at http://www.mail-archive.com/[EMAIL PROTECTED]/msg04809.html together with POI. Then there's OpenOffice, but I don't really think it is usable in a production environment. /Magnus Johansson > Hi, > > Does anyone know a go

Re: Analyzing and Querying

2004-08-06 Thread Magnus Johansson
us Daniel Naber wrote: On Friday 06 August 2004 13:28, Magnus Johansson wrote: Splitting compound words can be done quite effectively simply by using a large wordlist. I have done this for Swedish. It is, however, difficult to get right for German. On the one hand there are compounds in G

Re: Analyzing and Querying

2004-08-06 Thread Magnus Johansson
You could create a custom analyzer that splits compound words into their parts. That is, applying the analyzer to the word "bergbahn" would yield the terms "berg" and "bahn". Splitting compound words can be done quite effectively simply by using a large wordlist. I have done this for Swedish. /magnus T
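
The wordlist approach described above can be sketched roughly as follows. This is only a minimal illustration, not the poster's analyzer: the class name `CompoundSplitter`, the tiny dictionary, and the minimum part length of 3 are all assumptions, and it ignores linking elements that real compounds often contain. A word is split only if it can be covered completely by dictionary entries, trying the longest candidate first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: splits a compound using a plain wordlist.
class CompoundSplitter {
    private final Set<String> words;   // the wordlist (assumed lower case)
    private final int minPartLength;   // assumption: shorter parts are ignored

    CompoundSplitter(Set<String> words, int minPartLength) {
        this.words = words;
        this.minPartLength = minPartLength;
    }

    // Returns the parts when the word can be covered entirely by wordlist
    // entries (two or more parts); otherwise returns the word unchanged.
    List<String> split(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        if (splitFrom(lower, 0, parts) && parts.size() > 1) {
            return parts;
        }
        return Collections.singletonList(lower);
    }

    // Backtracking search: try the longest candidate first, then shorter ones.
    private boolean splitFrom(String s, int start, List<String> parts) {
        if (start == s.length()) {
            return true;
        }
        for (int end = s.length(); end >= start + minPartLength; end--) {
            String candidate = s.substring(start, end);
            if (words.contains(candidate)) {
                parts.add(candidate);
                if (splitFrom(s, end, parts)) {
                    return true;
                }
                parts.remove(parts.size() - 1); // dead end, undo and retry
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> wordlist = new HashSet<>(Arrays.asList("berg", "bahn", "hof"));
        CompoundSplitter splitter = new CompoundSplitter(wordlist, 3);
        System.out.println(splitter.split("bergbahn")); // prints [berg, bahn]
    }
}
```

In a Lucene analyzer, the parts returned here would be emitted as separate terms by a token filter; that wiring is left out because it depends on the Lucene version in use.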

Re: Bridge with OpenOffice

2004-04-18 Thread Magnus Johansson
Yes, I have tried it and it seems to work OK. I haven't really used it in a production environment, however. There was some code at http://www.gzlinux.org/docs/category/dev/java/doc2txt.pdf but it is no longer there. A Google HTML version is still available at http://66.102.9.104/search?q

Re: Google search algorithm

2004-01-29 Thread Magnus Johansson
be at a particular page after an infinite time using random browsing according to the probabilities found. This probability is then used as a basis for ranking results. Magnus Johansson > We all know Lucene algorithm (thanks to open source:). > Anybody has a general idea of how Google
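
The random-browsing idea in this message can be illustrated with a small power-iteration sketch. This is the textbook "random surfer" formulation, not a description of what Google actually runs; the class name `RandomSurfer`, the damping factor of 0.85, and the toy link graph are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical illustration of ranking by long-run visit probability.
class RandomSurfer {

    // links[i] lists the pages that page i links to.
    // Returns, for each page, the approximate probability of being on that
    // page after many steps of random browsing.
    static double[] rank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // probability mass on pages without out-links

            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                    continue;
                }
                double share = rank[i] / links[i].length;
                for (int target : links[i]) {
                    next[target] += share;
                }
            }
            for (int i = 0; i < n; i++) {
                // With probability (1 - damping) the surfer jumps to a random
                // page; mass from dangling pages is spread evenly as well.
                next[i] = (1.0 - damping) / n + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
        int[][] links = { {1, 2}, {2}, {0} };
        System.out.println(Arrays.toString(rank(links, 0.85, 50)));
    }
}
```

In the toy graph, page 2 receives links from both other pages and ends up with a higher probability than page 1, which is linked only from page 0.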

Re: understanding IR topics on this list [was: Re: Vector Space Model in Lucene?]

2003-11-16 Thread Magnus Johansson
I would also like to recommend "Modern Information Retrieval" by Ricardo Baeza-Yates /magnus Gerret Apelt writes: Dror -- I just completed an introductory course in IR. I can recommend the textbook we used: "Managing Gigabytes: Compressing and Indexing Documents and Images". When I don't

Re: Similar Document Search

2003-08-19 Thread Magnus Johansson
unless you can keep the documents in memory somehow. Storing the other/non-inverted/normal/whatever index would be expensive for indexing, but querying should be a lot faster than having to re-index documents. That is in our situation preferable. Peter Magnus Johansson wrote: Hi Peter If t

Re: Similar Document Search

2003-08-19 Thread Magnus Johansson
Ok, here it is. It's part of a JSP that prints out all keywords in a document. /magnus <%@ page import="org.apache.lucene.index.IndexReader, org.apache.lucene.document.Document, com.technohuman.search.language.SwedishAnalyzer, java.io.StringReader,

Re: Similar Document Search

2003-08-19 Thread Magnus Johansson
Hi Peter, If the original document is available, you could extract keywords from the document at query time. That is, when someone asks for documents similar to document a, you re-analyze document a and, in combination with statistics from the Lucene index, extract keywords from document a that
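
One way to picture the query-time keyword extraction described here is the sketch below. It is an assumption-laden illustration rather than the poster's code: the class name `KeywordExtractor` is hypothetical, the TF-IDF style weighting is a guess at what "statistics from the Lucene index" might be used for, and the `docFreq` map and `numDocs` value stand in for numbers that would normally be read from the index (for example via `IndexReader.docFreq(Term)` and `IndexReader.numDocs()`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: picks the most distinctive terms of one document.
class KeywordExtractor {

    // tokens:  the re-analyzed terms of the source document
    // docFreq: number of indexed documents containing each term (assumed input)
    // numDocs: total number of documents in the index (assumed input)
    // k:       how many keywords to return
    static List<String> topKeywords(List<String> tokens,
                                    Map<String, Integer> docFreq,
                                    int numDocs,
                                    int k) {
        // Term frequency within the document.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }

        // Score = term frequency * smoothed inverse document frequency.
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            score.put(e.getKey(), e.getValue() * idf);
        }

        // Highest score first; ties are broken alphabetically.
        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("lucene", "index", "the", "the", "lucene", "search");
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("the", 1000);
        docFreq.put("lucene", 10);
        docFreq.put("index", 50);
        docFreq.put("search", 200);
        System.out.println(topKeywords(tokens, docFreq, 1000, 2)); // prints [lucene, index]
    }
}
```

The extracted keywords could then be combined into a query against the index to find similar documents, which avoids storing a second, non-inverted index at the cost of re-analyzing the source document per request.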

Re: QueryParser and compound words

2003-03-12 Thread Magnus Johansson
Tatu Saloranta wrote: On Wednesday 12 March 2003 01:19, Magnus Johansson wrote: Well, the problem arises when a user enters a query with a compound word and the compound word itself is not indexed, only one of its parts. Yes, but neither is compound word itself ever used in query either

Re: QueryParser and compound words

2003-03-12 Thread Magnus Johansson
e with you that this might not be a problem. The user could be instructed to reformulate his query. However, the behaviour for an English index and a Swedish index would be different. /magnus Tatu Saloranta wrote: On Tuesday 11 March 2003 03:05, Magnus Johansson wrote: Hello I have written

QueryParser and compound words

2003-03-11 Thread Magnus Johansson
Hello, I have written an Analyzer for Swedish. Compound words are common in Swedish, therefore my Analyzer tries to split compound words into their parts. For example, the Swedish word fotbollsmatch (football game) is split into fotboll and match. However when I use my Analyzer with the QueryPar