There's some code using POI at
http://www.mail-archive.com/poi-user@jakarta.apache.org/msg04809.html
/magnus
Luke Shannon wrote:
Hey All;
Anyone know a good API for parsing MS powerpoint files?
Luke
I've had some success with the code found at
http://www.mail-archive.com/poi-user@jakarta.apache.org/msg04809.html
together with POI.
Then there's OpenOffice, but I don't really think it is usable
in a production environment
/Magnus Johansson
> Hi,
>
> Does anyone know a go
Daniel Naber wrote:
On Friday 06 August 2004 13:28, Magnus Johansson wrote:
Splitting compound words can be done quite effectively simply by using
a large wordlist. I have done this for Swedish.
It is, however, difficult to get right for German. On the one hand there are
compounds in G
You could create a custom analyzer that splits compound words into their
parts. That is, applying the analyzer to the word "bergbahn" would yield
the terms "berg" and "bahn".
Splitting compound words can be done quite effectively simply by using
a large wordlist. I have done this for Swedish.
/magnus
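In case it helps, a wordlist-based splitter along the lines described above could look roughly like this. This is a minimal sketch, not the actual analyzer's code: the class name, the greedy longest-match strategy, and the handling of the Swedish linking "s" are my assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Minimal sketch of wordlist-based compound splitting.
// Class name and strategy are illustrative, not from the original code.
public class Decompounder {
    private final Set<String> wordlist;
    private final int minPartLength;

    public Decompounder(Set<String> wordlist, int minPartLength) {
        this.wordlist = wordlist;
        this.minPartLength = minPartLength;
    }

    // Greedily split a token into dictionary words; return the token
    // unchanged (as a single-element list) if no full split is found.
    public List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < token.length()) {
            int end = -1;
            // Prefer the longest dictionary word starting at pos.
            for (int i = token.length(); i >= pos + minPartLength; i--) {
                if (wordlist.contains(token.substring(pos, i))) {
                    end = i;
                    break;
                }
            }
            if (end == -1) {
                // Swedish compounds often join parts with a linking "s"
                // (fotboll + s + match); skip it and retry once.
                if (token.charAt(pos) == 's') {
                    pos++;
                    continue;
                }
                return List.of(token); // no clean split, keep whole word
            }
            parts.add(token.substring(pos, end));
            pos = end;
        }
        return parts;
    }
}
```

Applied to "fotbollsmatch" with a wordlist containing "fotboll" and "match", this yields [fotboll, match]; words with no dictionary decomposition pass through unchanged.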
Yes I have tried it and it seems to work ok.
I haven't really used it in a production environment
however.
There was some code here
http://www.gzlinux.org/docs/category/dev/java/doc2txt.pdf
It is, however, not there anymore; the Google HTML version is
available at
http://66.102.9.104/search?q
be at a particular page
after an infinite time using random browsing according
to the probabilities found.
This probability is then used as a basis for ranking
results.
Magnus Johansson
> We all know Lucene's algorithm (thanks to open source :).
> Does anybody have a general idea of how Google
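The random-surfer model described above is PageRank. A toy power-iteration sketch is below; this is hypothetical illustration, not Google's implementation, and the 0.85 damping factor is just the commonly cited value.

```java
// Illustrative power-iteration PageRank for a tiny link graph.
// Just the random-surfer model: pr[i] converges to the probability
// of being at page i after infinitely long random browsing.
public class PageRank {
    // links[i] = indices of the pages that page i links to
    public static double[] rank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n); // start at a uniformly random page
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1.0 - damping) / n); // random jump
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // dangling page: redistribute its mass evenly
                    for (int j = 0; j < n; j++) next[j] += damping * pr[i] / n;
                } else {
                    for (int j : links[i]) next[j] += damping * pr[i] / links[i].length;
                }
            }
            pr = next;
        }
        return pr;
    }
}
```

The resulting probabilities sum to 1 and are then used as a basis for ranking results, as described above.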
I would also like to recommend "Modern Information Retrieval"
by Ricardo Baeza-Yates
/magnus
Gerret Apelt writes:
Dror --
I just completed an introductory course in IR. I can recommend the
textbook we used: "Managing Gigabytes: Compressing and Indexing Documents
and Images". When I don't
unless you can keep
the documents in memory somehow.
Storing the other/non-inverted/normal/whatever index would be
expensive for indexing, but querying should be a lot faster than
having to re-index documents. That is preferable in our situation.
Peter
Magnus Johansson wrote:
Hi Peter
If t
Ok, here it is. It's part of a JSP that prints out all keywords in a
document.
/magnus
<%@ page import="org.apache.lucene.index.IndexReader,
org.apache.lucene.document.Document,
com.technohuman.search.language.SwedishAnalyzer,
java.io.StringReader,
Hi Peter
If the original document is available, you could extract keywords from
the document at query time. That is, when someone asks for documents
similar to document a, you re-analyze document a and, in combination
with statistics from the Lucene index, extract keywords from document
a that
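A rough sketch of that idea, assuming you already have term frequencies from re-analyzing document a and document frequencies from the index: plain tf-idf scoring, no actual Lucene API calls, and all names here are illustrative (in a real setup the document frequencies would come from IndexReader).

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of query-time keyword extraction by tf-idf.
// termFreqs: counts from re-analyzing document a.
// docFreqs/numDocs: collection statistics from the index.
public class KeywordExtractor {
    public static List<String> topKeywords(Map<String, Integer> termFreqs,
                                           Map<String, Integer> docFreqs,
                                           int numDocs, int k) {
        return termFreqs.keySet().stream()
            // score = tf * log(N / (1 + df)); sort descending by score
            .sorted(Comparator.comparingDouble((String t) ->
                -termFreqs.get(t) * Math.log((double) numDocs
                    / (1 + docFreqs.getOrDefault(t, 0)))))
            .limit(k)
            .toList();
    }
}
```

Terms frequent in document a but rare in the collection score highest; very common words (which occur in almost every document) get a near-zero idf and drop out, so the surviving terms make a reasonable "more like this" query.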
Tatu Saloranta wrote:
On Wednesday 12 March 2003 01:19, Magnus Johansson wrote:
Well, the problem arise when a user enters a query with a compound word
and the compound word itself is not indexed, only one of its parts.
Yes, but neither is the compound word itself ever used in a query either
I agree with you that this might not be a problem. The user could be
instructed to reformulate his query. However, the behaviour for an
English index and a Swedish index would be different.
/magnus
Tatu Saloranta wrote:
On Tuesday 11 March 2003 03:05, Magnus Johansson wrote:
Hello
I have written
Hello
I have written an Analyzer for Swedish. Compound words are common in
Swedish, therefore my Analyzer tries to split the compound words
into their parts. For example the Swedish word fotbollsmatch (football
game) is split into fotboll and match.
However when I use my Analyzer with the QueryPar