Re: What is the best file system for Lucene?

2004-11-30 Thread Pete Lewis
Hi Sanyi

Could you try XP on your desktop?  That would take some variables out.  The
problem is that you are comparing operating systems, as well as filesystems,
as well as different hardware configs.

Also, unless you turn hyperthreading off, with just one index you are
searching with just one half of the CPU - so your desktop is effectively using
a 1.5GHz CPU for the search.  Taking this into account, it's not too
surprising that they are searching at comparable speeds.

HTH
Pete

- Original Message - 
From: "Sanyi" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 30, 2004 11:28 AM
Subject: Re: What is the best file system for Lucene?


> > Interesting, what are your merge settings
>
> Sorry, I didn't mention that I was talking about search performance.
> I'm using the same, fully optimized index on both systems.
> (I've generated both indexes with the same code from the same database on
> the actual OS)
>
> > which JDK are you using?
>
> I'm using the same Sun JDK on both systems.
> I've tried so far:
> j2sdk1.4.2_04 _05 and _06.
> I didn't notice speed differences between these subversions.
> Do you know about significant speed differences between them I should
> notice?
>
> > Have you tried with hyperthreading turned off on #2?
>
> No, but I will try it if the problem isn't in the file system.
> I hope that the reason for the slowness is reiserfs, because it is the easiest
> to change.
>
> What file systems are you people using Lucene on? And what are your
> experiences?
>
> Regards,
> Sanyi
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene scalability, performance

2004-11-15 Thread Pete Lewis
Hi Venkat

If you want to go against just html pages (maybe with Dublin Core tags) then
Swish-e isn't too bad, but it won't be as portable as Lucene, plus it doesn't
seem to be nearly as active on the development side as Lucene (so you'll
get less support in the event of problems).  Swish-e seems easy to install and
to index against just html pages.

If you want something a bit more capable, or if you are going against
structured text files / databases / etc then choose Lucene every time.

On performance, you'll find that Java Lucene will compare well with C / Perl
Swish-e, and for your volumes there shouldn't be any problems with Lucene.

Cheers

Pete

- Original Message - 
From: "Venkatraju" <[EMAIL PROTECTED]>
To: "lucene-user" <[EMAIL PROTECTED]>
Sent: Monday, November 15, 2004 12:36 PM
Subject: Lucene scalability, performance


> Hi,
>
> I plan to use Lucene in a project. The data to be indexed could get
> pretty large (few GBs, with average document size ~ 10-30KB). I have
> seen the numbers on the Benchmarks page - but I wanted to hear if
> others had something to add. Any tips/hints on keeping the searches
> fast enough with such large indexes? Any caveats that I must be wary
> of?
>
> Also, has someone evaluated Swish-e vs. Lucene in terms of performance
> and scalability?
>
> Thanks in advance.
> Venkat



Re: Stemming Oddness

2004-11-06 Thread Pete Lewis
Hi Yousef

You are not doing anything wrong - it's just how the Porter stemmer works!

The problem with Porter is that it tries to do everything in a purely algorithmic way 
- which doesn't cater for irregular conjugations etc.

Don't worry too much though: as long as you do the same stemming on the query string
as you did while indexing, you should be able to find what you are looking for,
although you can have some issues with trailing wildcards.

If you want a better stemmer, look for something that has a dictionary as well as
algorithmic rules - a quick one that is readily available is Kstem which, while not
perfect, I think is quite a bit better than Porter.

You can get the source code (Kstem.jar) from the following website:

http://ciir.cs.umass.edu/downloads/

For more info on Kstem see the paper by its designer Bob Krovetz at:

http://ciir.cs.umass.edu/pubfiles/ir-35.pdf
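
To illustrate the general idea of combining a word list with suffix rules, here
is a minimal sketch.  It is not the Kstem algorithm and not Lucene code: the
class name, the tiny word list and the suffix list are all made up for the
example.  The point is only that a suffix is stripped when the result is a
known word, so "elephant" is left alone.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy stemmer: a suffix strip is accepted only if it yields a known word. */
final class LexiconStemmerSketch {
    // Hypothetical word list; a real stemmer would ship a full dictionary.
    private static final Set<String> LEXICON =
            new HashSet<>(Arrays.asList("elephant", "buy", "like", "print", "printer"));

    // Suffixes tried in order, longest first (illustrative, not exhaustive).
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    static String stem(String word) {
        String w = word.toLowerCase();
        if (LEXICON.contains(w)) {
            return w; // already a dictionary word: leave it alone
        }
        for (String suffix : SUFFIXES) {
            if (w.length() > suffix.length() + 1 && w.endsWith(suffix)) {
                String base = w.substring(0, w.length() - suffix.length());
                if (LEXICON.contains(base)) {
                    return base;            // e.g. buying -> buy
                }
                if (LEXICON.contains(base + "e")) {
                    return base + "e";      // e.g. liked -> like
                }
            }
        }
        return w; // unknown word: return unchanged rather than guess
    }
}
```

With this word list, stem("elephant") stays "elephant" and stem("buying")
gives "buy"; words outside the list come back unchanged.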

Cheers

Pete


- Original Message - 
From: "Yousef Ourabi" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Saturday, November 06, 2004 1:13 AM
Subject: Stemming Oddness


> Hey,
> Thanks for everyone's reply to my last post, I have
> a question. I imported the PorterStemmer and when I
> did the following
> 
> PorterStemmer ps = new PorterStemmer();
> String r1 = ps.stem("elephant");
> r1 is 'eleph'
> 
> also "buying" stems to "bui". Is this normal? Am I doing
> something wrong?
> 
> I am calling reset in between function calls.
> 
> Thanks,
> Yousef

Re: PorterStemmer / Levenshtein Distance

2004-11-05 Thread Pete Lewis
Hi Yousef

If you want to use it for something else then go direct to the Snowball
stemmers; for details see the site:

http://snowball.tartarus.org/
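
On the Levenshtein half of the question: the distance itself is a short
dynamic-programming routine.  The following is a minimal standalone sketch of
the textbook algorithm, not a copy of Lucene's own code; the class and method
names are invented for the example.

```java
/** Textbook Levenshtein edit distance using two rolling rows. */
final class LevenshteinSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Swap rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A fuzzy match would then accept a term when its distance to the query term is
small relative to the term length; the exact threshold is a design choice.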

Cheers

Pete

- Original Message - 
From: "Yousef Ourabi" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, November 05, 2004 2:12 AM
Subject: PorterStemmer / Levenshtein Distance


> Hey,
> On the site it says Lucene uses the Levenshtein distance
> algorithm for fuzzy matching, where is this in the
> source code? Also I would like to use the Porter
> stemming algorithm for something else. Are there any
> documents on the Lucene implementation of the Porter
> Stemmer?
>
> Best,
> Yousef



Re: PorterStemfilter

2004-09-14 Thread Pete Lewis
Hi David

I like KStem more than Porter / Snowball.  It still has limitations,
although it performs better as it has a dictionary to augment the rules.

Note that KStem will also treat "print" and "printer" as two distinct terms,
probably treating them as verb and noun respectively.

Cheers

Pete Lewis

- Original Message - 
From: "David Spencer" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, September 14, 2004 7:19 PM
Subject: Re: PorterStemfilter


> Honey George wrote:
>
> > Hi,
> >  This might be more of a question related to the
> > PorterStemmer algorithm rather than with lucene, but
> > if anyone has the knowledge please share.
>
> You might want to also try the Snowball stemmer:
>
> http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/
>
> And KStem:
>
> http://ciir.cs.umass.edu/downloads/
> >
> > I am using the PorterStemFilter that comes with lucene
> > and it turns out that searching for the word 'printer'
> > does not return a document containing the text
> > 'print'. To narrow down the problem, I have tested the
> > PorterStemFilter in a standalone program and it turns
> > out that the stem of printer is 'printer' and not
> > 'print'. That is, 'printer' is not equal to 'print' +
> > 'er'; the whole of the word is the stem. Can somebody
> > explain the behavior?
> >
> > Thanks & Regards,
> >George



Re: PorterStemfilter

2004-09-14 Thread Pete Lewis
Hi George

There are lots of problems with Porter stemmers: they are not great for English
and get worse for other languages.

If you look at:

http://snowball.tartarus.org/demo.php

You'll see the Snowball demo - this is basically another instance of Porter.

If you enter "print" and "printer" and submit then the results will be
"print" and "printer" - showing that the Porter stemmed versions are
the same as the originals.  Therefore they are distinct terms in their
own right and searches on one will not hit the other.

Cheers

Pete Lewis

- Original Message - 
From: "Honey George" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, September 14, 2004 6:57 PM
Subject: PorterStemfilter


> Hi,
>  This might be more of a question related to the
> PorterStemmer algorithm rather than with lucene, but
> if anyone has the knowledge please share.
>
> I am using the PorterStemFilter that comes with lucene
> and it turns out that searching for the word 'printer'
> does not return a document containing the text
> 'print'. To narrow down the problem, I have tested the
> PorterStemFilter in a standalone program and it turns
> out that the stem of printer is 'printer' and not
> 'print'. That is, 'printer' is not equal to 'print' +
> 'er'; the whole of the word is the stem. Can somebody
> explain the behavior?
>
> Thanks & Regards,
>George



Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Pete Lewis
Hi all

I've been reading the thread with interest.  There is another way I've come
across out of memory errors when indexing large batches of documents.

If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.

Can you check whether or not your garbage collection is being triggered?

Paradoxically, if this is the case, then by reducing the heap space you
can improve performance and get rid of the out of memory errors.
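
One way to check from inside the indexing program is sketched below.  This is
only an illustration: the class and method names are made up, and the
java.lang.management API it uses needs Java 5 or later, so it is an assumption
that you can run on such a JVM.  Calling heapSummary() every few thousand
documents shows whether the collection count is moving at all.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Reports heap usage and how many garbage collections have run so far. */
final class GcCheckSketch {
    /** Sum of collection counts over all collectors the JVM exposes. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** One-line summary suitable for logging during a long indexing run. */
    static String heapSummary() {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);
        return "heap used=" + usedMb + "MB max=" + maxMb + "MB collections=" + totalCollections();
    }
}
```

On a 1.4.2 JVM, which predates java.lang.management, starting the JVM with
-verbose:gc prints each collection to the console instead.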

Cheers
Pete Lewis

- Original Message - 
From: "Daniel Taurat" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents


> Daniel Aber wrote:
>
> >On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
> >
> >
> >
> >>I am facing an out of memory problem using  Lucene 1.4.1.
> >>
> >>
> >
> >Could you try with a recent CVS version? There has been a fix about files
> >not being deleted after 1.4.1. Not sure if that could cause the problems
> >you're experiencing.
> >
> >Regards
> > Daniel
> >
> >
> >
> Well, it seems not to be files, it looks more like those SegmentTermEnum
> objects accumulating in memory.
> I've seen some discussion on these objects in the developer-newsgroup
> that had taken place some time ago.
> I am afraid this is some kind of runaway caching I have to deal with.
> Maybe not  correctly addressed in this newsgroup, after all...
>
> Anyway: any idea if there is an API command to re-init caches?
>
> Thanks,
>
> Daniel



Re: Using MySpell iso the Snowball Analyzer

2004-09-09 Thread Pete Lewis
Hi Aad

Use the stemmed result as what you index, but then also remember to stem the
query terms as well - you need to do the same on the way out as on the way
in.
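
As a rough illustration of "the same on the way out as on the way in", here is
a toy in-memory index.  It is a sketch only: it does not use the Lucene API,
the class name is invented, and the stand-in stemmer simply lower-cases and
drops a trailing "s".  What matters is that index() and search() call the same
stem() method.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index that stems terms identically when indexing and searching. */
final class SameStemBothWaysSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stand-in stemmer (hypothetical): lower-case and strip one trailing "s".
    static String stem(String term) {
        String t = term.toLowerCase();
        return (t.length() > 3 && t.endsWith("s")) ? t.substring(0, t.length() - 1) : t;
    }

    /** Index the stemmed form of every whitespace-separated token. */
    void index(int docId, String text) {
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(stem(token), k -> new TreeSet<>()).add(docId);
        }
    }

    /** Correct lookup: the query term goes through the same stemmer. */
    Set<Integer> search(String queryTerm) {
        return postings.getOrDefault(stem(queryTerm), new TreeSet<>());
    }

    /** Incorrect lookup, for comparison: the query term is not stemmed. */
    Set<Integer> searchUnstemmed(String queryTerm) {
        return postings.getOrDefault(queryTerm.toLowerCase(), new TreeSet<>());
    }
}
```

If document 1 contains "Printers" and document 2 contains "printer", then
search("printers") finds both, while searchUnstemmed("printers") finds
nothing because only the stemmed form "printer" was indexed.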

We don't use MySpell but we do use our own stemmer in this way, as there are
many examples where Snowball falls down like:

caught -> caught instead of catch
buses -> buse instead of bus

and Snowball gets worse for non-English languages like Dutch.

Cheers
Pete

- Original Message - 
From: "Aad Nales" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, September 09, 2004 8:44 AM
Subject: Using MySpell iso the Snowball Analyzer


> For an educational customer we have been requested to add spell
> checking to queries that enter lucene. The MySpell classes of
> Pietschmann seem to make this more than feasible. What I wonder is whether
> somebody else has done this before? Any tips, questions or remarks?
>
> MySpell is the successor of ISpell and is used as the spellchecker in
> OpenOffice. It executes a stemming algorithm in combination with a
> dictionary. My second question is whether anyone has extracted the stemming
> result to be used in an index?
>
> Thanks for any or all feedback,
> cheers,
> Aad
>
>
> --
> Aad Nales
> [EMAIL PROTECTED], +31-(0)6 54 207 340



Re: Searching different types of words

2003-11-25 Thread Pete Lewis
Hi

I'd recommend Kstem over Porter; it performs much better on English, let
alone when you get to other languages. You can get the source code for
Kstem.jar at the following website:

http://ciir.cs.umass.edu/downloads/

Pete

- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 25, 2003 5:42 PM
Subject: Re: Searching different types of words


> Yes.
> For this particular example, PorterStemFilter will do the job.
> For more complex things (e.g. a search for car returning car, auto,
> automobile, vehicle) you'll need to add thesaurus-like capability to
> your indexer.  This can be done by writing a custom Analyzer.
>
> It sounds like you have a lot of questions, but have not read much
> Lucene documentation. :)
>
> Otis
>
>
> --- "Pleasant, Tracy" <[EMAIL PROTECTED]> wrote:
> > If I search for "like" I would want the search to return documents
> > containing "like", "liked", "likes", etc.. variations of the word.
> >
> > Is there a way to tell Lucene to do this?
>



Re: Index entire filesystem

2003-11-05 Thread Pete Lewis
Hi Stefan

Wouldn't mind joining in a joint approach, only problem is timing - it would
probably be late December before we could start putting the hours in.

If anyone could come up with work packages, we wouldn't mind doing our share
of the work - otherwise I wouldn't mind leading an effort in the New Year.

Has anyone done a full survey of what's out there?  I'd like to be able to
cover the list that Stellent's OutsideIn filters cover (see attached) but
obviously starting from the most popular formats.

Cheers

Pete

- Original Message - 
From: "Stefan Groschupf" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 05, 2003 11:01 AM
Subject: Re: Index entire filesystem


> There is some ongoing work for nutch.org.
> Maybe we can bundle all the work together?!
> Nutch already has a Java *.doc, *.pdf parser as well.
>
> Stefan
>
> Pete Lewis wrote:
>
> >Hi Stefan
> >
> >Using OpenOffice will enable you to parse 182 file formats, but it's not a
> >pure java solution and you still need an alternate solution for pdfs.
> >
> >I'd be interested in knowing whether anyone is working on a pure java
> >solution that would give us a single method for handling ms office
> >documents / pdfs / etc.
> >
> >Cheers
> >
> >Pete
> >
> >- Original Message - 
> >From: "Stefan Groschupf" <[EMAIL PROTECTED]>
> >To: "Lucene Users List" <[EMAIL PROTECTED]>
> >Sent: Wednesday, November 05, 2003 10:26 AM
> >Subject: Re: Index entire filesystem
> >
> >
> >
> >
> >>I wrote to this list some days ago, to announce a possibility to
> >>parse 182 file formats.
> >>There was a tiny bug report some days ago, I hope I can fix it.
> >>
> >>Browse the archive to figure out more.
> >>
> >>Cheers
> >>Stefan
> >>
> >>Marcel Stor wrote:
> >>
> >>
> >>
> >>>Hi all,
> >>>
> >>>I'm thinkin' about writing a search tool for my filesystem. I know such
> >>>things exist already but programming it myself is much more fun ;-)
> >>>So, I would have Lucene crawl through my filesystem and pass each file
> >>>to an appropriate indexer (PDF -> PDFbox, etc.). Yes, I run a Windows
> >>>system and would depend on the file ending to distinguish the file type.
> >>>Is this a good idea in general? Is there a list of available indexers for
> >>>the different file types? Any other comments are also welcome.
> >>>
> >>>Regards,
> >>>Marcel
>

Attachment (OutsideIn filter format list):
1000 Word for DOS 4.x
1001 Word for DOS 5.x
1002 Wordstar 5.0
1003 Wordstar 4.0
1004 Wordstar 2000
1005 WordPerfect 5.0
1006 MultiMate 3.6
1007 MultiMate Advantage 2
1008 IBM DCA/RFT
1009 IBM DisplayWrite 2 or 3
1010 SmartWare II
1011 Samna
1012 PFS: Write A
1013 PFS: Write B
1014 Professional Write 1
1015 Professional Write 2
1016 IBM Writing Assistant
1017 First Choice WP
1018 WordMarc
1019 Navy DIF
1020 Volkswriter
1021 DEC DX 3.0 and below
1022 Sprint
1023 WordPerfect 4.2
1024 Total Word
1025 Wang IWP
1026 Wordstar 5.5
1028 Rich Text Format
1029 Mac Word 3.0
1030 Mac Word 4.0
1031 Mass 11
1032 MacWrite II
1033 XyWrite / Nota Bene
1034 IBM DCA/FFT
1035 Mac WordPerfect 1.x
1036 IBM DisplayWrite 4
1037 Mass 11
1038 WordPerfect 5.1/5.2
1039 MultiMate 4.0
1040 Q&A Write
1041 MultiMate Note
1043 PC File 5.0 Doc
1044 Lotus Manuscript 1.0
1045 Lotus Manuscript 2.0
1046 Enable WP 3.0
1047 Windows Write
1048 Microsoft Works 1.0
1049 Microsoft Works 2.0
1050 Wordstar 6.0
1051 OfficeWriter
105

Re: Index entire filesystem

2003-11-05 Thread Pete Lewis
Hi Stefan

Using OpenOffice will enable you to parse 182 file formats, but it's not a
pure java solution and you still need an alternate solution for pdfs.

I'd be interested in knowing whether anyone is working on a pure java
solution that would give us a single method for handling ms office
documents / pdfs / etc.

Cheers

Pete

- Original Message - 
From: "Stefan Groschupf" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 05, 2003 10:26 AM
Subject: Re: Index entire filesystem


>
> I wrote to this list some days ago, to announce a possibility to
> parse 182 file formats.
> There was a tiny bug report some days ago, I hope I can fix it.
>
> Browse the archive to figure out more.
>
> Cheers
> Stefan
>
> Marcel Stor wrote:
>
> >Hi all,
> >
> >I'm thinkin' about writing a search tool for my filesystem. I know such
> >things exist already but programming it myself is much more fun ;-)
> >So, I would have Lucene crawl through my filesystem and pass each file
> >to an appropriate indexer (PDF -> PDFbox, etc.). Yes, I run a Windows
> >system and would depend on the file ending to distinguish the file type.
> >Is this a good idea in general? Is there a list of available indexers for
> >the different file types? Any other comments are also welcome.
> >
> >Regards,
> >Marcel



Re: Lucene demo ideas?

2003-09-17 Thread Pete Lewis
Might want two demos, one for Unix environments and one for Windows.

Most users will want a fast start that they can copy and adapt.  So quick
targets would be:

filesystems - html / text / pdf / office documents for windows.
xml - fairly simple example maybe against news items.
database - again simple maybe a pseudo employee database.
website - accessible from the filesystem.
website - that requires crawling.

Show hit markup.

Pete

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 17, 2003 1:00 PM
Subject: Lucene demo ideas?


> I'm about to start some refactorings on the web application demo that
> ships with Lucene to show off its features and be usable more easily
> and cleanly out of the box - i.e. just drop into Tomcat's webapps
> directory and go.
>
> Does anyone have any suggestions on what they'd like to see in the demo
> app?  Some of my ideas are:
>
> - Eliminate the need to do a command-line indexing, let the web app do
> this upon command, allowing you to specify where the index lives (there
> will be a reasonable default like ~/lucenedemo/index perhaps) and what
> directory tree to index (perhaps defaulting to the root directory or
> c:\, or where instead?)
>
> - Spin off a background indexing thread so the web app searching is
> immediately useful after kicking off the indexing process, and allow a
> status view of the indexing progress.
>
> - Index text and HTML files.  Any others?  I don't want to get into
> putting too many dependencies in though - let's keep it relatively
> simple, although still demonstrative.  Allow search filtering by last
> modified date range and document type (extension).
>
> - Perhaps allow you to specify the analyzer to use when indexing.
>
> - Show the explanation of how scores are computed in the search results
> as an option.
>
> I'm all ears to possibilities of improvements!  Send your wishlist.
>
> Erik



Reference for Lucene as a search tool built into a CD

2003-09-09 Thread Pete Lewis
Does anyone know of Lucene being packaged onto a CD to provide a search facility for
the data on that CD?  If so, would it be possible to get a reference?

Thanks

Pete



Multi-lingual synonym and homonym lists

2003-06-08 Thread Pete Lewis
Hi all

Does anyone know of any synonym and homonym lists for the different European
languages?

Sorry for the cross-posting but I'd like to use them for query expansion in different
languages.

Pete

Re: RE : Parsers

2003-05-30 Thread Pete Lewis
Hi guys

Thanks, Jawin looks really nice :) 

Pete
- Original Message - 
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, May 29, 2003 9:45 AM
Subject: Re: RE : Parsers


> Victor Hadianto wrote:
> >>I'm using successfully a combination of Office automation via Jawin
> >>(free Java/COM bridge) to convert PPT files. You need to learn a bit
> >>about the pseudo-object model of PowerPoint to properly convert various
> >>objects, but this information can be found at msdn.microsoft.com.
> > 
> > 
> > Hmm this is really a nice idea, I've never heard of Jawin until now. 
> > 
> >
> 
> I highly recommend it - it works pretty well, it's stable, mature, and 
> most of all free :-) Sure, it has a well-known range of problems, e.g. 
> with calls to functions that require structs, but as it happens most of 
> the automation interfaces don't use them. I've been using it for 
> Java-Windows integration on various occasions, solving such "taboo" 
> problems like reading/creating Windows shortcuts, file conversion, 
> reading Outlook mail etc.
> 
> It works also with DLL's, although this is a bit more involved... It 
> uses an extensible marshaller/de-marshaller, so if you know COM pretty 
> well you can extend it to handle any conceivable parameter types.
> 
> -- 
> Best regards,
> Andrzej Bialecki
> 
> -
> Software Architect, System Integration Specialist
> CEN/ISSS EC Workshop, ECIMF project chair
> EU FP6 E-Commerce Expert/Evaluator
> -
> FreeBSD developer (http://www.freebsd.org)



Re: RE : Parsers

2003-05-29 Thread Pete Lewis
Hi Victor

Thanks.

In the past I have used the Inso OutsideIn filters and found them very good;
however I'd like to come up with a pure Java solution, so if there is a Java
equivalent to the Inso filters I'd be grateful for any details.  Failing that,
I thought that I'd go for individual parsers, initially using the file
extensions to select the correct parser but in the future adding a file type
recogniser for files without extensions.  Hence my request for anyone
knowing of good parsers, particularly for the most common formats.
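
The extension-based selection I have in mind would look roughly like the
sketch below.  The class and method names are invented for illustration, and
the parsers are represented only by name (PDFBox and POI being the ones
mentioned in this thread) rather than by real API calls.  Files with no
extension fall through to "unknown", which is where a file type recogniser
would plug in later.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Maps a file extension to the name of the parser that should handle it. */
final class ParserSelectorSketch {
    static final String UNKNOWN = "unknown";

    private final Map<String, String> parserByExtension = new HashMap<>();

    /** Register a parser name for an extension such as "pdf" (case-insensitive). */
    void register(String extension, String parserName) {
        parserByExtension.put(extension.toLowerCase(Locale.ROOT), parserName);
    }

    /** Lower-cased extension of a bare file name (no directory part), or "" if none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return ""; // no dot, leading dot only, or trailing dot
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Parser name for the file, or UNKNOWN when the extension is not registered. */
    String parserFor(String fileName) {
        String parser = parserByExtension.get(extensionOf(fileName));
        return parser != null ? parser : UNKNOWN;
    }
}
```

Keeping the mapping in a table rather than in a chain of if statements means
a new format only needs one more register() call.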

That being said, has anyone come across a Powerpoint parser?

Pete

- Original Message -
From: "Victor Hadianto" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, May 29, 2003 12:01 AM
Subject: Re: RE : Parsers


> > The www.textmining.org text extractors work very well for Word and pdf
> > documents.
> > They use both PDFBox and POI.
> >
> > For Excel, using POI directly is very easy. Tell me if you want to see
> > code samples.
> >
> > I'm looking myself for a Powerpoint text extractor, if you know one...
>
> Another solution is to use Microsoft Office itself. You can set up a server
> that serves requests to convert Microsoft Office docs. There are many ways of
> doing this, for example using Python to directly call Office then put your
> python script in a webserver.
>
> Or you can set up a .Net conversion server and call this .Net service
> using a Web Service, and many other interesting techniques.
>
> victor



Re: Parsers

2003-05-28 Thread Pete Lewis
Hi Adriano

Thanks.  Code samples would be nice :)

Will come back if I find something for .ppt.

Pete

- Original Message -
From: "Adriano Labate" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Wednesday, May 28, 2003 1:03 PM
Subject: RE : Parsers


The www.textmining.org text extractors work very well for Word and pdf
documents.
They use both PDFBox and POI.

For Excel, using POI directly is very easy. Tell me if you want to see
code samples.

I'm looking myself for a Powerpoint text extractor, if you know one...

Adriano Labate


-Original Message-
From: Pete Lewis [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 28 May 2003 12:48
To: Lucene Users List
Subject: Parsers


Hi all,

I have a rather nice html parser that I got from SourceForge.  Does
anyone know of any good parsers for pdf and Microsoft Office Suite
(.doc, .ppt, .xls, etc)?  Any help would be much appreciated.

Pete Lewis







Parsers

2003-05-28 Thread Pete Lewis
Hi all,

I have a rather nice html parser that I got from SourceForge.  Does anyone know of any
good parsers for pdf and Microsoft Office Suite (.doc, .ppt, .xls, etc)?  Any help
would be much appreciated.

Pete Lewis