Re: Text extraction tool for Microsoft Office 2007

2009-02-21 Thread Otis Gospodnetic
Hi, POI - http://poi.apache.org/ - or Tika (which uses POI) - http://lucene.apache.org/tika . You can also use the code from Lucene in Action to index the extracted text with Lucene - http://manning.com/hatcher2 ; the code is free to download. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
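The Tika route Otis suggests can be sketched roughly as below. This is a minimal sketch, assuming Tika (and its POI dependency) is on the classpath; the class name `OfficeTextExtractor` and the `path` argument are illustrative, not anything from the thread. `AutoDetectParser` picks the appropriate parser for .docx/.xlsx/.pptx input, and `BodyContentHandler` collects the body text.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class OfficeTextExtractor {
    // Extract plain text from an Office 2007 file (or anything Tika recognizes).
    public static String extract(String path) throws Exception {
        AutoDetectParser parser = new AutoDetectParser(); // delegates to the POI-backed parsers
        BodyContentHandler handler = new BodyContentHandler(-1); // -1: no output size limit
        Metadata metadata = new Metadata();
        InputStream in = new FileInputStream(path);
        try {
            parser.parse(in, handler, metadata, new ParseContext());
        } finally {
            in.close();
        }
        return handler.toString(); // plain text, ready to feed into a Lucene Field
    }
}
```

The returned string can then be added to a Lucene document as an analyzed field, as in the Lucene in Action examples.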

Text extraction tool for Microsoft Office 2007

2009-02-21 Thread Zhang, Lisheng
Hi, What is the best free tool to extract text from Microsoft Office 2007 documents (Word 2007, Excel 2007, PowerPoint 2007) so that we can index them with Lucene? Thanks very much for your help, Lisheng

RAID Stripe sizes - suggestions?

2009-02-21 Thread Paul Smith
I'm just wondering if anyone can share what they've learned about optimizing storage configurations for relatively large indexes (millions of documents, 10+ GB in size). Is there a suggested best stripe size for RAID-10 configurations? I did some Googling and was surprised I couldn't f

How can i analyse a Lucene Query ?

2009-02-21 Thread Kanterwoopy
I need to know which words were decisive in a post being found by the searcher. What classes can I use for this? What does QueryScorer do? Can somebody give me a short overview of my options?
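QueryScorer lives in Lucene's contrib highlighter package: it scores text fragments by the query terms they contain, and the Highlighter then marks those terms up. A minimal sketch of that usage follows, against the Lucene 2.4-era contrib API; the field name `contents` and the sample text are hypothetical, not from the thread.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class ExplainMatch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // "contents" is a hypothetical field name for illustration.
        Query query = new QueryParser("contents", analyzer).parse("lucene scoring");
        // QueryScorer weights fragments by the query terms they contain.
        Highlighter highlighter = new Highlighter(
                new SimpleHTMLFormatter("<b>", "</b>"), new QueryScorer(query));
        String post = "An introduction to Lucene scoring and ranking.";
        // The terms wrapped in <b>...</b> are the ones that made the post match.
        System.out.println(highlighter.getBestFragment(analyzer, "contents", post));
    }
}
```

The highlighted fragment shows, per document, which query terms contributed to the hit, which is essentially the question being asked.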

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-21 Thread Philip Puffinburger
Thanks for the suggestion. We're going to go over all of this information and these suggestions next week to see what we want to do. -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Saturday, February 21, 2009 11:52 AM To: java-user@lucene.apache.org Subject: Re: 2.3.2 -> 2.4

Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-21 Thread Robert Muir
That was just a suggestion as a quick hack... it still won't really fix the problem, because some character + accent combinations don't have composed forms. Even if you added the entire combining diacritical marks block to the JFlex grammar, it's still wrong... what needs to be supported is \p{Word_Bre

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-21 Thread Philip Puffinburger
That's something we can try. I don't know how much performance we'd lose doing that, as our custom filter has to decompose the tokens to do its operations. So instead of 0..1 conversions we'd be doing 1..2 conversions during indexing and searching. -Original Message- From: Robert

Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-21 Thread Robert Muir
Normalize your text to NFC; then it will be \u0043 \u00F3 \u006D \u006F and will work... On Fri, Feb 20, 2009 at 11:16 PM, Philip Puffinburger <ppuffinbur...@tlcdelivers.com> wrote: > some changes were made to the StandardTokenizer.jflex grammar (you can svn > diff the two URLs fairly trivially
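The NFC fix Robert describes can be done with the JDK's own `java.text.Normalizer` (available since Java 6), so no extra library is needed before tokenizing. The sample string below is hypothetical; it shows the decomposed "Cómo" collapsing to the composed code points he lists (\u0043 \u00F3 \u006D \u006F).

```java
import java.text.Normalizer;

public class NfcDemo {
    public static void main(String[] args) {
        // Decomposed input: 'C', 'o', combining acute accent (\u0301), 'm', 'o' - 5 chars.
        String decomposed = "C\u006F\u0301\u006D\u006F";
        // NFC composes 'o' + combining acute into the single code point \u00F3.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        // Result is \u0043 \u00F3 \u006D \u006F ("Cómo"), 4 chars.
        System.out.println(composed.equals("C\u00F3mo") + " " + composed.length());
        // prints "true 4"
    }
}
```

Running this normalization before the text reaches StandardTokenizer sidesteps the grammar's handling of combining marks, though as noted upthread it cannot help for combinations that have no composed form.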

Re: Indexer.Java problem

2009-02-21 Thread Erik Hatcher
Also, the first several hits here provide the tricks for updating the code to the latest API. :) Erik On Feb 19, 2009, at 10:41 AM, Seid Mohammed wrote: I am using NetBeans on Windows to test Lucene. I