Re: [ANN] PDFBox 0.6.0
Ben- In attempting to use PDFBox 0.6.0, I received the following error when attempting to scan a reasonably sized PDF repository. Any thoughts?

caught a class java.io.EOFException with message: Unexpected end of ZLIB input stream

Eric Anderson
LanRx Network Solutions

Quoting Ben Litchfield [EMAIL PROTECTED]:

I would like to announce the next release of PDFBox. PDFBox allows PDF documents to be indexed with Lucene through a simple interface. Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument, which will extract all text and PDF document summary properties as Lucene fields. You can obtain the latest release from http://www.pdfbox.org

Please send all bug reports to me and attach the PDF document when possible.

RELEASE 0.6.0
- Massive improvements to memory footprint.
- Must call close() on the COSDocument (LucenePDFDocument does this for you).
- Really fixed the bug where small documents were not being indexed.
- Fixed bug where no whitespace existed between "obj" and the start of the object: Exception in thread "main" java.io.IOException: expected='obj' actual='obj/Pro
- Fixed issue with spacing where textLineMatrix was not being copied properly.
- Fixed 'bug' where parsing would fail with some PDFs with double endobj definitions.
- Added PDF document summary fields to the Lucene document.

Thank you,
Ben Litchfield
http://www.pdfbox.org

LanRx Network Solutions, Inc.
Providing Enterprise Level Solutions...On A Small Business Budget

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Advanced Text Indexing with Lucene
Another fine article by Otis: http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html

PA.
Re: my experiences - Re: Parsing Word Docs
David,

The textmining.org stuff only works on Word 97 and above. It should work with no exceptions on any Word 97 doc. If you have any problems, then it is from an earlier version (most likely Word 6.0) or it's not a Word document. If this isn't the case, you need to email me so I can fix it and make it better for the benefit of everyone. I plan on adding support for Word 6 in the future.

Ryan Ackley

- Original Message -
From: David Spencer [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 05, 2003 6:24 PM
Subject: my experiences - Re: Parsing Word Docs

FYI, I tried the textmining.org/POI combo on a collection of 350 Word docs people have developed here over the years, and it failed on 33% of them with exceptions being thrown about the formats being invalid. I tried antiword ( http://www.winfield.demon.nl/ ), a free native *.exe, and it worked great (well, it seemed to process all the files fine). I've had similar experiences with PDF - I tried the 3 or so freeware/Java PDF text extractors and they were not as good as the exe, pdftotext, from foolabs (http://www.foolabs.com/xpdf/). Not satisfying to a Java developer, but these work better than anything else I can find. You get source, and I use them on Windows and Linux, no problem.

Eric Anderson wrote:
I'm interested in using the textmining/textextraction utilities using Apache POI that Ryan was discussing. However, I'm having some difficulty determining what the insertion point would be to replace the default parser with the Word parser. Any assistance would be appreciated.
Re: my experiences - Re: Parsing Word Docs
Eric,

The problem with antiword is that it is a native application. You must write a class that uses JNI to access the native code. If you link your Java code with native code, you have lost one of the biggest benefits of Java: platform independence. I would suggest you use the library at http://textmining.org. Contrary to what David Spencer says, it should work on all documents created with Word 97 or above. I have literally indexed 100,000s of unique documents using my library.

Ryan Ackley

- Original Message -
From: Eric Anderson [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 05, 2003 7:14 PM
Subject: Re: my experiences - Re: Parsing Word Docs

Ok. Thanks for the tip. I downloaded and compiled Antiword, and would like to now add it to my indexing class. However, I'm not sure how the application would be called, or from where it would be called. How will I have the class parse the document through Antiword to create the keyword index, while leaving the DOC intact, as Mr. Litchfield did with PDFBox? Your assistance is greatly appreciated.

Eric Anderson
815-505-6132

[snip]
Re: my experiences - Re: Parsing Word Docs
I'll go either way, but I still don't know how to implement the Word parser, as opposed to the PDF parser or HTML parser.

Eric Anderson
LanRx Network Solutions

Quoting Ryan Ackley [EMAIL PROTECTED]:
The problem with antiword is that it is a native application. You must write a class that uses JNI to access the native code. If you link your Java code with native code, you have lost one of the biggest benefits of Java: platform independence. I would suggest you use the library at http://textmining.org. Contrary to what David Spencer says, it should work on all documents created with Word 97 or above. I have literally indexed 100,000s of unique documents using my library.

Ryan Ackley

[snip]
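The "insertion point" Eric keeps asking about comes down to the file-type branch in the demo indexer: each extension routes to a different text extractor. A toy, Lucene-free sketch of that dispatch (the class name, method, and parser labels are invented for illustration; in real code each branch would call PDFBox, the Word parser, the HTML parser, etc.):

```java
import java.io.File;

// Toy sketch of extension-based parser dispatch, mirroring how the
// Lucene demo indexer branches on file type. The returned strings are
// stand-ins for real extractor calls.
public class ParserDispatch {
    public static String parserFor(File f) {
        String name = f.getName().toLowerCase();
        if (name.endsWith(".pdf")) return "pdf-parser";
        if (name.endsWith(".doc")) return "word-parser";
        if (name.endsWith(".html") || name.endsWith(".htm")) return "html-parser";
        return "plain-text-parser";
    }

    public static void main(String[] args) {
        // Case-insensitive: "report.DOC" still routes to the Word parser.
        System.out.println(parserFor(new File("report.DOC")));
    }
}
```

The point is only that adding Word support means adding one more branch here; the rest of the indexing loop (building the Document, calling writer.addDocument) stays the same.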
Re: my experiences - Re: Parsing Word Docs
Ryan,

I tried to use textmining to extract text from Word 97 documents. Some German characters like ä, ü, etc. aren't parsed correctly, so I can't use it, because many German words include these characters. I don't know if the reason is textmining or HDF from POI (HSSF from POI parses these characters correctly). Do you have any hints for me?

Michael

- Original Message -
From: Ryan Ackley [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 6, 2003 13:13
To: Lucene Users List
Subject: Re: my experiences - Re: Parsing Word Docs

The textmining.org stuff only works on Word 97 and above. It should work with no exceptions on any Word 97 doc. If you have any problems, then it is from an earlier version (most likely Word 6.0) or it's not a Word document.

[snip]
Re: [ANN] PDFBox 0.6.0
In this release I changed how I parse the document, which may have introduced this bug. I have received another report of it and will have it fixed for the next point release.

You said you tried with a reasonably sized PDF repository. Did you stop indexing at this error, or did you continue? If you continued, is this the only error that you got?

-Ben

On Thu, 6 Mar 2003, Eric Anderson wrote:
Ben- In attempting to use PDFBox 0.6.0, I received the following error when attempting to scan a reasonably sized PDF repository. Any thoughts?

caught a class java.io.EOFException with message: Unexpected end of ZLIB input stream

[snip]
Re: my experiences - Re: Parsing Word Docs
Thanks a lot :) I'll try it.

- Original Message -
From: Mario Ivankovits [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 6, 2003 14:00
To: Lucene Users List
Subject: Re: my experiences - Re: Parsing Word Docs

The problems with German umlauts should be fixed. I have posted a patch (see http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14735), and it should be applied now. I haven't cross-checked it yet. I currently use POI to index documents with Lucene, but I do not use the standard way with a Lucene Word-document class (like the PDF document class). I have had some problems with getting the text from old documents, but in that case my system falls back to a simple STRINGS parser (which filters out every human-readable char from the document file).

byebye Mario

[snip]
RE: [ANN] PDFBox 0.6.0
Ben,

I downloaded PDFBox and installed it, and I can use:

  java org.pdfbox.Main PDF-file output-text-file

to convert a .pdf file to a text file. Then I tried to integrate it with Lucene. I modified the following code in IndexHTML.java:

  else if (file.getPath().endsWith(".pdf")) {
      Document doc = LucenePDFDocument.getDocument(file);
      System.out.println("adding pdf files");
      writer.addDocument(doc);
  }

It did pass the ant compile (ant wardemo). However, when I tested:

  java org.apache.lucene.demo.IndexHTML -create -index {index-dir} ..

it seems it still did not pick up the new IndexHTML.java and still did not index .pdf files. Did I miss something here?

Regards,
George

= Original Message From Lucene Users List [EMAIL PROTECTED] =
I would like to announce the next release of PDFBox. PDFBox allows PDF documents to be indexed with Lucene through a simple interface.

[snip]
RE: Multi Language support
Hi Günter,

I had a similar requirement for my use of Lucene. We have documents with mixed languages: some of the text in the user's native language and some in English. We made the decision not to use any of the stemming analyzers and to index with no stop words (I didn't like the no-stop-words decision, but it wasn't really my call). My analyzer's tokenStream method:

  public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream result = new StandardTokenizer(reader);
      result = new StandardFilter(result);
      result = new LowerCaseFilter(result);
      return result;
  }

Do you really need stemming in your application? Do you really need stop words? See this note http://archives.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=653731 for a discussion of the advantages/disadvantages of stemming. If you still want stop words, you can create a list that includes words from more than one language, then use the same analyzer for all of your content. If you still need stemming, you will probably have to give your users the ability to tell you which language index they wish to search, and at that point you would probably be better off maintaining separate indices for each language.

Best of luck,
Eric

-Original Message-
From: Günter Kukies [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 06, 2003 2:08 AM
To: Lucene Users List
Subject: Multi Language support

Hello,

this is what I know about indexing international documents:
1. I have a language ID.
2. With this ID I choose a special Analyzer for that language.
3. I can use one index for all languages.

But what about searching international documents? I don't have a language ID, because the user is interested in documents in his native language plus a second language, mostly English. So, what Analyzer do I use for searching?

Thanks
Günter
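Conceptually, the analyzer described above just tokenizes and lowercases, with no language-specific stemming or stop words, which is why one analyzer can serve every language in the index. As a toy illustration in plain Java (not Lucene; the class and method names are invented for this example), language-neutral tokenization can look like this:

```java
import java.util.Arrays;

// Toy stand-in for StandardTokenizer + LowerCaseFilter: split on anything
// that is not a letter, then lowercase. Because \p{L} matches any Unicode
// letter, German characters like ä and ü survive tokenization intact.
public class SimpleAnalyzerSketch {
    public static String[] tokens(String text) {
        String[] parts = text.toLowerCase().split("[^\\p{L}]+");
        // split() can yield a leading empty token when the text starts
        // with a delimiter, so filter empties out
        return Arrays.stream(parts)
                     .filter(s -> !s.isEmpty())
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("Grüße from the Lucene Users List!")));
    }
}
```

Because nothing here depends on a particular language's stemming rules or stop-word list, the same tokenization can be used at index time and at search time regardless of which languages a document mixes.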
Re: [ANN] PDFBox 0.6.0
Ben,

Using PDFBox 0.5.6 and, alternatively, PDFBox 0.6.0, I receive the following stack trace:

  java.lang.ClassCastException: org.pdfbox.cos.COSObject
      at org.pdfbox.encoding.DictionaryEncoding.<init>(DictionaryEncoding.java:66)
      at org.pdfbox.cos.COSObject.getEncoding(COSObject.java:269)
      at org.pdfbox.cos.COSObject.encode(COSObject.java:210)
      at org.pdfbox.util.PDFTextStripper.showString(PDFTextStripper.java:959)
      at org.pdfbox.util.PDFTextStripper.handleOperation(PDFTextStripper.java:788)
      at org.pdfbox.util.PDFTextStripper.process(PDFTextStripper.java:379)
      at org.pdfbox.util.PDFTextStripper.process(PDFTextStripper.java:366)
      at org.pdfbox.util.PDFTextStripper.processPageContents(PDFTextStripper.java:288)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:231)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:223)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:148)
      ...
  (stack from PDFBox 0.6.0)

I also received the error Eric reported, but only once. My indexer continues parsing the other PDF documents after getting an error. Do you have any idea regarding the ClassCastException?

Michael

- Original Message -
From: Ben Litchfield [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 6, 2003 14:45
To: Lucene Users List
Subject: Re: [ANN] PDFBox 0.6.0

In this release I changed how I parse the document, which may have introduced this bug. I have received another report of it and will have it fixed for the next point release.

[snip]
Re: my experiences - Re: Parsing Word Docs
Eric Anderson wrote:
Ok. Thanks for the tip. I downloaded and compiled Antiword, and would like to now add it to my indexing class. However, I'm not sure how the application would be called,

How? You exec it, passing the file name, and it prints the ASCII text to stdout. This method takes the file name (e.g. c:/dir1/dir2/foo.doc) and returns the output from antiword as one big string:

  // rt is the JVM Runtime (this line was implied in the original post)
  private static Runtime rt = Runtime.getRuntime();
  private static String anti = "c:/antiword/antiword.exe";

  public static String getAntiText(String fn) throws Throwable {
      Process p = null;
      InputStream is = null;
      DataInputStream dis = null;
      try {
          p = rt.exec(new String[] { anti, fn });
          is = p.getInputStream();
          dis = new DataInputStream(is);
          String line;
          StringBuffer sb = new StringBuffer();
          while ((line = dis.readLine()) != null) {
              //o.println("READ: " + line);
              sb.append(line);
              sb.append(" ");
          }
          return sb.toString();
      } finally {
          try { dis.close(); } catch (Throwable t) { }
          try { is.close(); } catch (Throwable t) { }
          try { p.waitFor(); } catch (Throwable t) { }
          try { p.destroy(); } catch (Throwable t) { }
      }
  }

and from where it would be called.

From where? Wherever the file is a Word doc, e.g. its name ends with .doc.

How will I have the class parse the document through Antiword to create the keyword index, but leaving the DOC intact, as Mr. Litchfield did with PDFBox?

Hmmm, not sure what the exact issue is, but is this the answer:

  doc.add(Field.Text("contents", new StringReader(getAntiText(file_name_of_word_file))));

[snip]
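For what it's worth, the same exec-and-capture pattern in getAntiText can be written with ProcessBuilder and a BufferedReader instead of the deprecated DataInputStream.readLine. This is a generic sketch, not code from the thread; the echo command in main is a placeholder for the antiword binary path and .doc file name:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Run an external command and return its stdout as one big string,
// with newlines replaced by spaces (same output shape as getAntiText).
public class ExecToString {
    public static String run(String... cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectErrorStream(true); // fold stderr into stdout so the child can't block on a full pipe
        Process p = pb.start();
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                sb.append(line).append(' ');
            }
        }
        p.waitFor(); // reap the child process
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // e.g. run("c:/antiword/antiword.exe", "c:/dir1/dir2/foo.doc")
        System.out.println(run("echo", "hello"));
    }
}
```

Draining stdout before waitFor matters: a child that fills its output pipe while the parent is blocked in waitFor will deadlock, which is the classic Runtime.exec pitfall.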
Re: my experiences - Re: Parsing Word Docs
Ryan Ackley wrote:
Eric, The problem with antiword is that it is a native application. You must write a class that uses JNI to access the native code.

No you don't. Just use Runtime.exec - no JNI :)

If you link your java code with native code you have lost one of the biggest benefits of Java, platform independence.

Yeah, but given that the source for antiword is available, it runs on all the platforms I use (Windows/Linux/Sun), and it works better than anything else (it seems to accept older formats than POI/textmining), it seems to get the job done better.

I would suggest you use the library at http://textmining.org. Contrary to what David Spencer says, it should work on all documents created with Word 97 or above. I have literally indexed 100,000s of unique documents using my library.

Ryan Ackley

[snip]
Re: my experiences - Re: Parsing Word Docs
Ryan Ackley wrote:
> David, the textmining.org stuff only works on Word97 and above. It should work with no exceptions on any Word 97 doc.

Could be - we had pre-Word97 docs, as some date from 1996, when we (Lumos at least) were founded.

> If you have any problems then it is from an earlier version (most likely Word 6.0) or it's not a Word document. If this isn't the case you need to email me so I can fix it and make it better for the benefit of everyone. I plan on adding support for Word 6 in the future.
>
> Ryan Ackley
Re: Regarding Setup Lucene for my site
>> 1. 2 threads per request may improve speed up to 50%
>
> Hmm? Could you clarify? During indexing, multithreading may speed things up (splitting docs to index in 2 or more sets, indexing separately, combining indexing). But... isn't that a good thing? Or are you saying that it'd be good to have multi-threaded search functionality for a single search? (in my experience searching is seldom the slow part)

You may improve both indexing and searching. Indexing, because the merge operation will lock just one thread and a smaller part of the index while the other threads are still working; searching, because you can distribute the query to more barrels. In both cases you save up to 50% of the time (I assume mergeFactor=2).

>> 2. Merger is hard coded
>
> In a way that is bad because... ? (i.e. what is the specific problem... I assume you mean index merging functionality?)

Because you cannot process local and/or remote barrels - they all must be local in Lucene's object model. That is a serious bug IMHO.

>> 4. you cannot implement dissemination + wrappers for internet servers which would serve as static barrels
>
> Could you explain this bit more thoroughly (or pointers to a longer explanation)?

Read more about dissemination, metasearch engines (e.g. SavvySearch), and dDIRs (e.g. Harvest). BTW, let's go to a pub and we can talk till morning :) (it is a serious offer, because I would like to know more about IR). This example is about metasearch (the simplest case of dDIRs): can Lucene allow that a barrel (index segment?) is static, and a query is solved via a wrapper that sends the query ${QUERY} to http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=${QUERY} and then reads the HTML output as the result?

>> 5. Document metadata cannot be stored as a programmer wants, he must translate the object to a set of fields
>
> Yes? I'd think that the possibility of doing separate fields is a good thing; after all, all a plain text search engine needs to provide (to be considered one) is indexing of plain text data, right?
I was talking about metadata. When the metadata object knows how to achieve its own persistence, why would one translate anything to fields and then back? Why would you touch the user's metadata at all? You need flat fields for indexing, and whatever is around them is not your problem :). Lucene is something between a CMS and a CIS; you say that it's closer to a CIS, but when you need metadata in fields, you are closer to a CMS IMHO.

>> 6. Lucene cannot implement your own dynamization
>
> (sorry, I must sound real thick here). Could you elaborate on this... what do you mean by dynamization?

Read more about Dynamization of Decomposable Searching Problems.

-g-
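The two-thread indexing split described under point 1 - divide the document set in half, index each half independently, and merge the resulting barrels at the end - can be sketched without Lucene at all. In the sketch below, plain sorted maps of term-to-postings stand in for barrels, and the final merge is where Lucene's own merger (e.g. IndexWriter's addIndexes step) would do the work; all class and method names here are illustrative, not Lucene API:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of two-way parallel indexing: each task builds its own partial
// index ("barrel") in isolation, so no locking is needed until the merge.
public class ParallelIndexSketch {

    public static Map<String, List<Integer>> index(List<String> docs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        int mid = docs.size() / 2;
        Future<Map<String, List<Integer>>> a = pool.submit(() -> indexPart(docs, 0, mid));
        Future<Map<String, List<Integer>>> b = pool.submit(() -> indexPart(docs, mid, docs.size()));
        // Merge the two barrels; since part b's doc ids are all larger,
        // appending keeps each postings list sorted.
        Map<String, List<Integer>> merged = new TreeMap<>(a.get());
        b.get().forEach((term, postings) ->
            merged.merge(term, postings, (x, y) -> { x.addAll(y); return x; }));
        pool.shutdown();
        return merged;
    }

    // Builds an inverted index for docs[from..to): term -> list of doc ids.
    private static Map<String, List<Integer>> indexPart(List<String> docs, int from, int to) {
        Map<String, List<Integer>> idx = new TreeMap<>();
        for (int id = from; id < to; id++) {
            for (String term : docs.get(id).toLowerCase().split("\\s+")) {
                idx.computeIfAbsent(term, t -> new ArrayList<>()).add(id);
            }
        }
        return idx;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(index(Arrays.asList("hello world", "hello lucene", "world peace")));
    }
}
```

The point of the scheme is visible in the shape of the code: the two `indexPart` calls never touch shared state, so only the cheap final merge is serial.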
Re: Potential Lucene drawbacks
> If I understand you correctly, then maybe you are not aware of RemoteSearchable in Lucene.

That class cannot be used in Merger. RemoteSearchable is a class that allows you to pass a query to another node, nothing less and nothing more, AFAIK.

> This is the point that's more clear to me now. There is confusion about what Lucene is and what it is not. Lucene does not even try to be what those services you mentioned are. Their goals are different, they are a different set of tools. Lucene's focus is on indexing text and searching it. It is not a tool to query other existing search

I do not think so. It is all about the object model you use. If you are not able to solve the simplest case, how can you distribute the engine across the network? I do not mean simple RMI gateways which marshal parameters and send them through a network pipe; I mean a true system that could beat Google (and that is another topic...). Moreover, I think that Lucene can do much more than you think, Otis :). Egothor can do that, so why not Lucene?

-g-
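The "static barrel behind a wrapper" idea from the earlier message - substitute the query into a ${QUERY} URL template, fetch the page, and read the HTML output as the result list - can be sketched like this. The fetch step is a pluggable function, so the real HTTP round trip (e.g. via java.net.URLConnection) is left as an assumption you would swap in; the template and the link-scraping regex are illustrative only, not how any real engine formats its results:

```java
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of a "static barrel" backed by a remote search engine: the
// query is substituted into a URL template, the page is fetched, and
// result links are scraped out of the returned HTML.
public class WrapperBarrel {
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    private final String template;                // e.g. ".../search?q=${QUERY}"
    private final Function<String, String> fetch; // URL -> HTML, pluggable

    public WrapperBarrel(String template, Function<String, String> fetch) {
        this.template = template;
        this.fetch = fetch;
    }

    // Expands the template, fetches the page, and returns one "hit" per link.
    public List<String> search(String query) throws Exception {
        String url = template.replace("${QUERY}", URLEncoder.encode(query, "UTF-8"));
        String html = fetch.apply(url);
        List<String> hits = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            hits.add(m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // A canned fetcher stands in for the network call here:
        WrapperBarrel barrel = new WrapperBarrel(
            "http://engine.example/search?q=${QUERY}",
            url -> "<a href=\"http://example.com/hit\">result</a>");
        System.out.println(barrel.search("lucene"));
    }
}
```

A merger that only knew the `search(query)` contract, rather than concrete local index classes, could treat such a wrapper and a local index interchangeably, which is the object-model point being argued in this exchange.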
Re: Potential Lucene drawbacks
--- Leo Galambos [EMAIL PROTECTED] wrote:
>> If I understand you correctly, then maybe you are not aware of RemoteSearchable in Lucene.
>
> That class cannot be used in Merger. RemoteSearchable is a class that allows you to pass a query to another node, nothing less and nothing more, AFAIK.

What is Merger? Verb, noun, an IR concept, the name of a product or project? Merging of results from multiple searchers over multiple indices?

>> This is the point that's more clear to me now. There is confusion about what Lucene is and what it is not. Lucene does not even try to be what those services you mentioned are. Their goals are different, they are a different set of tools. Lucene's focus is on indexing text and searching it. It is not a tool to query other existing search
>
> I do not think so. It is all about the object model you use. If you are not able to solve the simplest case, how can you distribute the engine across the network? I do not mean simple RMI gateways which marshal parameters and send them through a network pipe; I mean a true system that could beat Google (and that is another topic...).

That is the difference between a simple library and a targeted application.

> Moreover, I think that Lucene can do much more than you think, Otis :). Egothor can do that, so why not Lucene?

Yes, Lucene can do more than I think it can, why not. Maybe this is being done already... with Lucene... ;)

Otis