Re: Retrieving Results - getting blank entries?
Maybe you are getting strings that contain only whitespace. Try to trim() the Strings.

-Original Message-
From: Rishabh Bajpai [mailto:[EMAIL PROTECTED]
Sent: Monday, 16 June 2003 08:25
To: Lucene Users List
Subject: Retrieving Results - getting blank entries?

Hi All,

I am retrieving results in the normal manner: construct a query, get the Hits object, and iterate through it...

doc = hits.doc(i);

If any field name or value is null or blank, I don't display that result:

if ((field.name() == null) || (field.stringValue() == null)
        || (field.name().equals("")) || (field.stringValue().equals(""))) {
    addToResultSet = false;
}

But in some rare cases I am still getting blank records displayed. Is it some problem that happened while indexing, a bug in Lucene, or am I totally missing something? Please help...

-Rishabh
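The trim() suggestion can be sketched stand-alone (plain Java, no Lucene dependency; the helper name is hypothetical): a value containing only whitespace passes the equals("") check above, but fails it after trim().

```java
public class BlankFieldFilter {
    // Returns true when a field name/value pair should be skipped:
    // null, empty, or whitespace-only after trim().
    public static boolean isBlank(String name, String value) {
        return name == null || value == null
                || name.trim().length() == 0 || value.trim().length() == 0;
    }

    public static void main(String[] args) {
        System.out.println(isBlank("title", "Lucene")); // false
        // "   ".equals("") is false, but the value is still blank:
        System.out.println(isBlank("title", "   "));    // true
        System.out.println(isBlank(null, "Lucene"));    // true
    }
}
```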
Bug in QueryParser ?
I've got the following exception during my tests with a query like "word1 || word2 || word3" if one of the words, e.g. word2, is in the stop word list of my Analyzer:

java.lang.ArrayIndexOutOfBoundsException: -1 < 0
    at java.util.Vector.elementAt(Vector.java:427)
    at org.apache.lucene.queryParser.QueryParser.addClause(QueryParser.java:171)
    at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:463)
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:113)

I'm using Lucene 1.3 rc1. Is this a bug?

Michael
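Until the parser is fixed, one possible workaround (a plain-Java sketch, not an official fix; the operator handling is an assumption about this query style) is to drop stop words from the raw query string before handing it to QueryParser, removing any "||" operators left dangling:

```java
import java.util.*;

public class StopwordPrefilter {
    // Removes stop words (case-insensitive) from a whitespace-separated
    // query, then drops "||"/"&&" operators left dangling by the removal.
    public static String strip(String query, Set<String> stopWords) {
        List<String> kept = new ArrayList<String>();
        for (String tok : query.split("\\s+")) {
            if (stopWords.contains(tok.toLowerCase())) continue;
            kept.add(tok);
        }
        List<String> clean = new ArrayList<String>();
        for (String tok : kept) {
            boolean isOp = tok.equals("||") || tok.equals("&&");
            boolean prevIsOp = clean.isEmpty()
                    || clean.get(clean.size() - 1).equals("||")
                    || clean.get(clean.size() - 1).equals("&&");
            if (isOp && prevIsOp) continue; // leading or doubled operator
            clean.add(tok);
        }
        while (!clean.isEmpty()) { // trailing operator
            String last = clean.get(clean.size() - 1);
            if (last.equals("||") || last.equals("&&")) clean.remove(clean.size() - 1);
            else break;
        }
        StringBuilder sb = new StringBuilder();
        for (String tok : clean) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(tok);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("word2"));
        System.out.println(strip("word1 || word2 || word3", stop)); // word1 || word3
    }
}
```

The pre-filtered string then never produces the empty clause that trips addClause.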
Re: About Query...
Your syntax seems to be wrong; try

Author:Williams AND Title:Sword -Title:House

or

Author:Williams AND Title:Sword NOT Title:House

Michael

-Original Message-
From: Pierre Lacchini [mailto:[EMAIL PROTECTED]
Sent: Monday, 17 March 2003 10:47
To: Lucene (E-mail)
Subject: About Query...

Well guys, here's my (silly) question: I have 2 fields in my index, for example Title and Author. I want to perform a complex query like: search Williams in field Author AND Sword in field Title WITHOUT House in field Title. I tried this syntax:

Author:Williams AND Title:Sword -House

But it doesn't seem to work... Is it possible? Or maybe I'm wrong with the syntax?

Thanks for the help ;)

Pierre Lacchini
Consultant développement PeopleWare
12, rue du Cimetière L-8413 Steinfort
Phone: + 352 399 968 35
http://www.peopleware.lu
Re: About Query...
mmmh ... good question - I really don't know :(

-Original Message-
From: Pierre Lacchini [mailto:[EMAIL PROTECTED]
Sent: Monday, 17 March 2003 12:32
To: 'Lucene Users List'
Subject: RE: About Query...

Sorry for my poor English... Well, if I perform a multiple-field query, why do I have to specify the name of the field in the parse method? Because I'm using 2 fields in the query...

-Original Message-
From: Borkenhagen, Michael (ofd-ko zdfin) [mailto:[EMAIL PROTECTED]
Sent: Monday, 17 March 2003 12:07
To: 'Lucene Users List'
Subject: Re: About Query...

Yes, for sure. Maybe I don't understand your question?

-Original Message-
From: Pierre Lacchini [mailto:[EMAIL PROTECTED]
Sent: Monday, 17 March 2003 12:26
To: 'Lucene Users List'
Subject: RE: About Query...

Yeah, thanks Michael, now it works fine :) But in this case, does the second argument of the method parse(String query, String field, Analyzer analyzer) of the QueryParser matter?
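The second argument of parse() only supplies a default field for terms written without an explicit "Field:" prefix; once every clause names its field, as in the corrected query, that argument is never consulted. A plain-Java sketch that assembles such fully qualified query strings (the helper names are hypothetical, not Lucene API):

```java
public class FieldedQueryBuilder {
    // Builds query strings in which every term names its field
    // explicitly, so the QueryParser default field never applies.
    public static String must(String field, String term) {
        return field + ":" + term;
    }

    public static String and(String left, String right) {
        return left + " AND " + right;
    }

    public static String not(String left, String right) {
        return left + " NOT " + right;
    }

    public static void main(String[] args) {
        String q = not(and(must("Author", "Williams"), must("Title", "Sword")),
                       must("Title", "House"));
        System.out.println(q); // Author:Williams AND Title:Sword NOT Title:House
    }
}
```

The resulting string can be fed to QueryParser.parse with any default field, since no term falls back to it.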
Re: my experiences - Re: Parsing Word Docs
Ryan,

I tried to use textmining to extract text from Word 97 documents. Some German characters like ä, ü etc. aren't parsed correctly, so I can't use it, because many German words include these characters. I don't know if the reason is textmining or HDF from POI (HSSF from POI parses these characters correctly). Do you have any hints for me?

Michael

-Original Message-
From: Ryan Ackley [mailto:[EMAIL PROTECTED]
Sent: Thursday, 6 March 2003 13:13
To: Lucene Users List
Subject: Re: my experiences - Re: Parsing Word Docs

David,

The textmining.org stuff only works on Word 97 and above. It should work with no exceptions on any Word 97 doc. If you have any problems, then it is from an earlier version (most likely Word 6.0) or it's not a Word document. If this isn't the case, you need to email me so I can fix it and make it better for the benefit of everyone. I plan on adding support for Word 6 in the future.

Ryan Ackley

- Original Message -
From: David Spencer [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, March 05, 2003 6:24 PM
Subject: my experiences - Re: Parsing Word Docs

FYI, I tried the textmining.org/POI combo on a collection of 350 Word docs people have developed here over the years, and it failed on 33% of them with exceptions being thrown about the formats being invalid. I tried antiword (http://www.winfield.demon.nl/), a native free *.exe, and it worked great (well, it seemed to process all the files fine). I've had similar experiences with PDF: I tried the 3 or so freeware/Java PDF text extractors, and they were not as good as the exe, pdftotext, from foolabs (http://www.foolabs.com/xpdf/). Not satisfying to a Java developer, but these work better than anything else I can find. You get source, and I use them on Windows and Linux, no problem.

Eric Anderson wrote:

I'm interested in using the textmining/textextraction utilities using Apache POI that Ryan was discussing. However, I'm having some difficulty determining what the insertion point would be to replace the default parser with the Word parser. Any assistance would be appreciated.

LanRx Network Solutions, Inc.
Re: my experiences - Re: Parsing Word Docs
thx a lot :) I'll try it

-Original Message-
From: Mario Ivankovits [mailto:[EMAIL PROTECTED]
Sent: Thursday, 6 March 2003 14:00
To: Lucene Users List
Subject: Re: my experiences - Re: Parsing Word Docs

The problems with German umlauts should be fixed. I have posted a patch (see http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14735), and it should be applied now. I haven't cross-checked it for now.

I currently use POI to index documents with Lucene, but I do not use the standard way with a Lucene word-document class (like the PDFDocument). For sure, I have had some problems with getting the text from old documents, but in this case my system falls back to a simple STRINGS parser (it filters any human-readable char from the document file).

byebye
Mario
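Mario's STRINGS-parser fallback can be sketched in plain Java as a crude strings(1)-style filter (the minimum run length of 4 is an assumption, and this ASCII-only version would still miss the German umlauts under discussion):

```java
public class StringsFallback {
    // Extracts runs of printable ASCII (length >= minRun) from raw
    // bytes, like the Unix strings tool; a last-resort fallback when
    // a real Word/PDF parser fails on an old document.
    public static String extract(byte[] data, int minRun) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            char c = (char) (b & 0xFF);
            if (c >= 0x20 && c < 0x7F) {
                run.append(c);
            } else {
                if (run.length() >= minRun) out.append(run).append('\n');
                run.setLength(0);
            }
        }
        if (run.length() >= minRun) out.append(run).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] doc = "\u0001\u0002Hello world\u0000x\u0000Lucene rocks\u0003".getBytes();
        System.out.print(extract(doc, 4)); // "Hello world" and "Lucene rocks", one per line
    }
}
```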
Re: [ANN] PDFBox 0.6.0
Ben,

Using PDFBox 0.5.6 and, alternatively, PDFBox 0.6.0, I receive the following stack trace:

java.lang.ClassCastException: org.pdfbox.cos.COSObject
    at org.pdfbox.encoding.DictionaryEncoding.init(DictionaryEncoding.java:66)
    at org.pdfbox.cos.COSObject.getEncoding(COSObject.java:269)
    at org.pdfbox.cos.COSObject.encode(COSObject.java:210)
    at org.pdfbox.util.PDFTextStripper.showString(PDFTextStripper.java:959)
    at org.pdfbox.util.PDFTextStripper.handleOperation(PDFTextStripper.java:788)
    at org.pdfbox.util.PDFTextStripper.process(PDFTextStripper.java:379)
    at org.pdfbox.util.PDFTextStripper.process(PDFTextStripper.java:366)
    at org.pdfbox.util.PDFTextStripper.processPageContents(PDFTextStripper.java:288)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:231)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:223)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:148)
    ...

(Stack trace from PDFBox 0.6.0.) I also received the error Eric reported, but only once. My indexer continues parsing the other PDF documents after getting an error. Do you have any idea regarding the ClassCastException?

Michael

-Original Message-
From: Ben Litchfield [mailto:[EMAIL PROTECTED]
Sent: Thursday, 6 March 2003 14:45
To: Lucene Users List
Subject: Re: [ANN] PDFBox 0.6.0

In this release I have changed how I parse the document, which may have introduced this bug. I have received another report of this and will have it fixed for the next point release.

You said you tried it on a reasonably sized PDF repository. Did you stop indexing at this error, or did you continue? If you continued, is this the only error that you got?

-Ben

On Thu, 6 Mar 2003, Eric Anderson wrote:

Ben-

In attempting to use PDFBox 0.6.0, I received the following error when attempting to scan a reasonably sized PDF repository. Any thoughts?

caught a class java.io.EOFException with message: Unexpected end of ZLIB input stream

Eric Anderson
LanRx Network Solutions

Quoting Ben Litchfield [EMAIL PROTECTED]:

I would like to announce the next release of PDFBox. PDFBox allows PDF documents to be indexed with Lucene through a simple interface. Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument, which will extract all text and PDF document summary properties as Lucene fields. You can obtain the latest release from http://www.pdfbox.org. Please send all bug reports to me and attach the PDF document when possible.

RELEASE 0.6.0
- Massive improvements to memory footprint.
- Must call close() on the COSDocument (LucenePDFDocument does this for you).
- Really fixed the bug where small documents were not being indexed.
- Fixed bug where no whitespace existed between obj and the start of an object (Exception in thread main java.io.IOException: expected='obj' actual='obj/Pro).
- Fixed issue with spacing where textLineMatrix was not being copied properly.
- Fixed 'bug' where parsing would fail on some PDFs with double endobj definitions.
- Added PDF document summary fields to the Lucene document.

Thank you,
Ben Litchfield
http://www.pdfbox.org
Re: Best HTML Parser !!
I prefer JTidy http://lempinen.net/sami/jtidy/.

Michael

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Monday, 24 February 2003 15:03
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Best HTML Parser !!

It's not possible to generalize like that. I like NekoHTML.

Otis

--- Pierre Lacchini [EMAIL PROTECTED] wrote:

Hello, I'm trying to index HTML files with Lucene. Do you know what's the best HTML parser in Java? The most powerful? I need to extract meta tags and many other different text fields... Thanks for your help ;)
Re: IndexWriter addDocument NullPointerException
Yes, it is possible. Instead of catching an Exception, you can do anything else, e.g.

try {
    ...
} catch (MyException e) {
    System.err.println(e.getClass().getName());
}

But this is off-topic here; it's a general question about Java.

Michael

-Original Message-
From: Günter Kukies [mailto:[EMAIL PROTECTED]
Sent: Monday, 24 February 2003 17:52
To: Lucene Users List
Subject: Re: IndexWriter addDocument NullPointerException

I switched off the -server switch from the java command-line options, and everything works fine now. I changed nothing in my code. So is it in principle possible to throw an Exception with no stack trace? Any comments about this phenomenon?

Günter

- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, February 24, 2003 4:31 PM
Subject: Re: IndexWriter addDocument NullPointerException

If I were you, I would make things simpler for myself by converting the code to something that I could run from the command line instead of having to go through Tomcat. You really need to capture your exception stack trace with line numbers, and then we can try helping.

Otis

--- Günter Kukies [EMAIL PROTECTED] wrote:

log("doc: " + doc); is handled by Tomcat and directed into special log files, so you can't see them.

System.err.println("hallo1 " + doc);
ex.printStackTrace();
System.err.println("hallo2");

This is printing the relevant output. doc is never null, writer is never null, and I can't add null fields to a document.

Günter

- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, February 24, 2003 3:07 PM
Subject: Re: IndexWriter addDocument NullPointerException

My guess is that your 2 getDocument calls are the source, that is, that those PDF and TXT classes don't return a proper Document. I also don't see the output created by log("doc: " + doc);

Otis

if (path.matches("\\d+_\\d{4}_[a-z]{2,3}\\.pdf")) {
    doc = PDF_Document_Parser.getDocument(this, RealPath, file);
} else if (path.matches("\\d+_\\d{4}_[a-z]{2,3}\\.txt")) {
    doc = TXT_Document_Parser.getDocument(this, RealPath, file);
}

--- Günter Kukies [EMAIL PROTECTED] wrote:

So, the weekend is over. Here is some code:

private void addDocument(IndexWriter writer, File file) throws IOException, InterruptedException {
    String path = file.getName();
    log("-start Indexing: " + path);
    Document doc = null;
    if (path.matches("\\d+_\\d{4}_[a-z]{2,3}\\.pdf")) {
        doc = PDF_Document_Parser.getDocument(this, RealPath, file);
    } else if (path.matches("\\d+_\\d{4}_[a-z]{2,3}\\.txt")) {
        doc = TXT_Document_Parser.getDocument(this, RealPath, file);
    } else {
        log("do nothing");
    }
    log("doc: " + doc);
    if (doc != null) {
        try {
            writer.addDocument(doc);
        } catch (Exception ex) {
            System.err.println("hallo1 " + doc);
            ex.printStackTrace();
            System.err.println("hallo2");
            log("ERROR writer.addDocument(doc);");
        }
    } else {
        log("Skipping " + path);
    }
    log("-end Indexing: " + path);
}

Here is the output:

hallo1 DocumentTextcontents:[EMAIL PROTECTED] Unindexedemail:[EMAIL PROTECTED] Unindexedname:Hans Dampf Textsummary:Equipo de deteccion 2002 Texttitle:Equipo de deteccion 2002 Textdoctypeid:0001 Unindexedlifetime:0 [EMAIL PROTECTED] Keywordmodified:0dcek766w Keywordusername:hda Unindexedrelative_path_xml:documents/news_new/sub1/sub11/sub111/1045735974680_0001_hda.xml Unindexedcategory:documents/news_new/sub1/sub11/sub111/ Keywordsearch_all:all [EMAIL PROTECTED] Unindexedrelative_path:documents/news_new/sub1/sub11/sub111/1045735974680_0001_hda.pdf
java.lang.NullPointerException
hallo2
hallo1 DocumentTextcontents:[EMAIL PROTECTED] Unindexedemail:[EMAIL PROTECTED] Unindexedname:Hans Dampf Textsummary:testsummary Texttitle:testtitle Textdoctypeid:0001 Unindexedlifetime:0 [EMAIL PROTECTED] Keywordmodified:0dcek76bm Keywordusername:hda Unindexedrelative_path_xml:documents/news_new/sub1/sub11/sub111/1045735974850_0001_hda.xml Unindexedcategory:documents/news_new/sub1/sub11/sub111/ Keywordsearch_all:all [EMAIL PROTECTED] Unindexedrelative_path:documents/news_new/sub1/sub11/sub111/1045735974850_0001_hda.pdf
java.lang.NullPointerException
hallo2
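Since the NullPointerException arrives with no line numbers, one hedged way to narrow it down is to validate the field name/value pairs yourself before calling writer.addDocument(doc) and report exactly which entry is null. A plain-Java sketch over a name-to-value map (the helper is hypothetical, not Lucene API):

```java
import java.util.*;

public class FieldValidator {
    // Returns the names of entries whose key or value is null, so a
    // NullPointerException inside addDocument can be traced back to
    // the offending field before indexing.
    public static List<String> nullFields(Map<String, String> fields) {
        List<String> bad = new ArrayList<String>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (e.getKey() == null || e.getValue() == null) {
                bad.add(String.valueOf(e.getKey()));
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("title", "Equipo de deteccion 2002");
        fields.put("summary", null); // simulates a parser returning null
        System.out.println(nullFields(fields)); // [summary]
    }
}
```

Logging this list just before addDocument would show whether the parsers really never hand back a null value.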
Re: Using term-highlighter
You have to write a class which implements the TermHighlighter interface, for example like this:

public class MyHighlighter implements TermHighlighter {
    public String highlightTerm(String term) {
        return "<font class='highlight'>" + term + "</font>";
    }
}

Use this class on your query results after searching:

Document doc = this.ivHits.doc(i);
String doctitle = doc.get(Konstanten.F_TITLE);
doctitle = LuceneTools.highlightTerms(doctitle, new MyHighlighter(), this.ivQuery, analyzer);

Regards,
Michael

-Original Message-
From: Harpreet S Walia [mailto:[EMAIL PROTECTED]]
Sent: Friday, 21 February 2003 07:37
To: Lucene Users List
Subject: Using term-highlighter

Hi,

I am trying to use the term-highlighter posted on the contribution page for Lucene. I downloaded the files and made the changes mentioned in the whitepaper to the classes in the Lucene search package. Can anybody please tell me how to invoke the highlighter while searching? Currently I am performing the searches as follows:

org.apache.lucene.search.Searcher searcher = new IndexSearcher(indexPath);
Query query = QueryParser.parse(srchqry, field, new SimpleAnalyzer());
Hits hits = searcher.search(query);

What changes will be needed in these search steps?

Thanks in advance!

Regards,
Harpreet
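The effect of highlightTerm can be illustrated stand-alone (plain Java, no Lucene dependency; the markup and the case-insensitive matching are assumptions, and real query-aware highlighting is what LuceneTools.highlightTerms adds on top):

```java
import java.util.*;
import java.util.regex.Pattern;

public class SimpleHighlighter {
    // Wraps each case-insensitive occurrence of any term in highlight
    // markup, mimicking what a TermHighlighter implementation returns
    // per term. Note: terms matching already-inserted markup could be
    // double-wrapped; a real highlighter works on analyzed tokens.
    public static String highlight(String text, Set<String> terms) {
        for (String term : terms) {
            text = text.replaceAll("(?i)(" + Pattern.quote(term) + ")",
                    "<font class='highlight'>$1</font>");
        }
        return text;
    }

    public static void main(String[] args) {
        Set<String> terms = Collections.singleton("lucene");
        System.out.println(highlight("Lucene in Action", terms));
    }
}
```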
Re: Compile lucene
Here is the exact link: http://www.mail-archive.com/lucene-user@jakarta.apache.org/ :))

-Original Message-
From: Oshima, Scott [mailto:[EMAIL PROTECTED]]
Sent: Friday, 10 January 2003 20:00
To: Lucene Users List
Subject: RE: Compile lucene

Can anyone send me a link to the Lucene mailing list email archives? These emails build up fast and I can't store them locally, but they are too valuable to delete. Thanks.

-Original Message-
From: Romo García, Javier [mailto:[EMAIL PROTECTED]]
Sent: Thursday, September 12, 2002 1:19 AM
To: Lucene Users List
Subject: Compile lucene

Hi everyone! Is there a good guide anywhere to compiling the source code of Lucene? I don't know very well how to start, especially with javacc. Thanks
Re: PDFBox 0.5.6
Thank you very, very much! This version is really great - it fixes most of the problems I had with earlier versions!

-Original Message-
From: Ben Litchfield [mailto:[EMAIL PROTECTED]]
Sent: Friday, 29 November 2002 04:42
To: [EMAIL PROTECTED]
Subject: PDFBox 0.5.6

PDFBox version 0.5.6 is now available at http://www.pdfbox.org

PDFBox makes it easy to add PDF documents to a Lucene index. Fixes over the last version:
- Fixed bug in LucenePDFDocument where the stream was not being closed and small documents were not being indexed.
- Fixed a spacing issue for some PDF documents.
- Fixed an error while parsing the version number.
- Fixed NullPointer in the persistence example.
- Created an example Lucene IndexFiles class which models the demo from Lucene.
- Fixed bug where garbage at the end of a file caused an infinite loop.
- Fixed bug in parsing boolean values with stuff at the end, like "true"

Ben Litchfield
Re: PDF parser
There are different parsers available - every parser has its own advantages and disadvantages. I use a combination of PDFBox http://www.pdfbox.org/ and Etymon PJ http://www.etymon.com/pjc/, because their APIs are very simple. Both of them parse PDF into a format of their own and provide interfaces to get at the PDF document's contents. Other developers on this list prefer JPedal http://www.jpedal.org/, which parses PDF into XML and provides an XML tree with the PDF document's contents, but the documentation isn't very detailed.

Micha

-Original Message-
From: Thomas Chacko [mailto:[EMAIL PROTECTED]]
Sent: Friday, 22 November 2002 15:26
To: Lucene Users List
Subject: PDF parser

What's the best parser available to extract text from PDF documents? Expecting a reply ASAP. Thanks in advance.

Thomas Chacko