Re: Multiple indexes
Is it true that for each index I have to create a separate instance of FSDirectory, IndexWriter and IndexReader? Do I need to create a separate locking mechanism as well? I have already implemented a program using just one index.

Thanks,
Ben

On Tue, 1 Mar 2005 22:09:05 -0500, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> It's hard to answer such a general question with anything very precise, so sorry if this doesn't hit the mark. Come back with more details and we'll gladly assist though.
>
> First, certainly do not copy/paste code. Use standard reuse practices: perhaps the same program can build the two different indexes if passed different parameters, or share code between two different programs as a JAR.
>
> What specifically are the issues you're encountering?
>
> Erik
>
> On Mar 1, 2005, at 8:06 PM, Ben wrote:
> > Hi
> >
> > My site has two types of documents with different structure. I would like to create an index for each type of document. What is the best way to implement this?
> >
> > I have been trying to implement this but found out that 90% of the code is the same.
> >
> > In the Lucene in Action book, there is a case study on jGuru; it just mentions that they use multiple indexes. I would like to do something like them.
> >
> > Any resources on the Internet that I can learn from?
> >
> > Thanks,
> > Ben

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
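[Editor's note] To the question itself: yes, each physical index lives in its own directory with its own FSDirectory/IndexWriter/IndexReader, and as far as I know Lucene manages the lock file per index directory itself, so no extra locking mechanism is needed. Erik's "same program, different parameters" suggestion could look something like this sketch (Lucene 1.4-era API; the index paths are made up for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Sketch: one parameterized method builds either index, so the ~90%
// shared code lives in one place. The paths are hypothetical; Lucene
// creates and releases the lock file for each directory on its own.
public class IndexBuilder {
    static void build(String indexPath, Document[] docs) throws Exception {
        // 'true' re-creates the index from scratch
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
        for (int i = 0; i < docs.length; i++) {
            writer.addDocument(docs[i]);
        }
        writer.optimize();
        writer.close();
    }

    public static void main(String[] args) throws Exception {
        // One index per document type, same code path for both.
        build("indexes/articles", new Document[0]);
        build("indexes/products", new Document[0]);
    }
}
```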
Multiple indexes
Hi

My site has two types of documents with different structure. I would like to create an index for each type of document. What is the best way to implement this?

I have been trying to implement this but found out that 90% of the code is the same.

In the Lucene in Action book, there is a case study on jGuru; it just mentions that they use multiple indexes. I would like to do something like them.

Any resources on the Internet that I can learn from?

Thanks,
Ben
Re: Investigating Lucene For Project
See inlined comments below.

> We have had requests from some clients who would like the ability to "index" PDF files now, and possibly other text files in the future. The PDF files live on a server and are in a structured environment. I would like to somehow index the content inside the PDF and be able to run searches on that information from a web-form. The result MUST BE a text snippet (that being some text prior to the searched word and after the searched word). Does this make sense? And can Lucene do this?

Lucene indexes text documents, so you will need to convert your PDF to a text document. PDFBox (http://www.pdfbox.org/) can do that. PDFBox provides a summary of the document, which is just the first x number of characters; if you want a smarter summary you will need to create that yourself.

> If the product can do this, how is the best way to get rolling on a project of this nature? Purchase an example book, or are there simple examples one can pick up on? Does Lucene have a large learning curve, or is it reasonably quick?

There are tutorials available on the website, and I would recommend the "Lucene in Action" book. There is a learning curve for Lucene, but it sounds like your requirements are pretty basic, so it shouldn't be that hard.

> If all the above will work, what kind of license does this require? I have not been able to find a link to that yet on the jakarta site.

http://www.apache.org/licenses/LICENSE-2.0

Ben
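[Editor's note] The PDF-to-text step Ben describes can be sketched with the PDFBox convenience methods mentioned elsewhere in this archive (PDDocument.load and getText(PDDocument)); the file name here is hypothetical:

```java
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

// Sketch: extract the text of a PDF with PDFBox so it can be fed to a
// Lucene field. The file name is made up for illustration.
public class PdfToText {
    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load("manual.pdf");
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            System.out.println(text);
        } finally {
            document.close();
        }
    }
}
```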
PDF Highlighter Package
For those of you who support indexing PDF documents, PDFBox now supports Adobe's PDF Highlight specification (http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf). PDFBox is now capable of generating an XML document that describes which words in a PDF document to highlight.

An "in action" example can be seen at http://pavilion.csh.rit.edu:8080/pdfbox/index.html. You can enter any web-accessible PDF and any keywords. The PDF will open normally and, after a short pause (this is running on an old, slow server), will jump to the first selected keyword.

Source code is available in CVS or in tonight's nightly build. Any comments/suggestions are welcome. Special thanks to Stephan Lagraulet, who made this possible with code contributions.

Ben
http://www.pdfbox.org
Sorting a date stored in milliseconds
Hi

I store my date in milliseconds; how can I sort on it? SortField has INT, FLOAT and STRING. Do I need to create a new sort class to sort the long value?

Thanks
Ben
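[Editor's note] One common workaround (my suggestion, not from this thread): since SortField has no long type, encode the millisecond value as a fixed-width, zero-padded string so that lexicographic order matches numeric order, index it untokenized, and sort it as a STRING. A minimal pure-Java sketch of the encoding (the class name is made up):

```java
// Sketch: encode a (non-negative) millisecond timestamp as a fixed-width
// string so that lexicographic order matches numeric order. Index the
// result as an untokenized field and sort on it with SortField.STRING.
public class MillisKey {
    // 19 digits covers any non-negative long (Long.MAX_VALUE has 19 digits).
    public static String encode(long millis) {
        StringBuffer buf = new StringBuffer("0000000000000000000"); // 19 zeros
        String digits = Long.toString(millis);
        buf.replace(19 - digits.length(), 19, digits);
        return buf.toString();
    }

    public static void main(String[] args) {
        // "...0999" sorts before "...1000" lexicographically, as desired.
        System.out.println(encode(999L));
        System.out.println(encode(1000L));
    }
}
```

Another option, if I remember the 1.4 API right, is a custom SortComparatorSource, but the string encoding keeps the index simple.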
Sorting isn't working for my date field
Hi

Do I need to store and index the field I want to sort on? Currently I am only indexing the field, without storing or tokenizing it.

I have a date field indexed as MMdd, and I have two documents with the same date. When I do my search with:

searcher.search(query, new SortField("date", true));
searcher.search(query, new SortField("date", false));

they both return the same order. Any idea?

Thanks.
Ben
Re: MultiFieldQueryParser 1.8 isn't parsing phrases
Thanks On Sat, 19 Feb 2005 16:09:49 +0100, Daniel Naber <[EMAIL PROTECTED]> wrote: > On Saturday 19 February 2005 15:26, Ben wrote: > > > When I try to search for phrases using the MultiFieldQueryParser v1.8 > > from CVS, it gives me NullPointerException. > > This has just been fixed in SVN (I assume you mean SVN, CVS still exists > but is read only and probably not updated anymore). > > Regards > Daniel > > -- > http://www.danielnaber.de > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
MultiFieldQueryParser 1.8 isn't parsing phrases
Hi

When I try to search for phrases using the MultiFieldQueryParser v1.8 from CVS, it gives me a NullPointerException.

Using the following keyword works:

title:"IBM backs linux"

However, it gives me the exception if I use the following keyword:

"IBM backs linux"

Any idea why? I am using this MultiFieldQueryParser with Lucene 1.4.3. Of course I changed some of the boolean stuff to make it work with the production release.

Thanks,
Ben
Re: Use an executable from java ...
Kristian,

I assume all of your comments are about the 0.7.0 version of PDFBox. There were some great improvements in that version in terms of speed and accuracy.

> That's curious, because we experienced that pdftotext was able to convert 33% more pdf documents than PDFBox.

Depending on the set of PDF documents you will notice different results. I welcome any bug reports (if they don't already exist) on that 33% that are not working for you. In particular, PDFBox needs some work on non-English languages.

> That's good. Our application supports alternative conversion pipelines that provide fallback mechanisms. If the first converter cannot convert a document, a second converter is called. So PDFBox is our fallback converter.

Well, at least PDFBox made it as the "fallback". :)

Ben
http://www.pdfbox.org
Re: Use an executable from java ...
I will assume you are asking this question on the lucene mailing list because you now want to index that PDF document. Have you tried PDFBox? It can't create an html file for you but it can extract text. Ben http://www.pdfbox.org On Mon, 31 Jan 2005, Bertrand VENZAL wrote: > Hi all, > > I ve a kind of problem to execute a converting tool to modify a pdf to an > html under Linux. In fact, i have an executable "pdftohtml" which work > correctly on batch mode, and when I want to use it through Java under > Windows 2000 works also,BUT it does not work at all on the server under > linux. I m using the following code. > > scommand = "/bin/sh -c \"myCommand fileName output\" "; > > Runtime runtime = Runtime.getRuntime(); > Process proc = runtime.exec(scommand); > proc.waitFor(); > > > I m running my code under Linux-redhat with a classic shell. > Is there an other way to do the same thing or maybe am i missing something > ? > Any help will be grandly appreciate. > > Thanks > Bertrand > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Search results excerpt similar to Google
Hi

Is it hard to implement a function that displays search-result excerpts similar to Google's? Is it just string manipulation, or is there some logic behind it? I like their excerpts.

Thanks
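[Editor's note] At the simplest level it is string manipulation: find where a query term occurs and return a window of text before and after it. (Real highlighters, such as the one in the Lucene sandbox, also score fragments and mark up the matched terms.) A naive pure-Java sketch; the class and method names are made up:

```java
// Naive excerpt builder: returns a window of text around the first
// occurrence of the search term, similar in spirit to a Google snippet.
// Hypothetical helper for illustration only; real highlighters also
// score fragments and highlight the matched terms.
public class Excerpt {
    public static String around(String text, String term, int context) {
        int pos = text.toLowerCase().indexOf(term.toLowerCase());
        if (pos < 0) {
            // No match: fall back to the start of the document.
            return text.substring(0, Math.min(text.length(), 2 * context));
        }
        int start = Math.max(0, pos - context);
        int end = Math.min(text.length(), pos + term.length() + context);
        String prefix = start > 0 ? "..." : "";
        String suffix = end < text.length() ? "..." : "";
        return prefix + text.substring(start, end) + suffix;
    }
}
```

A real implementation would snap the window to word boundaries and pick the fragment containing the most query terms, but the basic mechanics are just this.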
Re: FOP Generated PDF and PDFBox
Ya, when calling LucenePDFDocument.getDocument( File ), the url should be the same as the path. This is the code that the class uses to set those fields:

document.add( Field.UnIndexed("path", file.getPath() ) );
document.add( Field.UnIndexed("url", file.getPath().replace(FILE_SEPARATOR, '/')) );

I have no idea why an FOP PDF would be any different than another PDF. You can also run the class from the command line for debugging purposes, like this:

java org.pdfbox.searchengine.lucene.LucenePDFDocument

It should print out the fields of the Lucene Document object. Is the url there, and is it correct?

Ben

On Fri, 21 Jan 2005, Luke Shannon wrote:
> That is correct. No difference with how other PDFs are handled.
>
> I am looking at the index in Luke now. The FOP generated documents have a path but no URL? I would guess that these would be the same?
>
> Thanks for the speedy reply.
>
> Luke
>
> - Original Message -
> From: "Ben Litchfield" <[EMAIL PROTECTED]>
> To: "Lucene Users List"
> Sent: Friday, January 21, 2005 12:34 PM
> Subject: Re: FOP Generated PDF and PDFBox
>
> > Are you indexing the FOP PDFs differently than other PDF documents?
> >
> > Can I assume that you are using PDFBox's LucenePDFDocument.getDocument() method?
> >
> > Ben
> >
> > On Fri, 21 Jan 2005, Luke Shannon wrote:
> > > Hello;
> > >
> > > Our CMS now allows users to create PDF documents (uses FOP) and then search them.
> > >
> > > I seem to be able to index these documents OK. But when I am generating the results to display I get a NullPointerException while trying to use a variable that should contain the url keyword for one of these documents in the index:
> > >
> > > Document doc = hits.doc(i);
> > > String path = doc.get("url");
> > >
> > > Path contains null.
> > >
> > > The interesting thing is this only happens with PDFs that are generated with FOP. Other PDFs are fine.
> > >
> > > What I find weird is, shouldn't the "url" field just contain the path of the file?
> > >
> > > Anyone else seen this before?
> > >
> > > Any ideas?
> > >
> > > Thanks,
> > >
> > > Luke
Re: FOP Generated PDF and PDFBox
Are you indexing the FOP PDFs differently than other PDF documents?

Can I assume that you are using PDFBox's LucenePDFDocument.getDocument() method?

Ben

On Fri, 21 Jan 2005, Luke Shannon wrote:
> Hello;
>
> Our CMS now allows users to create PDF documents (uses FOP) and then search them.
>
> I seem to be able to index these documents OK. But when I am generating the results to display I get a NullPointerException while trying to use a variable that should contain the url keyword for one of these documents in the index:
>
> Document doc = hits.doc(i);
> String path = doc.get("url");
>
> Path contains null.
>
> The interesting thing is this only happens with PDFs that are generated with FOP. Other PDFs are fine.
>
> What I find weird is, shouldn't the "url" field just contain the path of the file?
>
> Anyone else seen this before?
>
> Any ideas?
>
> Thanks,
>
> Luke
Re: PDFBox deprecated methods
Daniel, Yes, that getText( PDDocument ) is the method you should be using. You no longer need to use a COSDocument object, please note the following methods that go along with the deprecation of getText( COSDocument ) PDFParser.getPDDocument() - to get a PDDocument instead of a COSDocument after parsing PDDocument.load() - A convenience method that does all the PDFParser stuff and returns a PDDocument LucenePDFDocument.getDocument() - to go straight from a File/URL to a lucene document object Ben Quoting Daniel Cortes <[EMAIL PROTECTED]>: > Ok I reply myself > the method deprecated is .getText(Cos Document)) > if you do stripper.getText(new PDDocument(cosDoc)) there isn't any problem. > > > Excuse me, for the question > > > Daniel Cortes wrote: > > > I've been use PDFBox in my indexation of a directory . I've download > > the last version of PDFBox (0.6.7.a) and I've seen that the method > > that I use to extract > > was a deprecated method. PDFTextStripper.getText(). > > stripper.getText(new PDDocument(cosDoc)); > > I know a lot of person use same me this method. What are alternative > > options ? > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - This mail sent through IMP: http://horde.org/imp/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene appreciation
Hi Rony

Very impressive. Is it possible for you to provide some information about the technology behind it? Like how do you crawl other job sites, and how often you do it. Do you use any other open source software, and what are they?

I think you should clean up the data in the "Recent Searches" area; it doesn't make sense for me to see:

company%3Amicrosoft

It does make sense if you display:

company:microsoft

Cheers,
Ben

On Thu, 16 Dec 2004 11:38:20 -0500, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> Rony - nice work! I subscribed to an alert already.
>
> The wiki is self-serve, just log in and add yourself.
>
> Erik
>
> On Dec 16, 2004, at 11:26 AM, Rony Kahan wrote:
> > I'd like to introduce myself and say thanks. We've recently launched http://www.indeed.com, a search engine for jobs based on Lucene. I'm consistently impressed with the quality, professionalism and support of the Lucene project and the Lucene community. This mailing list has been a great help. I'd also like to give mention to some of the consultants who had a big hand in making our project a reality ... Thank you Otis, Aviran, Sergiu & Dawid.
> >
> > As for our project, we're in beta and would love to get your feedback. The index size is currently ~1.8m jobs. My personal email address is rony a_t indeed.com. If you are interested in Lucene work you can set up an rss feed or email alert from here: http://www.indeed.com/search?q=lucene&sort=date
> >
> > Is it possible to be added to the Wiki Powered By page?
> >
> > Thanks Everyone,
> > Rony
> >
> > Indeed.com - one search. all Jobs.
> > http://www.indeed.com
Re: C# Ports
I have created a DLL from the lucene jars for use in the PDFBox project. It uses IKVM(http://www.ikvm.net) to create a DLL from a jar. The binary version can be found here http://www.csh.rit.edu/~ben/projects/pdfbox/nightly-release/PDFBox-.NET-0.7.0-dev.zip This includes the ant script used to create the DLL files. This method is by far the easiest way to port it, see previous posts about advantages and disadvantages. Ben On Wed, 15 Dec 2004, Garrett Heaver wrote: > I was just wondering what tools (JLCA?) people are using to port Lucene to > c# as I'd be well interesting in converting things like snowball stemmers, > wordnet etc. > > > > Thanks > > Garrett > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: QueryFilter vs CachingWrapperFilter vs RangeQuery
thanks chris,

you are correct that i'm not sure if i need the caching ability. it is more to understand right now so that if we do need to implement it, i am able to.

the reason for the caching is that we will have listing pages for certain content types, for example a listing page of articles. this listing will be generated against the lucene engine using a basic query. the page will also have the ability to filter the articles based on date range, as one example, so caching those results could be beneficial. however, we will also potentially want to cache the basic query so that subsequent queries will hit a cache. when new content is published or content is removed from the site, the caches will need to be invalidated so new results are created.

for the basic query, is there any caching mechanism built into the SearchIndexer or do we need to build our own caching mechanism?

thanks
ben

On Tue, 2004-07-12 at 12:29 -0800, Chris Hostetter wrote:
> : > executes the search, i would keep a static reference to SearchIndexer
> : > and then when i want to invalidate the cache, set it to null or create
>
> : design of your system. But, yes, you do need to keep a reference to it
> : for the cache to work properly. If you use a new IndexSearcher
> : instance (I'm simplifying here, you could have an IndexReader instance
> : yourself too, but I'm ignoring that possibility) then the filtering
> : process occurs for each search rather than using the cache.
>
> Assuming you have a finite number of Filters, and assuming those Filters are expensive enough to be worth it...
>
> Another approach you can take to "share" the cache among multiple IndexReaders is to explicitly call the bits method on your filter(s) once, and then cache the resulting BitSet anywhere you want (ie: serialize it to disk if you so choose), and then implement a "BitsFilter" class that you can construct directly from a BitSet regardless of the IndexReader.
> The down side of this approach is that it will *ONLY* work if you are certain that the index is never being modified. If any documents get added, or the index gets re-optimized, you must regenerate all of the BitSets.
>
> (That's why the CachingWrapperFilter's cache is keyed off of the IndexReader ... as long as you're re-using the same IndexReader, it knows that the cached BitSet must still be valid, because an IndexReader always sees the same index as when it was opened, even if another thread/process modifies it.)
>
> class BitsFilter {
>    BitSet bits;
>    public BitsFilter(BitSet bits) {
>      this.bits = bits;
>    }
>    public BitSet bits(IndexReader r) {
>      return (BitSet) bits.clone();
>    }
> }
>
> -Hoss
Re: QueryFilter vs CachingWrapperFilter vs RangeQuery
erik,

thanks for the reply. i get the filter now and understand how the caching works. however the caching is only on the filtering level, which means i can cache results that are filtered. but if i do a basic search against the index and want to cache that, do i need to create my own caching mechanism, or does the SearchIndexer cache the results already? if it caches them already, then to clear the cache, is it again a matter of removing any references to the SearchIndexer instance?

thanks again,
ben

On Tue, 2004-07-12 at 15:18 -0500, Erik Hatcher wrote:
> On Dec 7, 2004, at 3:06 PM, Ben Rooney wrote:
> > i'm trying to understand the difference/effects between QueryFilter vs CachingWrapperFilter and when you would use one vs the other and how they work exactly.
>
> QueryFilter caches the results (bit set of documents) of a query by IndexReader.
>
> CachingWrapperFilter does not actually do any filtering of its own, but merely wraps the results of another non-caching filter, such as DateFilter. CachingWrapperFilter was added to disconnect caching from filtering. QueryFilter is the exception as it came first and already does caching. If you're using QueryFilter, there is no need to concern yourself with CachingWrapperFilter.
>
> > also, when exactly will the cache be cleared. looking at the source code, it appears when the IndexReader is released it would be cleared. does this mean i should keep a reference to the SearchIndexer until i want the results to be cleared? for example, in a class file that executes the search, i would keep a static reference to SearchIndexer and then when i want to invalidate the cache, set it to null or create a new instance of it?
>
> How you keep a reference to the IndexSearcher instance is up to the design of your system. But, yes, you do need to keep a reference to it for the cache to work properly.
If you use a new IndexSearcher > instance (I'm simplifying here, you could have an IndexReader instance > yourself too, but I'm ignoring that possibility) then the filtering > process occurs for each search rather than using the cache. > > Erik > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] >
QueryFilter vs CachingWrapperFilter vs RangeQuery
nts", analyzer);
Query rangeQuery = new RangeQuery(new Term("publishDate", "20040101"), new Term("publishDate", "20041231"), true);
BooleanQuery query2004 = new BooleanQuery();
query2004.add(query, true, false);
query2004.add(rangeQuery, true, false);

start = new Date();
for (int i = 0; i < 100; i++) {
    hits = searcher.search(query);
    if (i == 0) logger.debug(hits.length() + " total matching documents");
}
end = new Date();
logger.info("query 1 - all docs - total time (ms): " + (end.getTime() - start.getTime()));

start = new Date();
for (int i = 0; i < 100; i++) {
    hits = searcher.search(query2004);
    if (i == 0) logger.debug(hits.length() + " total matching documents");
}
end = new Date();
logger.info("query 2 - 2004 range query - no cache - total time (ms): " + (end.getTime() - start.getTime()));

QueryFilter filter2004 = new QueryFilter(rangeQuery);
start = new Date();
for (int i = 0; i < 100; i++) {
    hits = searcher.search(query, filter2004);
    if (i == 0) logger.debug(hits.length() + " total matching documents");
}
end = new Date();
logger.info("query 3 - 2004 docs filter - no cache - total time (ms): " + (end.getTime() - start.getTime()));

CachingWrapperFilter cache2004 = new CachingWrapperFilter(filter2004);
start = new Date();
for (int i = 0; i < 100; i++) {
    hits = searcher.search(query, cache2004);
    if (i == 0) logger.debug(hits.length() + " total matching documents");
}
end = new Date();
logger.info("query 4 - 2004 docs filter - cached - total time (ms): " + (end.getTime() - start.getTime()));
} catch (Exception e) {
    logger.error("unexpected exception trying to execute search", e);
}
}
}

thanks in advance for any help

ben
.NET Version of Lucene
I know there has been talk about a .NET version of Lucene. I have been looking into doing something similar for PDFBox and came across a project called IKVM (http://www.ikvm.net/). I don't believe it has been mentioned on this list.

It takes a little different approach than what people have been trying. It uses GNU Classpath to bring all of the newer JDK classes into .NET, and you can run a command line app to create a DLL from a jar. So, for example:

ikvmc.exe -reference:ikvm.gnu.classpath.dll -reference:IKVM.AWT.WinForms.dll -out:bin\lucene-1.4.2.dll external\lucene-1.4.2.jar

The drawback is that you will need to include the ikvm.gnu.classpath.dll in your project, which is about 3 megs, but to be able to use Lucene in .NET and not have to use a manual process when a new version comes out is pretty cool. I have not gotten around to running the junit tests yet, but that is next. For PDFBox, which depends on ANT/junit/log4j/lucene, I was able to run the jar->DLL process for each of those projects and run PDFBox in .NET without a problem.

One licensing note: GNU Classpath is released as GPL "with an exception", allowing it to be rereleased under a different license. See http://www.gnu.org/software/classpath/license.html for more details.

Ben
Re: PDF Indexing Error
I don't think that is a good solution, as there are many bug fixes and enhancements in the current version and you would never be able to upgrade. The message that you are seeing "You do not have permission to extract text" is not a bug but intended functionality of PDFBox. PDFBox honors the security settings in a PDF, if you don't have permission to extract the text then PDFBox won't allow you to do it, just as Acrobat will not allow you to do it. PDFBox supports *modification* of PDF documents as well as text extraction. Ben On Fri, 3 Dec 2004, Luke Shannon wrote: > Hi Ben; > > Actually I think I did update PDFBox. I will put it back to the version I > previously had. > > Luke > > - Original Message - > From: "Ben Litchfield" <[EMAIL PROTECTED]> > To: "Lucene Users List" <[EMAIL PROTECTED]> > Sent: Thursday, December 02, 2004 8:20 PM > Subject: Re: PDF Indexing Error > > > > > > This error is because of security settings that have been applied to the > > PDF document which disallow text extraction. > > > > Not sure why you would all of a sudden get this error, unless you upgraded > > recently. Older versions of PDFBox did not fully support PDF security. > > > > Ben > > > > On Thu, 2 Dec 2004, Luke Shannon wrote: > > > > > Hello All; > > > > > > Perhaps this should be on the PDFBox forum but I was curious if anyone > has > > > seen this error parsing PDF documents using packages other than PDFBox. > > > > > > /usr/tomcat/fb_hub/GM/Administration/Document/java/java_io.pdf > > > java.io.IOException: You do not have permission to extract text > > > > > > The weird thing is it gave this error on a document I have indexed a > million > > > times over the last 3 weeks. 
> > > > > > Thanks, > > > > > > Luke > > > > > > > > > > > > - > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: PDF Indexing Error
This error is because of security settings that have been applied to the PDF document which disallow text extraction. Not sure why you would all of a sudden get this error, unless you upgraded recently. Older versions of PDFBox did not fully support PDF security. Ben On Thu, 2 Dec 2004, Luke Shannon wrote: > Hello All; > > Perhaps this should be on the PDFBox forum but I was curious if anyone has > seen this error parsing PDF documents using packages other than PDFBox. > > /usr/tomcat/fb_hub/GM/Administration/Document/java/java_io.pdf > java.io.IOException: You do not have permission to extract text > > The weird thing is it gave this error on a document I have indexed a million > times over the last 3 weeks. > > Thanks, > > Luke > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: PDF Index Time
PDFBox is slow; there is an open issue for it on the SourceForge site. I am actively working on improving speed, and you should see significant improvements in the next release.

I have not extensively tried the snowtide package, but they have a trial download, and the docs show that it should be just as easy to integrate as PDFBox is. They list pricing on their site as well, which is nice; it is not hidden, as some software companies do.

Ben

On Thu, 18 Nov 2004, Luke Shannon wrote:
> Hi;
>
> I am using the PDFBox's getLuceneDocument method to parse my PDF documents. It returns good results and was very easy to integrate into the project. However it is slow.
>
> Does anyone know of a faster package? Someone mentioned snowtide on an earlier post. Anyone have experience with this package?
>
> Luke
Re: Need advice: what pdf lib to use?
In order to write software that consumes PDF documents you must agree to a list of conditions. One of those conditions is that permissions specified by the author of the PDF document are respected. PDFBox complies with this statement, if there is software that does not then they are in violation of copyright law. That being said, PDFBox is open source so a user could make modifications to the source code, or as a PDF library could change permissions on a document. Ben On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote: > Yes Ben, You are right. > > This would be correct functionality from technical perspective. But look > it my way with application programmer eyes reporting to big boss that c. > 30% of doc we cope with could not be indexed because of this stupid > limitation. Neither he or me have any influence on pdf owners and any > ideas about what made them create files with documet security set. > > In short, if You also could implement this "uncorrect functionality" the > "closed source" guys did, it would be really great! > > As far as sponsoring is concerned I would be ready to hack (or at least to > try) it even for 1/3 of that fortune:))) > > J. > > > > > > Ben Litchfield <[EMAIL PROTECTED]> > 25.10.2004 14:02 > Please respond to "Lucene Users List" > > > To: Lucene Users List <[EMAIL PROTECTED]> > cc: (bcc: Iouli Golovatyi/X/GP/Novartis) > Subject:Re: Need advice: what pdf lib to use? > Category: > > > > > PDFBox does not 'stumble' when it gives that message, that is correct > functionality if that permission is not allowed. > > If your company is willing to pay a 'fortune' why not sponsor a change to > an open source project for half a fortune. > > Ben > http://www.pdfbox.org > > On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote: > > > PDFbox stumbles also with "class java.io.IOException with message: - > You > > do not have permission to extract text" in case the doc is copy/print > > protected. 
> > I tested now the snowtide commercial product and it looks like it could > > process these files as well. Performance was also not so bad. > Unfortunatly > > the test result could not be considered as 100%, because the free > version > > processed just first 8 pages. After all this product costs a fortune > > (as long the company is ready to pay I don't realy mind:)) > > > > J. > > > > > > > > > > > > Robert Newson <[EMAIL PROTECTED]> > > Sent by: news <[EMAIL PROTECTED]> > > 24.10.2004 17:44 > > Please respond to "Lucene Users List" > > > > > > To: [EMAIL PROTECTED] > > cc: (bcc: Iouli Golovatyi/X/GP/Novartis) > > Subject:Re: Need advice: what pdf lib to use? > > Category: > > > > > > > > [EMAIL PROTECTED] wrote: > > > Hello all, > > > > > > I need a piece of advice/experience.. > > > > > > What pdf parser (written in java) u'd recommend? > > > > > > I played now with PDFBox-0.6.7a and would not say I was satisfied too > > much > > > with it > > > > > > On certain pdf's (not well formated but anyway readable with acrobate) > > it > > > run into dead loop (this I could fix in code), > > > and on one file it produced "out of memory error" and killed jvm:( > (this > > > > > problem I could not identify yet) > > > > > > After all the performance was not too great as well: it took c. 19 h. > to > > > > > index 13000 files (c. 3.5Gb) > > > > > > Regards, > > > J. > > > > > > > > > > > > > On the specific problem of the "dead loop", I reported an instance of > > this to Ben a week or so ago and he has fixed it in the latest > > nightlies. I expect an official release will include this bugfix soon. > > The file in question was unreadable with any PDF software I have, but > > someone managed to create it somehow... > > > > http://sourceforge.net/tracker/index.php?func=detail&aid=1037145&group_id=78314&atid=552832 > > > > I've found pdfbox to be pretty good. The only time I get problems is > > with corrupted or egregiously bad PDF files. > > > > B. 
> > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Need advice: what pdf lib to use?
PDFBox does not 'stumble' when it gives that message, that is correct functionality if that permission is not allowed. If your company is willing to pay a 'fortune' why not sponsor a change to an open source project for half a fortune. Ben http://www.pdfbox.org On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote: > PDFbox stumbles also with "class java.io.IOException with message: - You > do not have permission to extract text" in case the doc is copy/print > protected. > I tested now the snowtide commercial product and it looks like it could > process these files as well. Performance was also not so bad. Unfortunatly > the test result could not be considered as 100%, because the free version > processed just first 8 pages. After all this product costs a fortune > (as long the company is ready to pay I don't realy mind:)) > > J. > > > > > > Robert Newson <[EMAIL PROTECTED]> > Sent by: news <[EMAIL PROTECTED]> > 24.10.2004 17:44 > Please respond to "Lucene Users List" > > > To: [EMAIL PROTECTED] > cc: (bcc: Iouli Golovatyi/X/GP/Novartis) > Subject:Re: Need advice: what pdf lib to use? > Category: > > > > [EMAIL PROTECTED] wrote: > > Hello all, > > > > I need a piece of advice/experience.. > > > > What pdf parser (written in java) u'd recommend? > > > > I played now with PDFBox-0.6.7a and would not say I was satisfied too > much > > with it > > > > On certain pdf's (not well formated but anyway readable with acrobate) > it > > run into dead loop (this I could fix in code), > > and on one file it produced "out of memory error" and killed jvm:( (this > > > problem I could not identify yet) > > > > After all the performance was not too great as well: it took c. 19 h. to > > > index 13000 files (c. 3.5Gb) > > > > Regards, > > J. > > > > > > > > On the specific problem of the "dead loop", I reported an instance of > this to Ben a week or so ago and he has fixed it in the latest > nightlies. I expect an official release will include this bugfix soon. 
> The file in question was unreadable with any PDF software I have, but > someone managed to create it somehow... > > http://sourceforge.net/tracker/index.php?func=detail&aid=1037145&group_id=78314&atid=552832 > > I've found pdfbox to be pretty good. The only time I get problems is > with corrupted or egregiously bad PDF files. > > B. > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Need advice: what pdf lib to use?
Please post any PDFBox issues you notice on the PDFBox sourceforge bug list, if possible attach/email any problem PDFs that you encounter. There are some efforts underway to improve the speed of PDFBox, you can monitor the progress at http://sourceforge.net/tracker/index.php?func=detail&aid=1046300&group_id=78314&atid=552832 As for other suggestions, I know some people have utilized xpdf(open source but non Java) to extract the text. For other Java solutions PDFTextStream(commercial) - "Fastest PDF-to-Text Solution for Java" http://snowtide.com/home/PDFTextStream/ Etymon PJ (GPL) http://www.etymon.com/ Ben http://www.pdfbox.org On Fri, 22 Oct 2004 [EMAIL PROTECTED] wrote: > Hello all, > > I need a piece of advice/experience.. > > What pdf parser (written in java) u'd recommend? > > I played now with PDFBox-0.6.7a and would not say I was satisfied too much > with it > > On certain pdf's (not well formated but anyway readable with acrobate) it > run into dead loop (this I could fix in code), > and on one file it produced "out of memory error" and killed jvm:( (this > problem I could not identify yet) > > After all the performance was not too great as well: it took c. 19 h. to > index 13000 files (c. 3.5Gb) > > Regards, > J. > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Google Desktop Could be Better
The latest PDFBox jar is 2179K which, as you point out, is significantly larger than the jar in Parsnips. The majority of that space is taken up by the cmap mapping files used for proper text extraction, so any classes that could be removed would only yield a minor size reduction. I would think that the capability of indexing PDF documents would outweigh the extra time for the download. Ben On Sat, 16 Oct 2004, Bill Tschumy wrote: > > On Oct 16, 2004, at 9:47 PM, Ben Litchfield wrote: > > > > >> types. It uses Lucene underneath. I'm thinking about extending it in > >> the direction that Google Desktop is going and automatically index > >> certain file types and directories in your system. > > > > And of course supporting PDF documents right! > > > > Ben > > http://www.pdfbox.org > > > > Ahem... right... My next version will do a better job with PDF and > RTF files. I've looked at pdfBox, but the jar file is so big that I > hate to burden my users by incorporating it. Any chance of getting a > smaller version that just does the text extraction? Your jar file is > more than twice the size of my entire application including > documentation. I really would like to solve this problem. > -- > Bill Tschumy > Otherwise -- Austin, TX > http://www.otherwise.com > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Google Desktop Could be Better
> types. It uses Lucene underneath. I'm thinking about extending it in > the direction that Google Desktop is going and automatically index > certain file types and directories in your system. And of course supporting PDF documents right! Ben http://www.pdfbox.org - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Highlighting PDF file after the search
With some work this is possible with PDFBox. PDFBox extracts text with positioning and sizing information. Once the matched text is found, you could add the drawing of a highlight box to the page content stream. PDFBox has an open RFE for this functionality; please monitor it for progress. http://sourceforge.net/tracker/index.php?func=detail&aid=1035635&group_id=78314&atid=552835 Ben On Mon, 27 Sep 2004 [EMAIL PROTECTED] wrote: > Bruce, > You are right, i tried this morning and when i try to stream the > higlighter output as pdf, acrobat was not able to read or open it!! > Which project do you recommend that would do pdf highlighting? > > Thanks, > Vijay Balasubramanian > DPRA Inc., > > > > > On 09/20/2004 05:35 PM, Bruce Ritchie wrote: > > > > From: [EMAIL PROTECTED] > > > I can successfully index and search the PDF documents, > > however i am not able to highlight the searched text in my > > original PDF file (ie: like dtSearch highlights on original file) > > > > I took a look at the highlighter in sandbox, compiled it and > > have it ready. I am wondering if this highlighter is for > > highlighting indexed documents or can it be used for PDF > > Files as is ! Please enlighten ! > > The highlighter code in sandbox can facilitate highlighting of text > *extracted* from the PDF, however it does nothing for you to highlight > search terms *inside* of the PDF. For that you will need some sort of > tool > that can modify the PDF on the fly as the user views it. I know of no > quick > and dirty tool that allows you to do this, though there are quite a few > projects and products which allow you to manipulate PDF files which > likely > can be used to obtain the behavior you are looking for (with some effort > on > your part). 
> > > Regards, > > Bruce Ritchie > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
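Following Bruce's distinction above, highlighting search terms in the text *extracted* from a PDF (as opposed to inside the PDF itself) can be done with the sandbox highlighter. The sketch below assumes the sandbox `Highlighter`/`QueryScorer` classes and the 0.6.x PDFBox API used elsewhere in this thread; the field name "contents" and exact method signatures may differ between versions.

```java
import java.io.FileInputStream;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class HighlightExtractedText {
    public static void main(String[] args) throws Exception {
        // 1. Extract plain text from the PDF, as done elsewhere in this thread.
        PDFParser parser = new PDFParser(new FileInputStream(args[0]));
        parser.parse();
        PDDocument pdDoc = parser.getPDDocument();
        String text = new PDFTextStripper().getText(pdDoc);
        pdDoc.close();

        // 2. Highlight the query terms in the extracted text.
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Query query = QueryParser.parse(args[1], "contents", analyzer);
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        TokenStream tokens =
            analyzer.tokenStream("contents", new StringReader(text));
        // Print up to three highlighted fragments, joined by "...".
        System.out.println(highlighter.getBestFragments(tokens, text, 3, "..."));
    }
}
```

This only highlights fragments of the extracted text for display in search results; highlighting inside the original PDF still requires modifying the PDF content stream as Ben describes above.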
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
> I can say that gc is not collecting these objects since I forced gc > runs when indexing every now and then (when parsing pdf-type objects, > that is): No effect. What PDF parser are you using? Is the problem within the parser and not lucene? Are you releasing all resources? Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
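Regarding "are you releasing all resources": a common leak in long indexing runs is a parsed PDF that is never closed when extraction throws. A minimal sketch, using the 0.6.x PDFBox API (`PDFParser`, `PDDocument`, `PDFTextStripper`) seen elsewhere in this thread, that closes the document and the stream even on failure:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class ExtractAndRelease {
    public static String extract(String path) throws Exception {
        InputStream in = new BufferedInputStream(new FileInputStream(path));
        PDDocument pdDoc = null;
        try {
            PDFParser parser = new PDFParser(in);
            parser.parse();
            pdDoc = parser.getPDDocument();
            return new PDFTextStripper().getText(pdDoc);
        } finally {
            // Releasing the document and stream per file is what keeps a
            // long re-indexing run from accumulating unreclaimable objects.
            if (pdDoc != null) pdDoc.close();
            in.close();
        }
    }
}
```

If memory still grows with this pattern in place, the problem is more likely inside the parser than in Lucene.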
Re: PDF->Text Performance comparison
> 1) I tried to migrate to newer versions (0.6.4, 0.6.5, 0.6.6), but all the time I had > problems with parsing the same pdf documents, which worked well for > 0.6.3. I mentioned my problems here: > https://sourceforge.net/tracker/?func=detail&atid=552832&aid=1021691&group_id=78314 I am waiting for a response from you on this issue, try to login to SF when posting bugs so you get a notification when it is updated. > 2) When I started with 0.6.3 I experienced performance problems > too, especially with large pdf documents (I had several with more > than 20MB size). I changed the source a bit, wrapping the following line > of the BaseParser class: I will give that a try, thanks for letting me know. Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: PDF->Text Performance comparison
Yes, that and a few other adjectives, but I didn't want to get carried away. Ben On Wed, 8 Sep 2004, Doug Cutting wrote: > Ben Litchfield wrote: > > PDFBox: slow PDF text extraction for Java applications > > http://www.pdfbox.org > > Shouldn't that read, "PDFBox: *free* slow PDF text extraction for Java > applications, with Lucene integration"? > > Doug > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
PDF->Text Performance comparison
On Wed, 8 Sep 2004, Chas Emerick wrote: > PDFTextStream: fast PDF text extraction for Java applications > http://snowtide.com/home/PDFTextStream/ For those that have not seen, snowtide.com has done a performance comparison against several Java PDF->Text libraries, including Snowtide's PDFTextStream, PDFBox, Etymon PJ and JPedal. It appears to be fairly well done. http://snowtide.com/home/PDFTextStream/Performance PDFBox: slow PDF text extraction for Java applications http://www.pdfbox.org :) Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: pdf in Chinese
This appears to be more of a PDFBox issue than a lucene issue, please post an issue to the PDFBox site. Also note, that because of certain encodings that a PDF writer can use, it is impossible to extract text from all PDF documents. Ben On Wed, 8 Sep 2004, [EMAIL PROTECTED] wrote: > it is not about analyzer ,i need to read text from pdf file first. > > - Original Message - > From: "Chandan Tamrakar" <[EMAIL PROTECTED]> > To: "Lucene Users List" <[EMAIL PROTECTED]> > Sent: Wednesday, September 08, 2004 4:15 PM > Subject: Re: pdf in Chinese > > > > which analyzer you are using to index chinese pdf documents ? > > I think you should use cjkanalyzer > > - Original Message - > > From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > > To: <[EMAIL PROTECTED]> > > Sent: Wednesday, September 08, 2004 11:27 AM > > Subject: pdf in Chinese > > > > > > > Hi all, > > > i use pdfbox to parse pdf file to lucene document.when i parse > > Chinese > > > pdf file,pdfbox is not always success. > > > Is anyone have some advice? > > > > > > > > > - > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Moving from a single server to a cluster
My application currently uses Lucene with an index living on the filesystem, and it works fine. I'm moving to a clustered environment soon and need to figure out how to keep my indexes together. Since the index is on the filesystem, each machine in the cluster will end up with a different index. I looked into JDBC Directory, but it's not tested under Oracle and doesn't seem like a very mature project. What are other people doing to solve this problem? -- Ben Sinclair [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: PDF indexing
You need to add the log4j.jar to your classpath.

On Tue, 24 Aug 2004, sivalingam T wrote:

> Hi, I have written one file for PDF indexing. I have written it as follows. This is my IndexPDF file.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

import java.io.File;
import java.util.Date;
import java.util.Arrays;

class IndexPDF {
  private static boolean deleting = false;  // true during deletion pass
  private static IndexReader reader;        // existing index
  private static IndexWriter writer;        // new index being built
  private static TermEnum uidIter;          // document id iterator

  public static void main(String[] argv) {
    try {
      String index = "index";
      boolean create = false;
      File root = null;

      String usage = "IndexHTML [-create] [-index <index>] <root_directory>";

      if (argv.length == 0) {
        System.err.println("Usage: " + usage);
        return;
      }

      for (int i = 0; i < argv.length; i++) {
        if (argv[i].equals("-index")) {          // parse -index option
          index = argv[++i];
        } else if (argv[i].equals("-create")) {  // parse -create option
          create = true;
        } else if (i != argv.length-1) {
          System.err.println("Usage: " + usage);
          return;
        } else
          root = new File(argv[i]);
      }

      Date start = new Date();

      if (!create) {                  // delete stale docs
        deleting = true;
        indexDocs(root, index, create);
      }

      writer = new IndexWriter(index, new StandardAnalyzer(), create);
      writer.maxFieldLength = 100;

      indexDocs(root, index, create); // add new docs

      System.out.println("Optimizing index...");
      writer.optimize();
      writer.close();

      Date end = new Date();
      System.out.print(end.getTime() - start.getTime());
      System.out.println(" total milliseconds");
    } catch (Exception e) {
      System.out.println(" caught a " + e.getClass() +
                         "\n with message: " + e.getMessage());
    }
  }

  /* Walk directory hierarchy in uid order, while keeping uid iterator from
  /* existing index in sync.  Mismatches indicate one of: (a) old documents to
  /* be deleted; (b) unchanged documents, to be left alone; or (c) new
  /* documents, to be indexed.
   */
  private static void indexDocs(File file, String index, boolean create)
      throws Exception {
    if (!create) {                               // incrementally update
      reader = IndexReader.open(index);          // open existing index
      uidIter = reader.terms(new Term("uid", "")); // init uid iterator

      indexDocs(file);

      if (deleting) {                            // delete rest of stale docs
        while (uidIter.term() != null && uidIter.term().field() == "uid") {
          System.out.println("deleting " +
              HTMLDocument.uid2url(uidIter.term().text()));
          reader.delete(uidIter.term());
          uidIter.next();
        }
        deleting = false;
      }

      uidIter.close();                           // close uid iterator
      reader.close();                            // close existing index
    } else                                       // don't have existing
      indexDocs(file);
  }

  private static void indexDocs(File file) throws Exception {
    if (file.isDirectory()) {                    // if a directory
      String[] files = file.list();              // list its files
      Arrays.sort(files);                        // sort the files
      for (int i = 0; i < files.length; i++) {   // recursively index them
        indexDocs(new File(file, files[i]));
      }
    }
    if ((file.getPath().endsWith(".pdf")) ||
        (file.getPath().endsWith(".PDF"))) {
      System.out.println("Indexing PDF document: " + file);
      try {
        //Document doc = LucenePDFDocument.getDocument( file );
        writer.addDocument(LucenePDFDocument.getDocument(file));
      } catch(Exception e) {}
    }
  }
}

> When I use the following command, the exception below is thrown. If anybody knows, please inform me.
C:\>java org.apache.lucene.demo.IndexPDF -create -index c:\lucene\pdf c:\pdfs\Words.pdf
Indexing PDF document: c:\pdfs\Words.pdf
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Category
    at org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDocument.java:197)
    at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:118)
    at org.apache.lucene.demo.IndexPDF.indexDocs(Unknown Source)
    at org.apache.lucene.demo.IndexPDF.indexDocs(Unknown Source)
    at org.apache.lucene.demo.Inde
Re: integration of lucene with pdfbox
If you can use lucene on its own then you already know how to add a lucene Document to the index. So you need to be able to take a PDF and get a lucene Document. org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument() does that for you. Ben On Mon, 23 Aug 2004, Santosh wrote: > I have downloaded pdfbox and lucene and kept jar files in the class path, I am able > to work with both of them independently but how can I integrate both > > regards > Santosh kumar > > ---SOFTPRO DISCLAIMER-- > > Information contained in this E-MAIL and any attachments are > confidential being proprietary to SOFTPRO SYSTEMS is 'privileged' > and 'confidential'. > > If you are not an intended or authorised recipient of this E-MAIL or > have received it in error, You are notified that any use, copying or > dissemination of the information contained in this E-MAIL in any > manner whatsoever is strictly prohibited. Please delete it immediately > and notify the sender by E-MAIL. > > In such a case reading, reproducing, printing or further dissemination > of this E-MAIL is strictly prohibited and may be unlawful. > > SOFTPRO SYSYTEMS does not REPRESENT or WARRANT that an attachment > hereto is free from computer viruses or other defects. > > The opinions expressed in this E-MAIL and any ATTACHEMENTS may be > those of the author and are not necessarily those of SOFTPRO SYSTEMS. > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
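The two steps described above can be sketched as a minimal program: PDFBox produces the Lucene Document, and the rest is plain Lucene indexing. The index path "index" is just an example; `LucenePDFDocument.getDocument(File)` is the PDFBox helper named in the reply.

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class IndexOnePdf {
    public static void main(String[] args) throws Exception {
        // Open (or create) an index, as with any other Lucene application.
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        // PDFBox does the PDF-specific work: parse the file and build a
        // Lucene Document with the extracted text in its fields.
        Document doc = LucenePDFDocument.getDocument(new File(args[0]));

        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}
```

From here the PDF is searchable exactly like any other document in the index.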
Re: Fw: pdf search
In order to search through a PDF document the text must be extracted from the PDF document. There are several libraries to do that, including http://www.pdfbox.org After you have the text from the PDF document you just add it to the lucene index like any other text document. You should go through the intro tutorial to understand how to index/search text using lucene. Ben On Fri, 20 Aug 2004, Santosh wrote: > How can I search through PDF? > - Original Message - > From: Santosh > To: Lucene Users List > Sent: Friday, August 20, 2004 5:59 PM > Subject: pdf search > > > Hi, > > I am new bee to lucene. > > I have downloaded zip file. now how can i give my own list words to lucene? > In the demo i saw that lucene is automatically creating index if we run the java > program.but I want to give my own search words, how is it possible? > > > regards > Santosh kumar > SoftPro Systems > Hyderabad > > > "The harder you train in peace, the lesser you bleed in war" > > ---SOFTPRO DISCLAIMER-- > > Information contained in this E-MAIL and any attachments are > confidential being proprietary to SOFTPRO SYSTEMS is 'privileged' > and 'confidential'. > > If you are not an intended or authorised recipient of this E-MAIL or > have received it in error, You are notified that any use, copying or > dissemination of the information contained in this E-MAIL in any > manner whatsoever is strictly prohibited. Please delete it immediately > and notify the sender by E-MAIL. > > In such a case reading, reproducing, printing or further dissemination > of this E-MAIL is strictly prohibited and may be unlawful. > > SOFTPRO SYSYTEMS does not REPRESENT or WARRANT that an attachment > hereto is free from computer viruses or other defects. > > The opinions expressed in this E-MAIL and any ATTACHEMENTS may be > those of the author and are not necessarily those of SOFTPRO SYSTEMS. > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: PDFBox Issue
PDFBox comes with log4j version 1.2.5(according to MANIFEST.MF in jar file), I believe that 1.2.8 is the latest. I will make sure that the next version of PDFBox includes the latest log4j version, which I assume is what everybody would like to use. But, by looking at the below error message it appears that you might have an older log4j in your classpath Logger.getLogger( Class ) is available in 1.2.5 and 1.2.8 Ben On Tue, 17 Aug 2004, Don Vaillancourt wrote: > Wow, this is an old message. > > I managed to get my code to work by using the previous version of > PDFBox. I had used the version of log4j that had come with PDFBox. > > Someone had mentioned recompiling log4j, but I couldn't get the project > to import the source into Eclipse, so I gave up. But things work great > with the version of PDFBox that I compiled with so I am fine with that. > > As for the version of log4j, I could not tell you, as I said above it > came with PDFBox, so I'm guessing that it had probably not been tested > with the version of log4j it was being distributed with. > > Paul Smith wrote: > > >What version of the log4j jar are you using? > > > > > > > >>-Original Message- > >>From: Don Vaillancourt [mailto:[EMAIL PROTECTED] > >>Sent: Tuesday, June 29, 2004 8:06 AM > >>To: Lucene Users List > >>Subject: PDFBox Issue > >> > >>Hi all, > >> > >>I know that this is a Lucene list but wanted to know if any of you have > >>gotten this error before using PDFBox? 
> >> > >>I've gotten the latest version of PDFBox and it is giving me the following > >>error: > >> > >>java.lang.VerifyError: (class: org/apache/log4j/LogManager, method: > >> signature: ()V) Incompatible argument to function > >>at org.apache.log4j.Logger.getLogger(Logger.java:94) > >>at org.pdfbox.pdfparser.PDFParser.(PDFParser.java:57) > >>at > >>org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDocum > >>ent.java:197) > >>at > >>org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocu > >>ment.java:118) > >>at Index.indexFile(Index.java:287) > >>at Index.indexDirectory(Index.java:265) > >>at Index.update(Index.java:63) > >>at Lucene.main(Lucene.java:26) > >>Exception in thread "main" > >> > >>I am using all the jar files that came with PDFBox. > >> > >>Anyone run into this problem. I am using the following line of code: > >> > >>Document doc = LucenePDFDocument.getDocument(f); > >> > >>Thanks > >> > >> > >>Don Vaillancourt > >>Director of Software Development > >> > >>WEB IMPACT INC. > >>416-815-2000 ext. 245 > >>email: [EMAIL PROTECTED] > >>web: http://www.web-impact.com > >> > >> > >> > >> > >>This email message is intended only for the addressee(s) > >>and contains information that may be confidential and/or > >>copyright. If you are not the intended recipient please > >>notify the sender by reply email and immediately delete > >>this email. Use, disclosure or reproduction of this email > >>by anyone other than the intended recipient(s) is strictly > >>prohibited. No representation is made that this email or > >>any attachments are free of viruses. Virus scanning is > >>recommended and is the responsibility of the recipient. 
> >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > > > > > > > >- > >To unsubscribe, e-mail: [EMAIL PROTECTED] > >For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > -- > *Don Vaillancourt > Director of Software Development > * > *WEB IMPACT INC.* > phone: 416-815-2000 ext. 245 > fax: 416-815-2001 > email: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]> > web: http://www.web-impact.com > > > > / This email message is intended only for the addressee(s) > and contains information that may be confidential and/or > copyright. If you are not the intended recipient please > notify the sender by reply email and immediately delete > this email. Use, disclosure or reproduction of this email > by anyone other than the intended recipient(s) is strictly > prohibited. No representation is made that this email or > any attachments are free of viruses. Virus scanning is > recommended and is the responsibility of the recipient. > / > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: pdfbox performance.
Different PDFs will exhibit different extraction speeds because of the way PDF documents are structured. I assume you are using the latest version, 0.6.6; could you give 0.6.5 a try and see if you notice faster speeds?

Ben

On Thu, 29 Jul 2004, Miroslaw Milewski wrote:
> Paul Smith wrote:
>
> > The first thing that I would do is wrap the FileInputStream with a
> > BufferedInputStream.
> > Change:
> > FileInputStream reader = new FileInputStream(file);
> > To:
> > InputStream reader = new BufferedInputStream(new FileInputStream(file));
> > You get a significant boost reading in from a buffer, particularly as
> > the size of the file grows. Try that first, and then rebenchmark.
>
> I tested both, here is the code:
>
> File file = new File("test.pdf");
> InputStream reader = null;
>
> for(int i=1; i<=6; i++) {
>
>   long start = Calendar.getInstance().getTimeInMillis();
>   long step01 = Calendar.getInstance().getTimeInMillis();
>   String stream = null;
>
>   if(i%2 == 0) {
>     reader = new BufferedInputStream(new FileInputStream(file));
>     stream = "buffered";
>   }
>   else {
>     reader = new FileInputStream(file);
>     stream = "no buffer";
>   }
>
>   PDFParser parser = null;
>   PDDocument pdDoc = null;
>
>   parser = new PDFParser(reader);
>   parser.parse();
>   pdDoc = parser.getPDDocument();
>
>   long step02 = Calendar.getInstance().getTimeInMillis();
>
>   PDFTextStripper stripper = new PDFTextStripper();
>   String pdftext = stripper.getText(pdDoc);
>
>   long step03 = Calendar.getInstance().getTimeInMillis();
>
>   pdDoc.close();
>
>   long end = Calendar.getInstance().getTimeInMillis();
>
>   System.out.println("iteration: " + i + " - " + stream);
>   System.out.println("start: " + start);
>   System.out.println("step01: " + (step01-start));
>   System.out.println("step02: " + (step02-start));
>   System.out.println("step03: " + (step03-start));
>   System.out.println("end: " + (end-start));
> }
>
> And below are the benchmarks for buffered and unbuffered readers. The
> difference is not stunning. It seems to get better with time, but this
> is probably due to some VM optimisation. 
And I'll extract the text only > once :-). > > file: 9kB, text only; > > iteration: 1 - no buffer > step01: 0; step02: 1492; step03: 13850; end: 13880 > > iteration: 2 - buffered > step01: 0; step02: 912; step03: 10245; end: 10265 > > iteration: 3 - no buffer > step01: 0; step02: 951 ;step03: 9924; end: 9944 > > iteration: 4 - buffered > step01: 0; step02: 842; step03: 10075; end: 10105 > > iteration: 5 - no buffer > step01: 0; step02: 831; step03: 9934; end: 9954 > > iteration: 6 - buffered > step01: 0; step02: 932; step03: 9944; end: 9965 > > > file: 74 kB; text only > > iteration: 1 - no buffer > step01: 0; step02: 4918; step03: 33959; end: 33989 > > iteration: 2 - buffered > step01: 0; step02: 4367; step03: 32367; end: 32407 > > iteration: 3 - no buffer > step01: 0; step02: 4306; step03: 30995; end: 31025 > > iteration: 4 - buffered > step01: 0; step02: 4296; step03: 30734; end: 30764 > > iteration: 5 - no buffer > step01: 0; step02: 4266; step03: 30754; end: 30784 > > iteration: 6 - buffered > step01: 0; step02: 4256; step03: 30634; end: 30664 > > > file: 270 kB, text only > > iteration: 1 - no buffer > step01: 0; step02: 30634; step03: 142225; end: 142265 > > iteration: 2 - buffered > step01: 0; step02: 29893; step03: 135354; end: 135394 > > iteration: 3 - no buffer > step01: 0; step02: 29553; step03: 134654; end: 134694 > > iteration: 4 - buffered > step01: 0; step02: 29613; step03: 134944; end: 134984 > > iteration: 5 - no buffer > step01: 0; step02: 29543; step03: 139070; end: 139110 > > iteration: 6 - buffered > step01: 0; step02: 32427; step03: 150457; end: 150487 > > Anyway, I suppose I made a wrong assumption while designing my app. I > don't think I can get a performance boost of 90% or so. Thus the > documents (at least the .pdfs) won't be extracted and indexed at the > time of adding them to the knowledge base. 
> Since I also have a db involved, I can keep the basic data there, and > extract and index in the meantime - most likely using a different thread. > > thx, > -- > Miroslaw Milewski > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: PDFBox problem.
I usually use -Dlog4j.configuration=log4j.xml when invoking java from the command line, but I believe this depends on your environment. e.g. java -Dlog4j.configuration=log4j.xml org.pdfbox.ExtractText input.pdf Ben On Fri, 23 Jul 2004, Christiaan Fluit wrote: > We invoke the following code in a static initializer that simply > disables log4j's output entirely. > > static { > Properties props = new Properties(); > props.put("log4j.threshold", "OFF"); > org.apache.log4j.PropertyConfigurator.configure(props); > } > > Of course, when you make use of log4j in your own code, you have to be > more specific. > > > Regards, > > Chris. > -- > > Natarajan.T wrote: > > > FYI, > > > > I am using PDFBox.jar to Convert PDF to Text. > > > > Problem is in the runtime its printing lot of object messages > > > > How can I avoid this one??? How can I go with this one. > > > > import java.io.InputStream; > > import java.io.BufferedWriter; > > import java.io.IOException; > > > > import org.pdfbox.util.PDFTextStripper; > > import org.pdfbox.pdfparser.PDFParser; > > import org.pdfbox.pdmodel.PDDocument; > > import org.pdfbox.pdmodel.PDDocumentInformation; > > > > > > /** > > * @author natarajant > > * > > * TODO To change the template for this generated type comment go to > > * Window - Preferences - Java - Code Generation - Code and Comments */ > > public class PDFConverter extends DocumentConverter{ > > > > public PDFConverter() { > > } > > > >/** > > * This method will construct the Lucene document object from the > > * given information by extracting the text from PDF file. > > * > > * @param reader and writer - InputStream > > and BufferedWriter > > * @return true or false i.e. 
extract the > > text or not > > */ > > public boolean extractText(InputStream reader, BufferedWriter > > writer) throws IOException{ > > > > PDFParser parser = null; > > PDDocument pdDoc = null; > > PDFTextStripper stripper = null; > > String pdftext = ""; > > String pdftitle = ""; > > try { > > parser = new PDFParser(reader); > >parser.parse(); > >pdDoc = parser.getPDDocument(); > > > >stripper = new PDFTextStripper(); > >pdftext = stripper.getText(pdDoc); > > > >writer.write(pdftext +" "); > > > > PDDocumentInformation info = > > pdDoc.getDocumentInformation(); > >pdftitle = info.getTitle(); > > > >} catch(Exception err) { > > > >System.out.println(err.getMessage()); > > } > > writer.close(); > > return true; > >} > > > > > > } > > > > > > > -- > [EMAIL PROTECTED] > > Aduna > Prinses Julianaplein 14-b > 3817 CS Amersfoort > The Netherlands > > +31 33 465 9987 phone > +31 33 465 9987 fax > > http://aduna.biz > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Building query to match a sub-string of a field
If you are building a query using the API, the WildcardQuery class will allow you to use a leading wildcard character. The QueryParser will not allow this, however, so if you're getting queries using the QueryParser a leading wildcard won't work. I have successfully done substring queries through the API using code previously posted to the list: http://www.mail-archive.com/[EMAIL PROTECTED]/msg06388.html I haven't run into any performance problems because of these classes. There were a few minor changes that needed to be made to that code to make it work with the latest Lucene 1.4RC3 - I think it was just a matter of changing a constructor signature. Ben -Original Message- From: Terence Lai [mailto:[EMAIL PROTECTED] Sent: Tuesday, June 29, 2004 4:29 PM To: [EMAIL PROTECTED] Subject: Building query to match a sub-string of a field Hi Everyone, I am trying to construct a query which matches a sub-string of a field. As an illustration, I would like to search the following words by using the sub-string "test": - test - testing - contest - contestable I realize that Lucene does support wildcard searches using "*" and "?" in the custom query. Therefore, the query string "*test*" should give me the right result. However, the Lucene query syntax (http://jakarta.apache.org/lucene/docs/queryparsersyntax.html) does not allow the wildcard "*" as the first character of the search. Therefore, the query "*test*" is invalid. Does anyone have a solution to build the query to achieve the same result? Thanks, Terence - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
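The API route described above is short enough to sketch directly: a `WildcardQuery` built from a `Term` accepts a leading `*`, which `QueryParser` rejects. The field name "contents" is just an example.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class SubstringQuery {
    // Build a substring query programmatically, bypassing QueryParser's
    // restriction on leading wildcard characters.
    public static Query substring(String field, String text) {
        return new WildcardQuery(new Term(field, "*" + text + "*"));
    }
    // e.g. substring("contents", "test") matches "test", "testing",
    // "contest", and "contestable", the cases listed in the question below.
}
```

One caveat worth noting: a leading wildcard forces Lucene to enumerate the term dictionary for the field, so this can be slow on large indexes even when it works correctly.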
queryparser: parsing boolean logic
Here is a follow-up to a previous message I posted, dealing with converting user-entered boolean logic into a Query. Why does the QueryParser construct the same query for the following two strings?

"apple AND orange OR pear AND grape"
"apple AND orange AND pear AND grape"

I think a user's expectation would be that the first query matches things containing apple and orange, or containing pear and grape. And that the second query would only match things containing all four items. However, the same query is constructed both times (the constructed query requires all four).

package collective.search.lucene.tests;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

import junit.framework.TestCase;

public class QPTest extends TestCase {

    public QPTest(String arg0) {
        super(arg0);
    }

    private void display(String s, Query q) {
        System.out.println("\"" + s + "\" = \"" + q.toString() + "\"");
    }

    public void testBooleanConstruction() throws ParseException {
        String test1 = "apple AND orange OR pear AND grape";
        String test2 = "apple AND orange AND pear AND grape";
        QueryParser qp = new QueryParser("df", new StandardAnalyzer());
        Query query1 = qp.parse(test1);
        Query query2 = qp.parse(test2);
        display(test1, query1);
        display(test2, query2);
    }
}
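One workaround, rather than a QueryParser fix: explicit parentheses force the grouping the user expects. A minimal sketch, reusing the "df" field name from the test above (the class name is illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class GroupedQueryExample {

    /**
     * Parses the query with explicit grouping so that it means
     * (apple and orange) or (pear and grape), instead of the
     * all-four-required query QueryParser builds without parentheses.
     */
    public static Query parseGrouped() throws Exception {
        QueryParser qp = new QueryParser("df", new StandardAnalyzer());
        return qp.parse("(apple AND orange) OR (pear AND grape)");
    }
}
```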
building a search query
I am working on a UI to allow a user to build a search query. The user creates individual "clauses", each of which is basically a simple search query. The user selects boolean operators (AND, OR, NOT), to connect these clauses. When the user is finished constructing the search, there will be N clauses and N-1 boolean connectors. Each clause is backed by an object that knows how to generate a Lucene Query from the clause. The objective is to combine the clauses and the boolean operators into a BooleanQuery. What is the best way to programmatically make the final BooleanQuery object? It seems there is a modeling mismatch: the user sees N clauses connected with N-1 connectors, but the BooleanQuery will require N Querys with each Query having its own required and prohibited flags set correctly. I looked briefly at the QueryParser class - it appears to have logic to bridge these two different ways of modeling complex queries (in the addClause method). Is this the best approach? What have others done? Thanks, Ben
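The N-clauses/N-1-connectors mismatch can be bridged the way QueryParser's addClause does: walk the clause list and derive each clause's required/prohibited flags from the operators on either side of it. Here is a minimal, library-free sketch of just that flag derivation (the class name and operator constants are hypothetical); each resulting pair is what you would pass to the 1.4-era BooleanQuery.add(query, required, prohibited) for the corresponding clause.

```java
public class ClauseFlags {

    public static final int AND = 0, OR = 1, NOT = 2;

    /**
     * For n clauses joined by n-1 operators, derive the
     * (required, prohibited) flag pair for each clause.
     * An AND on either side of a clause makes it required;
     * a NOT immediately before it makes it prohibited.
     * Mirrors QueryParser's behavior, including its lack of
     * AND-over-OR precedence.
     */
    public static boolean[][] derive(int n, int[] ops) {
        // flags[i][0] = required, flags[i][1] = prohibited
        boolean[][] flags = new boolean[n][2];
        for (int i = 0; i < n; i++) {
            boolean prevNot = i > 0 && ops[i - 1] == NOT;
            boolean prevAnd = i > 0 && ops[i - 1] == AND;
            boolean nextAnd = i < ops.length && ops[i] == AND;
            flags[i][1] = prevNot;
            flags[i][0] = !prevNot && (prevAnd || nextAnd);
        }
        return flags;
    }
}
```

For "apple AND orange OR pear AND grape" this marks all four clauses required, which matches what QueryParser produces (see the previous message); a UI that wants AND to bind tighter than OR would instead nest sub-BooleanQuerys per AND-group.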
Re: too many files open error
As PDFBox is an all Java solution there is no specific linux/unix version. The source that is available with the downloaded package should suit your needs. What does the sourceforge site not provide for you? Ben On Fri, 26 Mar 2004, Charlie Smith wrote: > Is there another source for the pdfbox than the sourceforge link from > pdfbox.org? > > I'd like to get the linux/unix version, and wonder if the source there is ok to > use? > Couldn't this be made available to jakarta, or maybe it has? > > > >> Otis wrote on 3/24/04 > >>Subject:Re: analyzer for word perfect? > > > >I just finished writing a chapter for Lucene in Action that deals with > >that. > > >PDF: pdfbox.org > >MS Word/Excel: jakarta.apache.org/poi > >WP: http://www.google.com/search?q=java+word+perfect+parser > > >Note that what you need are parsers. The term Analyzer has a special > >meaning in Lucene realm. > > >Otis > > > >--- Charlie Smith wrote: > >> Is there an analyzer for WordPerfect files? > >> > >> I have a need to be able to index WP files as well as MS files, pdfs, > >> etc. > >> > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Problem while Indexing Pdf files
The latest release of PDFBox changed the way it dealt with fonts and introduced this bug, please try the version in CVS and let me know if you are still having a problem. Ben On Thu, 25 Mar 2004, Ankur Goel wrote: > > Hi, > > I have to index PDF files. For that I am using pdfbox. But when I try to > extract text from pdf file using pdfbox I get the following error: > > java.io.IOException: Error: No 'ToUnicode' and no 'Encoding' for Font > > at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:347) > > at > org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:169) > > at > org.pdfbox.util.PDFTextStripper.showString(PDFTextStripper.java:461) > > at > org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:692) > > at > org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:128) > > at > org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:268) > > at > org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:200) > > at > org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:172) > > at > org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:120) > > at org.pdfbox.ExtractText.main(ExtractText.java:213) > > at test.LuceneExampleIndexer.indexFile(LuceneExampleIndexer.java:67) > > at > test.LuceneExampleIndexer.indexDirectory(LuceneExampleIndexer.java:47) > > at test.LuceneExampleIndexer.index(LuceneExampleIndexer.java:30) > > at test.LuceneExampleIndexer.main(LuceneExampleIndexer.java:118) > > > Please tell me how to go about it. > > Thanks, > Ankur > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Indexing japanese PDF documents
Yes he did, but I was away the past couple days. As this is more of a PDFBox issue I responded in the PDFBox forums, please follow the thread there if you are interested. Ben On Mon, 22 Mar 2004, Otis Gospodnetic wrote: > I have not tried these other tools yet. > Have you asked Ben Litchfield, the PDFBox author, about handling of > Japanese text? > > Otis > > --- Chandan Tamrakar <[EMAIL PROTECTED]> wrote: > > I am using latest PDFbox library for parsing . I can parse a english > > documents successfully but when I parse a document containing english > > and > > japanese I do not get as I expected . > > > > Have anyone tried using PDFBox library for parsing a japanese > > documents ? Or > > do i need to use other parser like xPDF ,Jpedal ? > > > > Thanks in advace > > Chandan > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: use Lucene LOCAL (looking for a frontend)
For an "out of the box" job, I found searchblox pretty impressive, and easy to install. -Original Message- From: Sebastian Fey [mailto:[EMAIL PROTECTED] Sent: 28 January 2004 14:23 To: Lucene Users List Subject: AW: use Lucene LOCAL (looking for a frontend) >Not being funny, but if you have no experience in Java, then why are you using a Java >API >for index building/text searching ? im just testing some possibilities. though i cant write an java application, i can read it and, if someone gives me something to start with, im sure ill make it. if lucene seems to be the best solution, ill spend some time to leran something about java. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] This e-mail and any attachments may be confidential and/or legally privileged. If you have received this e-mail and you are not a named addressee, please inform Landmark Information Group on 01392 441700 and then delete the e-mail from your system. If you are not a named addressee you must not use, disclose, distribute, copy, print or rely on this e-mail. This email and any attachments have been scanned for viruses and to the best of our knowledge are clean. To ensure regulatory compliance and for the protection of our clients and business, we may monitor and read e-mails sent to and from our servers. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: use Lucene LOCAL (looking for a frontend)
Not being funny, but if you have no experience in Java, then why are you using a Java API for index building/text searching ? -Original Message- From: Sebastian Fey [mailto:[EMAIL PROTECTED] Sent: 28 January 2004 14:01 To: Lucene Users List Subject: RE: use Lucene LOCAL (looking for a frontend) >To index local files leverage some of the >code I have put in my java.net articles, or use the Ant task >that resides in the sandbox repository, or write your own. im satisfied with the index ive for now, but later on ill take a look ... >How you present the search results will be up to you and the needs of your >project. ive NO experience with java. it would be nice to see an example of a webinterface, that implements lucene to have something to start with. thx, Sebastian - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: SearchBlox J2EE Search Component Version 1.1 released
I am seriously impressed with that - very smooth looking, and easy to use; it's a shame it's quite pricey ... -Original Message- From: Tate Avery [mailto:[EMAIL PROTECTED] Sent: 02 December 2003 15:45 To: Lucene Users List Subject: RE: SearchBlox J2EE Search Component Version 1.1 released If you buy it, apparently: http://www.searchblox.com/buy.html -Original Message- From: Tun Lin [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 02, 2003 10:43 AM To: 'Lucene Users List'; [EMAIL PROTECTED] Subject: RE: SearchBlox J2EE Search Component Version 1.1 released Hi, Just some feedback. SearchBlox can only search for html files. Will SearchBlox support pdf, xml and word documents in future? It will be perfect if it can support all document types mentioned above. -Original Message- From: Robert Selvaraj [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 02, 2003 10:42 PM To: Lucene Users List; [EMAIL PROTECTED] Subject: SearchBlox J2EE Search Component Version 1.1 released SearchBlox is a J2EE search component that enables you to add search functionality to your applications, intranets or portals in a matter of minutes. SearchBlox uses the Lucene Search API and features integrated HTTP and File System crawlers, support for different document formats, support for indexing and searching content in 15 languages and customizable search results, all controlled from a browser-based Admin Console. Main features in this update: - Asian language support. SearchBlox now supports Japanese, Chinese Simplified, Chinese Traditional and Korean language content. - Performance enhancements to search - Improved Hit Highlighting SearchBlox is available as a Web Archive (WAR) and is deployable on any Servlet 2.3/JSP 1.2 compliant server.
SearchBlox Getting-Started Guides are available for the following servers: JBoss - http://www.searchblox.com/gettingstarted_jboss.html Jetty - http://www.searchblox.com/gettingstarted_jetty.html JRun - http://www.searchblox.com/gettingstarted_jrun.html Pramati - http://www.searchblox.com/gettingstarted_pramati.html Resin - http://www.searchblox.com/gettingstarted_resin.html Tomcat - http://www.searchblox.com/gettingstarted_tomcat.html Weblogic - http://www.searchblox.com/gettingstarted_weblogic.html Websphere - http://www.searchblox.com/gettingstarted_websphere.html The SearchBlox FREE Edition is available free of charge and can index up to 1000 HTML documents. The software can be downloaded from http://www.searchblox.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Lucene refresh index function (incremental indexing).
Logging uses log4j and can be configured. If you are having issues with specific PDFs then you can post a bug on the sourceforge site or mail me the PDFs directly and I will look at them. Ben http://www.pdfbox.org On Tue, 25 Nov 2003, Zhou, Oliver wrote: > I do have other problems with PDFBox-0.6.4. For one, it has annoying debug > information at very low level parsing process. The other, I got infinite > loop while indexing pdf files although they say the infinite loop bug has > been fixed in their release notes. Anybody knows what's going on? > > Thanks, > Oliver > > > > -Original Message- > From: Ben Litchfield [mailto:[EMAIL PROTECTED] > Sent: Tuesday, November 25, 2003 9:45 AM > To: Lucene Users List > Subject: RE: Lucene refresh index function (incremental indexing). > > > > Yes, just add the log4j configuration. The easiest way to do that is as a > system parameter like this > > java -Dlog4j.configuration=log4j.xml org.apache.lucene.demo.IndexHTML > -create -index c:\\index .. > > Where log4j.xml is the path to your log4j config, PDFBox has an example > one you can use. > > Ben > http://www.pdfbox.org > > On Tue, 25 Nov 2003, Zhou, Oliver wrote: > > > Lucene doesn't have pdf parser. In order to index pdf files you have to > add > > one by your self. PDFBox is a good choice. You may just ignore the > warning > > for log4j or you can add log4j in your classpath. > > > > Oliver > > > > > > -Original Message- > > From: Tun Lin [mailto:[EMAIL PROTECTED] > > Sent: Monday, November 24, 2003 10:07 PM > > To: 'Lucene Users List' > > Subject: RE: Lucene refresh index function (incremental indexing). > > > > > > Does it support indexing the contents of pdf files? I have found one > project > > called PDFBox that can be integrated with Lucene to search inside of the > pdf > > files. Currently, Lucene can only search for the pdf filename. 
I tried > with > > PDFBox and I got the following message when I typed the command: java > > org.apache.lucene.demo.IndexHTML -create -index c:\\index .. > > > > log4j:WARN No appenders could be found for logger > > (org.pdfbox.pdfparser.PDFParse > > r). > > log4j:WARN Please initialize the log4j system properly. > > > > Can anyone advise? > > > > -Original Message- > > From: Doug Cutting [mailto:[EMAIL PROTECTED] > > Sent: Tuesday, November 25, 2003 5:01 AM > > To: Lucene Users List > > Subject: Re: Lucene refresh index function (incremental indexing). > > > > Tun Lin wrote: > > > These are the steps I took: > > > > > > 1) I compile all the files in a particular directory using the command: > > > java org.apache.lucene.demo.IndexHTML -create -index c:\\index .. > > > , putting all the indexed files in c:\\index. > > > 2) Everytime, I added an additional file in that directory. I need to > > > reindex/recompile that directory to generate the indexes again. As the > > > directory gets larger, the indexing takes a longer time. > > > > > > My question is how do I generate the indexes automatically everytime a > > > new document is added in that directory without me recompiling everytime > > manually? > > > > To update, try removing the '-create' from the command line. The demo > code > > supports incremental updates. It will re-scan the directory and figure > out > > which files have changed, what new files have appeared and which > previously > > existing files have been removed. 
> > > > Doug > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Lucene refresh index function (incremental indexing).
Yes, just add the log4j configuration. The easiest way to do that is as a system parameter like this java -Dlog4j.configuration=log4j.xml org.apache.lucene.demo.IndexHTML -create -index c:\\index .. Where log4j.xml is the path to your log4j config, PDFBox has an example one you can use. Ben http://www.pdfbox.org On Tue, 25 Nov 2003, Zhou, Oliver wrote: > Lucene doesn't have pdf parser. In order to index pdf files you have to add > one by your self. PDFBox is a good choice. You may just ignore the warning > for log4j or you can add log4j in your classpath. > > Oliver > > > -Original Message- > From: Tun Lin [mailto:[EMAIL PROTECTED] > Sent: Monday, November 24, 2003 10:07 PM > To: 'Lucene Users List' > Subject: RE: Lucene refresh index function (incremental indexing). > > > Does it support indexing the contents of pdf files? I have found one project > called PDFBox that can be integrated with Lucene to search inside of the pdf > files. Currently, Lucene can only search for the pdf filename. I tried with > PDFBox and I got the following message when I typed the command: java > org.apache.lucene.demo.IndexHTML -create -index c:\\index .. > > log4j:WARN No appenders could be found for logger > (org.pdfbox.pdfparser.PDFParse > r). > log4j:WARN Please initialize the log4j system properly. > > Can anyone advise? > > -Original Message- > From: Doug Cutting [mailto:[EMAIL PROTECTED] > Sent: Tuesday, November 25, 2003 5:01 AM > To: Lucene Users List > Subject: Re: Lucene refresh index function (incremental indexing). > > Tun Lin wrote: > > These are the steps I took: > > > > 1) I compile all the files in a particular directory using the command: > > java org.apache.lucene.demo.IndexHTML -create -index c:\\index .. > > , putting all the indexed files in c:\\index. > > 2) Everytime, I added an additional file in that directory. I need to > > reindex/recompile that directory to generate the indexes again. As the > > directory gets larger, the indexing takes a longer time. 
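For reference, a minimal log4j 1.x configuration of the kind referred to above might look like the following (the appender name and pattern are illustrative; PDFBox ships an example you can use instead). Giving the root logger an appender and an ERROR threshold silences the "No appenders could be found" warning and the low-level parser debug output:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <appender name="console" class="org.apache.log4j.ConsoleAppender">
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d %-5p %c - %m%n"/>
    </layout>
  </appender>
  <root>
    <!-- ERROR keeps PDFBox's parser debug/warn chatter out of the output -->
    <priority value="error"/>
    <appender-ref ref="console"/>
  </root>
</log4j:configuration>
```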
> > > > My question is how do I generate the indexes automatically everytime a > > new document is added in that directory without me recompiling everytime > manually? > > To update, try removing the '-create' from the command line. The demo code > supports incremental updates. It will re-scan the directory and figure out > which files have changed, what new files have appeared and which previously > existing files have been removed. > > Doug > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Missing pdf document title
I would try two things. 1)Is PDFBox getting the title from the document? You can run this example to find out java org.pdfbox.examples.pdmodel.PrintDocumentMetaData 2)Is the lucene field getting properly set in the lucene database. I would use luke(http://www.getopt.org/luke/) to verify that lucene is getting the field. Other than that I would double check your code that gets the "Title" field correctly. Ben On Mon, 10 Nov 2003, Zhou, Oliver wrote: > Hi, > > I'm using lucene demo IndexHTML.java with pdfbox-0.6.4 to index pdf files. > It created the index files. However, the pdf document title was empty when > I did search. Any idea on why? > > Thanks > Oliver > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Exotic format indexing?
Unfortunately, it is not quite so easy. I am not sure about Word documents but PDFs usually have there contents compressed so a raw "fishing" around for text would be pointless. Your best bet is to use a package like the one from textmining.org that handles various formats for you. Ben On Thu, 30 Oct 2003, petite_abeille wrote: > Hello, > > Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a > popular question on this list... > > The traditional approach seems to be to try to find some kind of format > specific reader to properly extract the textual part of such documents > for indexing. The drawback of such an approach is that its complicated > and cumborsome: many different formats, not that many Java libraries to > understand them all. > > An alternative to such a mess could be perhaps to convert those > multitude of formats into something more or less standard and then > extract the text from that. But again, this doesn't seem to be such a > straightforward proposition. For example, one could image "printing" > every document to PDF and then convert the resulting PDF to text. Not a > piece of cake in Java. > > Finally, a while back, somebody on this list mentioned quiet a > different approach: simply read the raw binary document and go fishing > for what looks like text. I would like to try that :) > > Does anyone remember this proposal? Has anyone tried such an approach? > > Thanks for any pointers. > > Cheers, > > PA. > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Does the Lucene search engine work with PDF's?
You need to be able to extract the text from them and feed that to Lucene. http://www.pdfbox.org can extract text from PDF documents. Ben On Fri, 17 Oct 2003, Andre Hughes wrote: > Hello, > Can the Lucene search engine index and search though PDF documents? > What are the file format limits for Lucene search engine. > > Thanks in Advance, > > Andre' - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
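Putting Ben's suggestion together with the PDFTextStripper code quoted elsewhere in this digest, a minimal extract-and-index sketch might look like this. The class and field names are illustrative, and the close() call should be checked against your PDFBox version (the 0.6.0 release notes say the underlying COSDocument must be closed):

```java
import java.io.FileInputStream;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PdfIndexer {

    /**
     * Parses a PDF, extracts its text, and wraps it in a Lucene
     * Document ready for IndexWriter.addDocument().
     */
    public static Document toDocument(String path) throws Exception {
        PDDocument pdDoc = null;
        try {
            PDFParser parser = new PDFParser(new FileInputStream(path));
            parser.parse();
            pdDoc = parser.getPDDocument();
            String text = new PDFTextStripper().getText(pdDoc);

            Document doc = new Document();
            doc.add(Field.UnIndexed("path", path));   // stored, not searched
            doc.add(Field.UnStored("contents", text)); // searched, not stored
            return doc;
        } finally {
            if (pdDoc != null) {
                pdDoc.close(); // releases the underlying COSDocument
            }
        }
    }
}
```

Alternatively, PDFBox's own org.pdfbox.searchengine.lucene.LucenePDFDocument (mentioned in several messages below) does this, plus the document summary fields, for you.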
Re: Lucene demo ideas?
> - Index text and HTML files. Any others? What, no PDF files!! Ben -- http://www.pdfbox.org - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Question on Lucene when indexing big pdf files
> "cisco". I use Luke and my searcher program as the searching client, > it seems no problem. Can anyone help me? Or any comments on this When you use luke to look at your index does it show the correct contents for those documents? Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: about PDF / HTML index
PDFBox comes with the class org.pdfbox.searchengine.lucene.LucenePDFDocument which shows how to parse/index a PDF document. Ben On Tue, 15 Jul 2003, alvaro z wrote: > > im using lucene with TXT and HTML files , its working. > > the only problem with HTML files is that i have to index html files as txt first , > before to index them as HTML. > > do anyone have try to index pdf files ? > > im trying the pdfbox , is there any samples for indexing pdf files ? (i dont find > any samples to do that) with any of the parsers (pdfbox, jpedal ,etc). > > thanks for helping, > > Alvaro. from Lima - Peru - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: out of memory
It is possible that it is one single PDF that is having an issue. Can you track it down to that one and let me know which it is. It would be very helpful if you could send it to me as well. Ben http://www.pdfbox.org On Wed, 2 Apr 2003, Eoghan S wrote: > i have tried every memory setting using the -X options, up as far as > 512M actually, no effect. i also tried increasing the thread stack in > case this could have caused it, still no difference. > > thanks all the same > > > On Wed, 2003-04-02 at 20:44, Lichtner, Guglielmo wrote: > > OutOfMemory errors sometimes are not errors. You may need to use -mx to > reset the maximum memory allocated to the jvm. > > -Original Message- > From: Eoghan S [mailto:[EMAIL PROTECTED] > Sent: Wednesday, April 02, 2003 2:23 PM > To: [EMAIL PROTECTED] > Subject: out of memory > > > hi! > i am using lucene1.2 in a file sharing system, my average file amount > is about 400 totalling about 50megs (small), when run on linux it is > fine using jdk1.4.1, however using jdk1.4.1 on windows i get an outof > memory error. i am using pdfbox 0.6.1, i have also tried 0.5.6, however > same problem. i am not sure where the problem lies,whether pdfbox or > lucene or something in jdk, but was wondering if anyone else had the > same experience.. or a solution > thanks > > -- > Eoghans Fortune For Wed Apr 2 17:43:01 IST 2003 > All the world's a stage and most of us are desperately unrehearsed. > -- Sean O'Casey > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > -- > Eoghans Fortune For Wed Apr 2 17:43:01 IST 2003 > All the world's a stage and most of us are desperately unrehearsed. > -- Sean O'Casey > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: getting PDFBox O/P into a stream
I am not sure what you mean by O/P. You can call into the org.pdfbox.searchengine.lucene.LucenePDFDocument to create a Lucene Document, which then can be added to the index. PDFBox also comes with a version of the IndexFiles that is basically the same as the demo one from lucene. This class can be called from the command line to create an index. Ben Litchfield -- On Tue, 25 Mar 2003, Ramrakhiani, Vikas wrote: > Can some one please help me with the command to get O/P from PDFBox on > command line or into streams rather that dumping it into a text file. > > thanks, > vikas. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [ANN] PDFBox 0.6.0
I believe this problem has been fixed with 0.6.1. Please give it a try. Ben Litchfield -- On Thu, 6 Mar 2003, Eric Anderson wrote: > When it throws the exception, the indexer fails, so I cannot continue the index. > > It appears that it's only related to some files, as I have been able to remove > some of the files, and it will continue past that point, but if it encounters > one of these files, the index fails. > > Eric Anderson > LanRx Network Solutions > 815-505-6132 > > > Quoting Ben Litchfield <[EMAIL PROTECTED]>: > > > In this release I have changed how I parsed the document, which may have > > introduced this bug. I have received another report of this and will have > > it fixed for the next point release. > > > > You said you tried with reasonably sized PDF repository. Did you stop > > indexing at this error or did you continue? If you continued, is this the > > only error that you got? > > > > -Ben > > > > > > > > > > -- > > > > On Thu, 6 Mar 2003, Eric Anderson wrote: > > > > > Ben- > > > In attempting to use the PDFBox-0.6.0, I rec'd the following error when > > > attempting to scan a reasonably sized PDF repository. > > > > > > Any thoughts? > > > > > > > > > caught a class java.io.EOFException > > > with message: Unexpected end of ZLIB input stream > > > > > > > > > Eric Anderson > > > LanRx Network Solutions > > > > > > > > > Quoting Ben Litchfield <[EMAIL PROTECTED]>: > > > > > > > I would like to announce the next release of PDFBox. PDFBox allows for > > > > PDF documents to be indexed using lucene through a simple interface. > > > > Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument, > > > > which will extract all text and PDF document summary properties as > > lucene > > > > fields. > > > > > > > > You can obtain the latest release from http://www.pdfbox.org > > > > > > > > Please send all bug reports to me and attach the PDF document when > > > > possible. 
> > > > > > > > RELEASE 0.6.0 > > > > -Massive improvements to memory footprint. > > > > -Must call close() on the COSDocument(LucenePDFDocument does this for > > you) > > > > -Really fixed the bug where small documents were not being indexed. > > > > -Fixed bug where no whitespace existed between obj and start of object. > > > > Exception in thread "main" java.io.IOException: expected='obj' > > > > actual='obj< > > > -Fixed issue with spacing where textLineMatrix was not being copied > > > > properly > > > > -Fixed 'bug' where parsing would fail with some pdfs with double endobj > > > > definitions > > > > -Added PDF document summary fields to the lucene document > > > > > > > > > > > > Thank you, > > > > Ben Litchfield > > > > http://www.pdfbox.org > > > > > > > > > > > > > > > > - > > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > LanRx Network Solutions, Inc. > > > Providing Enterprise Level Solutions...On A Small Business Budget > > > > > > - > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > LanRx Network Solutions, Inc. > Providing Enterprise Level Solutions...On A Small Business Budget > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [ANN] PDFBox 0.6.0
In this release I have changed how I parsed the document, which may have introduced this bug. I have received another report of this and will have it fixed for the next point release. You said you tried with reasonably sized PDF repository. Did you stop indexing at this error or did you continue? If you continued, is this the only error that you got? -Ben -- On Thu, 6 Mar 2003, Eric Anderson wrote: > Ben- > In attempting to use the PDFBox-0.6.0, I rec'd the following error when > attempting to scan a reasonably sized PDF repository. > > Any thoughts? > > > caught a class java.io.EOFException > with message: Unexpected end of ZLIB input stream > > > Eric Anderson > LanRx Network Solutions > > > Quoting Ben Litchfield <[EMAIL PROTECTED]>: > > > I would like to announce the next release of PDFBox. PDFBox allows for > > PDF documents to be indexed using lucene through a simple interface. > > Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument, > > which will extract all text and PDF document summary properties as lucene > > fields. > > > > You can obtain the latest release from http://www.pdfbox.org > > > > Please send all bug reports to me and attach the PDF document when > > possible. > > > > RELEASE 0.6.0 > > -Massive improvements to memory footprint. > > -Must call close() on the COSDocument(LucenePDFDocument does this for you) > > -Really fixed the bug where small documents were not being indexed. > > -Fixed bug where no whitespace existed between obj and start of object. 
> > Exception in thread "main" java.io.IOException: expected='obj'
> > actual='obj<
> > -Fixed issue with spacing where textLineMatrix was not being copied
> > properly
> > -Fixed 'bug' where parsing would fail with some pdfs with double endobj
> > definitions
> > -Added PDF document summary fields to the lucene document
> >
> > Thank you,
> > Ben Litchfield
> > http://www.pdfbox.org
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
> LanRx Network Solutions, Inc.
> Providing Enterprise Level Solutions...On A Small Business Budget
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
[ANN] PDFBox 0.6.0
I would like to announce the next release of PDFBox. PDFBox allows for PDF documents to be indexed using Lucene through a simple interface. Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument, which will extract all text and PDF document summary properties as Lucene fields.

You can obtain the latest release from http://www.pdfbox.org

Please send all bug reports to me and attach the PDF document when possible.

RELEASE 0.6.0
-Massive improvements to memory footprint.
-Must call close() on the COSDocument (LucenePDFDocument does this for you)
-Really fixed the bug where small documents were not being indexed.
-Fixed bug where no whitespace existed between obj and start of object.
 Exception in thread "main" java.io.IOException: expected='obj' actual='obj<
-Fixed issue with spacing where textLineMatrix was not being copied properly
-Fixed 'bug' where parsing would fail with some pdfs with double endobj definitions
-Added PDF document summary fields to the Lucene document

Thank you,
Ben Litchfield
http://www.pdfbox.org
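For anyone wiring this into an index, the intended call sequence can be sketched roughly as below. This is an untested sketch against the class names mentioned in the announcement; the IndexWriter constructor shown is the contemporary Lucene one, and exact signatures in a given release may differ.

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class IndexOnePdf
{
    public static void main( String[] args ) throws Exception
    {
        // getDocument extracts the text and summary properties as fields;
        // per the 0.6.0 notes it also closes the underlying COSDocument.
        Document doc = LucenePDFDocument.getDocument( new File( args[0] ) );

        // Add the extracted document to a Lucene index on disk.
        IndexWriter writer = new IndexWriter( "index", new StandardAnalyzer(), true );
        writer.addDocument( doc );
        writer.close();
    }
}
```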
RE: OutOfMemoryException while Indexing an XML file/PdfParser
I am aware of the issues with parsing certain PDF documents. I am currently working on refactoring PDFBox to deal with large documents. You will see this in the next release. I would like to thank people for feedback and sending problem documents. Ben Litchfield http://www.pdfbox.org On Tue, 18 Feb 2003, Pinky Iyer wrote: > > I am having similar problem but indexing pdf documents using pdfbox parser >(available at www.pdfbox.com). I get an exception saying "Exception in thread "main" >java.lang.OutOfMemoryError" Any body who has implemented the above code? Any help >appreciated??? > Thanks! > PI > Rob Outar <[EMAIL PROTECTED]> wrote:We are aware of DOM limitations/memory >problems, but I am using SAX to parse > the file and index elements and attributes in my content handler. > > Thanks, > > Rob > > -Original Message- > From: Tatu Saloranta [mailto:[EMAIL PROTECTED]] > Sent: Friday, February 14, 2003 8:18 PM > To: Lucene Users List > Subject: Re: OutOfMemoryException while Indexing an XML file > > > On Friday 14 February 2003 07:27, Aaron Galea wrote: > > I had this problem when using xerces to parse xml documents. The problem I > > think lies in the Java garbage collector. The way I solved it was to > create > > It's unlikely that GC is the culprit. Current ones are good at purging > objects > that are unreachable, and only throw OutOfMem exception when they really > have > no other choice. > Usually it's the app that has some dangling references to objects that > prevent > GC from collecting objects not useful any more. > > However, it's good to note that Xerces (and DOM parsers in general) > generally > use more memory than the input XML files they process; this because they > usually have to keep the whole document struct in memory, and there is > overhead on top of text segments. 
So it's likely to be at least 2 * input > file size (files usually use UTF-8 which most of the time uses 1 byte per > char; in memory 16-bit unicode-2 chars are used for performance), plus some > additional overhead for storing element structure information and all that. > > And since default max. java heap size is 64 megs, big XML files can cause > problems. > > More likely however is that references to already processed DOM trees are > not > nulled in a loop or something like that? Especially if doing one JVM process > for item solves the problem. > > > a shell script that invokes a java program for each xml file that adds it > > to the index. > > -+ Tatu +- > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > - > Do you Yahoo!? > Yahoo! Shopping - Send Flowers for Valentine's Day -- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
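Tatu's "at least 2 * input file size" estimate can be checked directly: an ASCII XML file is one byte per character as UTF-8 on disk, while Java strings hold 16-bit chars in memory. A small self-contained illustration (the XML fragment here is made up):

```java
import java.io.UnsupportedEncodingException;

public class XmlMemoryEstimate
{
    public static void main( String[] args ) throws UnsupportedEncodingException
    {
        // A purely ASCII XML fragment: one byte per character as UTF-8 on disk.
        String xml = "<doc><title>hello</title></doc>";
        int utf8Bytes = xml.getBytes( "UTF-8" ).length;

        // In memory Java stores 16-bit chars, so the raw text alone is about
        // twice the file size, before any DOM node overhead is counted.
        int inMemoryBytes = xml.length() * 2;

        System.out.println( utf8Bytes );     // prints 31
        System.out.println( inMemoryBytes ); // prints 62
    }
}
```

A DOM parser then allocates element and attribute objects on top of this raw text, which is why a 64m default heap can run out on surprisingly modest XML files.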
Re: PDF Text extraction
You need to do something like:

//first get the document field
Field contentsField = doc.getField( "contents" );

//Then get the reader from the field
BufferedReader contentsReader = new BufferedReader( contentsField.readerValue() );

//finally dump the contents of the reader to System.out
String line = null;
while( (line = contentsReader.readLine()) != null )
{
    System.out.println( line );
}

I have not tested if this compiles but it should be pretty close.

Ben Litchfield

On Fri, 27 Dec 2002, Suhas Indra wrote:
> Hello List
>
> I am using PDFBox to index some of the PDF documents. The parser works fine
> and I can read the summary. But the contents are displayed as
> java.io.InputStream.
>
> When I try the following:
> System.out.println(doc.getField("contents")) (where doc is the Document
> object)
>
> The result will be:
>
> Text
>
> I want to print the extracted data.
>
> Can anyone please let me know how to extract the contents?
>
> Regards
>
> Suhas
>
> --
> Robosoft Technologies - Partners in Product Development
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
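The drain-the-reader loop Ben describes can be exercised without a Lucene index by substituting a StringReader for the field's readerValue(); the sample text here is made up:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class DumpReader
{
    public static void main( String[] args ) throws IOException
    {
        // Stand-in for contentsField.readerValue(); a real "contents" field
        // would hand back a Reader over the text extracted from the PDF.
        StringReader contents = new StringReader( "page one\npage two" );

        // Wrap it in a BufferedReader and dump it line by line.
        BufferedReader contentsReader = new BufferedReader( contents );
        String line = null;
        while( (line = contentsReader.readLine()) != null )
        {
            System.out.println( line );
        }
        contentsReader.close();
    }
}
```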
PDFBox 0.5.6
PDFBox version 0.5.6 is now available at http://www.pdfbox.org

PDFBox makes it easy to add PDF documents to a Lucene index.

Fixes over the last version:
-Fixed bug in LucenePDFDocument where stream was not being closed and small documents were not being indexed.
-Fixed a spacing issue for some PDF documents.
-Fixed error while parsing the version number.
-Fixed NullPointer in persistence example.
-Created example Lucene IndexFiles class which models the demo from Lucene.
-Fixed bug where garbage at the end of file caused an infinite loop.
-Fixed bug in parsing boolean values with stuff at the end like "true>>".

Ben Litchfield

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
IOException not a directory
Has anybody seen this type of error before? This used to work and all of a sudden broke. That path is a folder.

Ben Litchfield

2002-10-28 12:51:31,109 [Default] java.io.IOException: \\Finsrv04\JBoss-2.4.1_Tomcat-3.2.3\fast_generated_output\lucene\website\index not a directory
    at org.apache.lucene.store.FSDirectory.<init>(Unknown Source)
    at org.apache.lucene.store.FSDirectory.getDirectory(Unknown Source)
    at org.apache.lucene.store.FSDirectory.getDirectory(Unknown Source)
    at org.apache.lucene.index.IndexReader.open(Unknown Source)
    at _0002fwebsite_0002dresults_0002ejspwebsite_0002dresults_jsp_1._jspService(_0002fwebsite_0002dresults_0002ejspwebsite_0002dresults_jsp_1.java:98)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:119)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at org.apache.jasper.servlet.JspServlet$JspCountedServlet.service(JspServlet.java:130)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at org.apache.jasper.servlet.JspServlet$JspServletWrapper.service(JspServlet.java:282)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:429)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:500)

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: pdfbox on solaris
I know that there are some memory issues with some documents. The next release of pdfbox fixes some of these, although I am not sure why it would run differently under Windows than Solaris. Off the top of my head, maybe the Solaris JVM uses more memory per object than the Windows JVM.

The easiest workaround is to increase the maximum heap size of the JVM using the -Xmx option.

Example: java -Xmx128m

The default maximum heap size has been 64m since JDK 1.2, so maybe try 128 or 256.

-Ben
http://www.pdfbox.org

On Wed, 28 Aug 2002, Deenesh wrote:
> Hi,
> i am using the pdfbox on solaris 8 and am trying to index a pdf file which is around 1 mb.
>
> I am getting a java.outofmemory error.
>
> Though the same code works fime under windows.
>
> Has anyone get the same problem?? Any suggestion?
>
> Thanks
> Deenesh

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
Re: problems with HTML Parser
Maurits, You can get a PDF parser from http://www.pdfbox.org -Ben On Wed, 14 Aug 2002, Maurits van Wijland wrote: > Keith, > > I haven't noticed the problem with the Parser...but you trigger me > by saying that you have a PDFParser!!! > > Are you able to contribute this PDFParser?? > > Maurits. -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
Re: PDF Text Stripper
Can you send me the PDF document that you are having problems with and I will look into it. There are still some issues that I am working out with the spacing of characters. -Ben On Tue, 9 Jul 2002, Keith Gunn wrote: > On Tue, 9 Jul 2002, Ben Litchfield wrote: > > > Hi, > > > > I have written a PDF library that can be used to strip text from PDF > > documents. It is released under LGPL so have fun. > > > > There is one class which can be used to easily index PDF documents. > > pdfparser.searchengine.lucene.LucenePDFDocument has a getDocument > > method which will take a PDF file and return a Lucene Document which you > > can add to an index. > > > > If you would like to see the quality of the text extraction you can run > > pdfparser.Main from the command line which will take a PDF document and > > write a txt file. > > > > I am looking for any input that you might have. Please mail me if you > > have any bugs or feature requests. > > > > The library can be retrieved from > > http://www.csh.rit.edu/~ben/projects/pdfparser/ > > > > -Ben Litchfield > > hi, > > I downloaded the zip and quickly ran the demo on a few files, it displays > .notdef between words and there are spaces between every letter for words, > is there code in your dist. to remove these so that just terms remain? > > Keith Gunn > University Of Aberdeen > > > > -- > To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> > For additional commands, e-mail: <mailto:[EMAIL PROTECTED]> > -- -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
PDF Text Stripper
Hi,

I have written a PDF library that can be used to strip text from PDF documents. It is released under LGPL so have fun.

There is one class which can be used to easily index PDF documents. pdfparser.searchengine.lucene.LucenePDFDocument has a getDocument method which will take a PDF file and return a Lucene Document which you can add to an index.

If you would like to see the quality of the text extraction you can run pdfparser.Main from the command line, which will take a PDF document and write a txt file.

I am looking for any input that you might have. Please mail me if you have any bugs or feature requests.

The library can be retrieved from http://www.csh.rit.edu/~ben/projects/pdfparser/

-Ben Litchfield

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>