Re: Multiple indexes

2005-03-01 Thread Ben
Is it true that for each index I have to create a separate instance
of FSDirectory, IndexWriter and IndexReader? Do I need to create a
separate locking mechanism as well?

I have already implemented a program using just one index.

Thanks,
Ben

On Tue, 1 Mar 2005 22:09:05 -0500, Erik Hatcher
<[EMAIL PROTECTED]> wrote:
> It's hard to answer such a general question with anything very precise,
> so sorry if this doesn't hit the mark.  Come back with more details and
> we'll gladly assist though.
> 
> First, certainly do not copy/paste code.  Use standard reuse practices,
> perhaps the same program can build the two different indexes if passed
> different parameters, or share code between two different programs as a
> JAR.
> 
> What specifically are the issues you're encountering?
> 
> Erik
> 
> 
> On Mar 1, 2005, at 8:06 PM, Ben wrote:
> 
> > Hi
> >
> > My site has two types of documents with different structure. I would
> > like to create an index for each type of document. What is the best
> > way to implement this?
> >
> > I have been trying to implement this but found out that 90% of the
> > code is the same.
> >
> > The Lucene in Action book has a case study on jGuru, but it only
> > mentions that they use multiple indexes. I would like to do something
> > like them.
> >
> > Any resources on the Internet that I can learn from?
> >
> > Thanks,
> > Ben
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]



Multiple indexes

2005-03-01 Thread Ben
Hi

My site has two types of documents with different structure. I would
like to create an index for each type of document. What is the best
way to implement this?

I have been trying to implement this but found out that 90% of the
code is the same.

The Lucene in Action book has a case study on jGuru, but it only
mentions that they use multiple indexes. I would like to do something
like them.

Any resources on the Internet that I can learn from?

Thanks,
Ben
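
One way to keep the shared 90% in a single place, along the lines of the
"same program, different parameters" advice given in the reply to this
question, is to pass the type-specific part in as a parameter. The sketch
below is only an illustration under stated assumptions: SharedIndexer, its
methods and the field names are hypothetical, and a plain list of field maps
stands in for a real Lucene index so that the example runs without any
library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: one indexer class, reused for every document type.
// A list of field maps stands in for the real index.
class SharedIndexer<T> {
    private final String indexPath;                        // one location per index
    private final Function<T, Map<String, String>> mapper; // the only type-specific code
    private final List<Map<String, String>> index = new ArrayList<>();

    SharedIndexer(String indexPath, Function<T, Map<String, String>> mapper) {
        this.indexPath = indexPath;
        this.mapper = mapper;
    }

    // Shared logic: convert the item to fields and store it.
    void add(T item) {
        index.add(mapper.apply(item));
    }

    int size() { return index.size(); }
    String indexPath() { return indexPath; }
    Map<String, String> get(int i) { return index.get(i); }

    // Helper that builds an ordered field map from name/value pairs.
    static Map<String, String> fields(String... namesAndValues) {
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i + 1 < namesAndValues.length; i += 2) {
            result.put(namesAndValues[i], namesAndValues[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        SharedIndexer<String[]> articles = new SharedIndexer<String[]>(
                "index/articles", a -> fields("title", a[0], "body", a[1]));
        SharedIndexer<String[]> products = new SharedIndexer<String[]>(
                "index/products", p -> fields("name", p[0], "price", p[1]));

        articles.add(new String[] {"IBM backs linux", "Some article text"});
        products.add(new String[] {"Widget", "9.99"});

        System.out.println(articles.indexPath() + " -> " + articles.get(0));
        System.out.println(products.indexPath() + " -> " + products.get(0));
    }
}
```

In a real program the add method would presumably build a Lucene Document
and hand it to an IndexWriter opened on indexPath; only the mapper would
differ between the two document types.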




Re: Investigating Lucene For Project

2005-03-01 Thread Ben Litchfield

See inlined comments below.

> We have had requests from some clients who would like the ability to
> "index"  PDF files, now and possibly other text files in the future. The
> PDF files live on a server and are in a structured environment. I would
> like to somehow index the content inside the PDF and be able to run
> searches on that information from a web-form. The result MUST BE a text
> snippet (that being some text prior to the searched word and after the
> searched word).  Does this make sense? And can Lucene do this?


Lucene indexes text documents, so you will need to convert your PDF to a
text document.  PDFBox (http://www.pdfbox.org/) can do that.  PDFBox also
provides a summary of the document, which is just the first x
characters.  If you want a smarter summary you will need to create that
yourself.

> If the product can do this, how is the best way to get rolling on a
> project of this nature? Purchase an example book, or are there simple
> examples one can pick up on? Does Lucene have a large learning curve? or
> reasonably quick?

There are tutorials available on the website, and I would recommend
the "Lucene in Action" book.  There is a learning curve for Lucene, but it
sounds like your requirements are pretty basic so it shouldn't be that
hard.



> If all the above will work, what kind of license does this require? I
> have not been able to find a link to that yet on the jakarta site.

http://www.apache.org/licenses/LICENSE-2.0

Ben




PDF Highlighter Package

2005-02-28 Thread Ben Litchfield

For those of you that support indexing PDF documents, PDFBox now supports
Adobe's PDF Highlight specification
(http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf)

PDFBox is now capable of generating an XML document that describes words
in a PDF document to highlight.

An "in action" example can be seen at

http://pavilion.csh.rit.edu:8080/pdfbox/index.html

You can enter any web accessible PDF and any keywords.  The PDF will open
normally and, after a short pause (this is running on an old, slow server),
will jump to the first selected keyword.

Source code is available in CVS or in tonight's nightly build.

Any comments/suggestions are welcome.

Special thanks to Stephan Lagraulet, who made this possible with code
contributions.

Ben
http://www.pdfbox.org




Sorting date stored in milliseconds time

2005-02-25 Thread Ben
Hi

I store my date in milliseconds, how can I do a sort on it? SortField
has INT, FLOAT and STRING. Do I need to create a new sort class, to
sort the long value?

Thanks
Ben
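
One library-independent way to sort long values when only INT, FLOAT and
STRING sort types are available is to index each value as a fixed-width,
zero-padded string, so that string order matches numeric order. The sketch
below is a minimal illustration that assumes non-negative timestamps;
SortableMillis and its methods are hypothetical names, not part of Lucene.

```java
import java.util.Locale;

// Hypothetical helper: encodes a non-negative long as a fixed-width string
// whose lexicographic order is the same as the numeric order.
class SortableMillis {
    // Long.MAX_VALUE has 19 decimal digits, so 19 characters always suffice.
    private static final int WIDTH = 19;

    static String encode(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", millis);
    }

    static long decode(String encoded) {
        return Long.parseLong(encoded);
    }

    public static void main(String[] args) {
        String earlier = encode(9L);
        String later = encode(10L);
        // Unpadded, "9" sorts after "10"; padded, the order is numeric.
        System.out.println("9".compareTo("10") > 0);
        System.out.println(earlier.compareTo(later) < 0);
        System.out.println(decode(encode(System.currentTimeMillis())));
    }
}
```

The encoded value would then be indexed as a single untokenized term and
sorted as a string; decode recovers the original number for display.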




Sorting isn't working for my date field

2005-02-21 Thread Ben
Hi

Do I need to store and index the field I want to sort on? Currently I am
only indexing the field, without storing or tokenizing it.

I have a date field indexing as MMdd and I have two documents with
the same date. When I do my search with:

searcher.search(query, new Sort(new SortField("date", true)));
searcher.search(query, new Sort(new SortField("date", false)));

they both return the same order.

Any idea? Thanks.

Ben




Re: MultiFieldQueryParser 1.8 isn't parsing phrases

2005-02-19 Thread Ben
Thanks


On Sat, 19 Feb 2005 16:09:49 +0100, Daniel Naber
<[EMAIL PROTECTED]> wrote:
> On Saturday 19 February 2005 15:26, Ben wrote:
> 
> > When I try to search for phrases using the MultiFieldQueryParser v1.8
> > from CVS, it gives me NullPointerException.
> 
> This has just been fixed in SVN (I assume you mean SVN, CVS still exists
> but is read only and probably not updated anymore).
> 
> Regards
>  Daniel
> 
> --
> http://www.danielnaber.de



MultiFieldQueryParser 1.8 isn't parsing phrases

2005-02-19 Thread Ben
Hi

When I try to search for phrases using the MultiFieldQueryParser v1.8
from CVS, it gives me NullPointerException.

Using the following keyword works:

title:"IBM backs linux"

However, it gives me the exception if I use the following keyword:

"IBM backs linux"

Any idea why? I am using this MultiFieldQueryParser with Lucene 1.4.3.
Of course I changed some of the boolean stuff to make it work with
the production release.

Thanks,
Ben




Re: Use an executable from java ...

2005-02-08 Thread Ben Litchfield

Kristian,

I assume all of your comments are with the 0.7.0 version of PDFBox.  There
were some great improvements in that version in terms of speed and
accuracy.

> That's curious because we experienced that pdftotext was able to
> convert 33% more pdf documents than PDFBox.

Depending on the set of PDF documents you will notice different results.
I welcome any bug reports (if they don't already exist) on that 33% that
are not working for you.  In particular, PDFBox needs some work on
non-English languages.


> That's good. Our application supports alternative conversion pipelines
> that provide fallback mechanisms. If the first converter cannot convert a
> document a second converter is called. So PDFBox is our fallback
> converter.


Well, at least PDFBox made it as the "fallback".  :)

Ben
http://www.pdfbox.org




Re: Use an executable from java ...

2005-01-31 Thread Ben Litchfield

I will assume you are asking this question on the lucene mailing list
because you now want to index that PDF document.

Have you tried PDFBox?  It can't create an html file for you but it can
extract text.

Ben
http://www.pdfbox.org



On Mon, 31 Jan 2005, Bertrand VENZAL wrote:

> Hi all,
>
> I have a problem executing a tool that converts a PDF to HTML under
> Linux. I have an executable, "pdftohtml", which works correctly in batch
> mode, and it also works when I call it from Java under Windows 2000, BUT
> it does not work at all on the server under Linux. I am using the
> following code.
>
> scommand = "/bin/sh -c \"myCommand fileName output\" ";
>
> Runtime runtime = Runtime.getRuntime();
> Process proc = runtime.exec(scommand);
> proc.waitFor();
>
>
> I am running my code under Red Hat Linux with a standard shell.
> Is there another way to do the same thing, or am I missing something?
> Any help will be greatly appreciated.
>
> Thanks
> Bertrand
>
>
>
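
A likely cause of the behaviour described in the quoted message is that
Runtime.exec(String) splits the command line on whitespace and does not
interpret shell quoting, so the quoted part never reaches /bin/sh as a
single argument. Passing the arguments separately avoids that. The sketch
below is a minimal illustration, assuming a Unix-like system with /bin/sh;
ShellRunner and run are hypothetical names, and ProcessBuilder needs Java 5
or later (on older JVMs Runtime.exec(String[]) takes the same argument
array).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;

// Hypothetical helper: runs a shell command with the arguments passed
// separately, and reads the output so the child cannot block on a full pipe.
class ShellRunner {
    static String run(String shellCommand) {
        // Each element is one argument; the whole command string reaches sh intact.
        ProcessBuilder builder = new ProcessBuilder("/bin/sh", "-c", shellCommand);
        builder.redirectErrorStream(true); // merge stderr into stdout
        try {
            Process proc = builder.start();
            StringBuilder output = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(proc.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    output.append(line).append('\n');
                }
            }
            int exit = proc.waitFor();
            if (exit != 0) {
                throw new IllegalStateException("exit status " + exit + ": " + output);
            }
            return output.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for command", e);
        }
    }

    public static void main(String[] args) {
        // The single quotes are handled by the shell, not by Java.
        System.out.print(run("echo converting 'my file.pdf'"));
    }
}
```

Checking the exit status and the captured output also shows why a command
failed, which a bare waitFor() does not.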




Search results excerpt similar to Google

2005-01-27 Thread Ben
Hi

Is it hard to implement a function that displays search result
excerpts similar to Google's?

Is it just string manipulation, or is there some logic behind it? I
like their excerpts.

Thanks
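
The string-manipulation part of the question can be sketched without any
search library: find the first occurrence of the search term and keep a
fixed number of characters on either side. This is only a naive
illustration, not how Google or any particular highlighter works; Excerpt
and around are hypothetical names, and the context width is an assumption.

```java
// Hypothetical helper: returns the text surrounding the first
// case-insensitive occurrence of a term.
class Excerpt {
    static String around(String text, String term, int context) {
        int hit = indexOfIgnoreCase(text, term);
        if (hit < 0) {
            // No match: fall back to the beginning of the text.
            int end = Math.min(text.length(), 2 * context);
            return text.substring(0, end) + (end < text.length() ? "..." : "");
        }
        int start = Math.max(0, hit - context);
        int end = Math.min(text.length(), hit + term.length() + context);
        return (start > 0 ? "..." : "")
                + text.substring(start, end)
                + (end < text.length() ? "..." : "");
    }

    // regionMatches compares in place, so the returned index is always
    // valid for the original string.
    private static int indexOfIgnoreCase(String text, String term) {
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "Lucene is a high-performance text search engine library";
        System.out.println(around(text, "search", 20));
    }
}
```

A nicer excerpt would cut at word boundaries and prefer passages that
contain several query terms, which is where the extra logic comes in.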




Re: FOP Generated PDF and PDFBox

2005-01-21 Thread Ben Litchfield


Yes, when calling LucenePDFDocument.getDocument( File ) the url should be
the same as the path.

This is the code that the class uses to set those fields.

document.add( Field.UnIndexed("path", file.getPath() ) );
document.add( Field.UnIndexed("url", file.getPath().replace(FILE_SEPARATOR, '/') ) );

I have no idea why an FOP PDF would be any different than another PDF.

You can also run it from the command line, like this (this is just for
debugging purposes):

java org.pdfbox.searchengine.lucene.LucenePDFDocument 

and it should print out the fields of the lucene Document object.  Is the
url there and is it correct?

Ben

On Fri, 21 Jan 2005, Luke Shannon wrote:

> That is correct. No difference with how other PDF are handled.
>
> I am looking at the index in Luke now. The FOP generated documents have a
> path but no URL? I would guess that these would be the same?
>
> Thanks for the speedy reply.
>
> Luke
>
>
> - Original Message -
> From: "Ben Litchfield" <[EMAIL PROTECTED]>
> To: "Lucene Users List" 
> Sent: Friday, January 21, 2005 12:34 PM
> Subject: Re: FOP Generated PDF and PDFBox
>
>
> >
> >
> > Are you indexing the FOP PDF's differently than other PDF documents?
> >
> > Can I assume that you are using PDFBox's LucenePDFDocument.getDocument()
> > method?
> >
> > Ben
> >
> > On Fri, 21 Jan 2005, Luke Shannon wrote:
> >
> > > Hello;
> > >
> > > Our CMS now allows users to create PDF documents (uses FOP) and then
> > > search them.
> > >
> > > I seem to be able to index these documents ok. But when I am generating
> > > the results to display I get a Null Pointer Exception while trying to
> > > use a variable that should contain the url keyword for one of these
> > > documents in the index:
> > >
> > > Document doc = hits.doc(i);
> > > String path = doc.get("url");
> > >
> > > Path contains null.
> > >
> > > The interesting thing is this only happens with PDFs that are generated
> > > with FOP. Other PDFs are fine.
> > >
> > > What I find weird is shouldn't the "url" field just contain the path of
> > > the file?
> > >
> > > Anyone else seen this before?
> > >
> > > Any ideas?
> > >
> > > Thanks,
> > >
> > > Luke



Re: FOP Generated PDF and PDFBox

2005-01-21 Thread Ben Litchfield


Are you indexing the FOP PDF's differently than other PDF documents?

Can I assume that you are using PDFBox's LucenePDFDocument.getDocument()
method?

Ben

On Fri, 21 Jan 2005, Luke Shannon wrote:

> Hello;
>
> Our CMS now allows users to create PDF documents (uses FOP) and then search
> them.
>
> I seem to be able to index these documents ok. But when I am generating the
> results to display I get a Null Pointer Exception while trying to use a
> variable that should contain the url keyword for one of these documents in
> the index:
>
> Document doc = hits.doc(i);
> String path = doc.get("url");
>
> Path contains null.
>
> The interesting thing is this only happens with PDFs that are generated with
> FOP. Other PDFs are fine.
>
> What I find weird is shouldn't the "url" field just contain the path of the
> file?
>
> Anyone else seen this before?
>
> Any ideas?
>
> Thanks,
>
> Luke



Re: PDFBox deprecated methods

2005-01-05 Thread ben
Daniel,

Yes, that getText( PDDocument ) is the method you should be using.

You no longer need to use a COSDocument object. Please note the following
methods that go along with the deprecation of getText( COSDocument ):

PDFParser.getPDDocument() - to get a PDDocument instead of a COSDocument
after parsing
PDDocument.load() - a convenience method that does all the PDFParser stuff
and returns a PDDocument
LucenePDFDocument.getDocument() - to go straight from a File/URL to a
Lucene document object


Ben


Quoting Daniel Cortes <[EMAIL PROTECTED]>:

> OK, I'll answer my own question:
> the deprecated method is getText(COSDocument).
> If you do stripper.getText(new PDDocument(cosDoc)) there isn't any problem.
> 
> 
> Excuse me, for the question
> 
> 
> Daniel Cortes wrote:
> 
> > I've been using PDFBox to index a directory. I've downloaded the
> > latest version of PDFBox (0.6.7.a) and I've seen that the method
> > I use to extract text, PDFTextStripper.getText(), is deprecated.
> > stripper.getText(new PDDocument(cosDoc));
> > I know a lot of people use this method the way I do. What are the
> > alternative options?



Re: Lucene appreciation

2004-12-16 Thread Ben
Hi Rony

Very impressive. Is it possible for you to provide some information
about the technology behind it? For example, how do you crawl other job
sites, and how often do you do it? Do you use any other open source
software, and if so, which?

I think you should clean up the data in the "Recent Searches" area, it
doesn't make sense for me to see:

company%3Amicrosoft

It does make sense if you display:

company:microsoft

Cheers,
Ben
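
The "%3A" in the example above is the URL-encoded form of ":", so the
display problem can be addressed by decoding the stored query string before
showing it. A minimal sketch, assuming the recent searches are stored in
URL-encoded form as UTF-8; RecentSearchLabel and display are hypothetical
names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: turns a URL-encoded query into readable text.
class RecentSearchLabel {
    static String display(String rawQuery) {
        try {
            // Decodes %XX escapes and turns '+' into a space.
            return URLDecoder.decode(rawQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(display("company%3Amicrosoft"));
    }
}
```

If the decoded text is written into an HTML page it should still be
HTML-escaped afterwards.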


On Thu, 16 Dec 2004 11:38:20 -0500, Erik Hatcher
<[EMAIL PROTECTED]> wrote:
> Rony - nice work!  I subscribed to an alert already.
> 
> The wiki is self-serve, just log in and add yourself.
> 
> Erik
> 
> 
> On Dec 16, 2004, at 11:26 AM, Rony Kahan wrote:
> > I'd like to introduce myself and say thanks. We've recently launched
> > http://www.indeed.com, a search engine for jobs based on Lucene.  I'm
> > consistently impressed with the quality, professionalism and support
> > of the Lucene project and the Lucene community. This mailing list has
> > been a great help. I'd also like to give mention to some of the
> > consultants who had a big hand in making our project a reality ...
> > Thank you Otis, Aviran, Sergiu & Dawid.
> >
> > As for our project, we're in beta and would love to get your feedback.
> > The index size is currently ~1.8m jobs. My personal email address is
> > rony a_t indeed.com. If you are interested in Lucene work you can set
> > up an rss feed or email alert from here:
> > http://www.indeed.com/search?q=lucene&sort=date
> >
> > Is it possible to be added to the Wiki Powered By page?
> >
> > Thanks Everyone,
> > Rony
> >
> >
> > Indeed.com - one search. all Jobs.
> > http://www.indeed.com



Re: C# Ports

2004-12-15 Thread Ben Litchfield


I have created a DLL from the lucene jars for use in the PDFBox project.
It uses IKVM(http://www.ikvm.net) to create a DLL from a jar.

The binary version can be found here
http://www.csh.rit.edu/~ben/projects/pdfbox/nightly-release/PDFBox-.NET-0.7.0-dev.zip

This includes the ant script used to create the DLL files.

This method is by far the easiest way to port it, see previous posts about
advantages and disadvantages.

Ben


On Wed, 15 Dec 2004, Garrett Heaver wrote:

> I was just wondering what tools (JLCA?) people are using to port Lucene to
> C#, as I'd be very interested in converting things like snowball stemmers,
> wordnet etc.
>
>
>
> Thanks
>
> Garrett



Re: QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Ben Rooney
thanks chris,

you are correct that i'm not sure if i need the caching ability.  it is
more to understand right now so that if we do need to implement it, i am
able to.

the reason for the caching is that we will have listing pages for
certain content types.  for example a listing page of articles.  this
listing will be generated against lucene engine using a basic query.
the page will also have the ability to filter the articles based on date
range as one example.  so caching those results could be beneficial.

however, we will also potentially want to cache the basic query so that
subsequent queries will hit a cache.  when new content is published or
content is removed from the site, the caches will need to be invalidated
so new results are created.

for the basic query, is there any caching mechanism built into the
SearchIndexer or do we need to build our own caching mechanism?

thanks
ben
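
If an application-level cache for the basic query turns out to be needed, it
can be kept separate from the search code: key the cached results by query
text and clear everything when content is published or removed. The sketch
below is a library-independent illustration under stated assumptions;
QueryResultCache is a hypothetical name, and the function passed to its
constructor stands in for whatever call actually runs the search.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache: results are stored per query text and dropped all at
// once when the underlying content changes.
class QueryResultCache<R> {
    private final Map<String, R> cache = new ConcurrentHashMap<>();
    private final Function<String, R> search; // stand-in for the real search call

    QueryResultCache(Function<String, R> search) {
        this.search = search;
    }

    // Runs the search only the first time a given query text is seen.
    R get(String queryText) {
        return cache.computeIfAbsent(queryText, search);
    }

    // Call this when content is published or removed.
    void invalidate() {
        cache.clear();
    }

    public static void main(String[] args) {
        final int[] searchesRun = {0};
        QueryResultCache<String> cache = new QueryResultCache<String>(q -> {
            searchesRun[0]++;
            return "results for " + q;
        });

        cache.get("type:article");
        cache.get("type:article");   // served from the cache
        cache.invalidate();
        cache.get("type:article");   // searched again after invalidation
        System.out.println("searches run: " + searchesRun[0]);
    }
}
```

Clearing the whole map is the simplest invalidation policy; whether it is
fine-grained enough depends on how often content changes.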

On Tue, 2004-07-12 at 12:29 -0800, Chris Hostetter wrote:

> : > executes the search, i would keep a static reference to SearchIndexer
> : > and then when i want to invalidate the cache, set it to null or create
> 
> : design of your system.  But, yes, you do need to keep a reference to it
> : for the cache to work properly.  If you use a new IndexSearcher
> : instance (I'm simplifying here, you could have an IndexReader instance
> : yourself too, but I'm ignoring that possibility) then the filtering
> : process occurs for each search rather than using the cache.
> 
> Assuming you have a finite number of Filters, and assuming those Filters
> are expensive enough to be worth it...
> 
> Another approach you can take to "share" the cache among multiple
> IndexReaders is to explicitly call the bits method on your filter(s) once,
> and then cache the resulting BitSet anywhere you want (ie: serialize it to
> disk if you so choose), and then implement a "BitsFilter" class that you
> can construct directly from a BitSet regardless of the IndexReader.  The
> down side of this approach is that it will *ONLY* work if you are certain
> that the index is never being modified.  If any documents get added, or
> the index gets re-optimized you must regenerate all of the BitSets.
> 
> (That's why the CachingWrapperFilter's cache is keyed off of the
> IndexReader ... as long as you're re-using the same IndexReader, it knows
> that the cached BitSet must still be valid, because an IndexReader
> always sees the same index as when it was opened, even if another
> thread/process modifies it.)
> 
> 
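>   class BitsFilter extends Filter {
>     private final BitSet bits;
>     public BitsFilter(BitSet bits) {
>       this.bits = bits;
>     }
>     // Filter's method is bits(), and clone() returns Object, so cast
>     public BitSet bits(IndexReader r) {
>       return (BitSet) bits.clone();
>     }
>   }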
> 
> 
> 
> 
> -Hoss


Re: QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Ben Rooney
erik, thanks for the reply

i get the filter now and understand how the caching works.  however the
caching is only on the filtering level which means i can cache results
that are filtered.  but if i do a basic search against the index and
want to cache that, do i need to create my own caching mechanism or does
the SearchIndexer cache the results already?  if it caches them already,
then to clear the cache, is it again removing any references to the
SearchIndexer instance?

thanks again,
ben


On Tue, 2004-07-12 at 15:18 -0500, Erik Hatcher wrote:

> On Dec 7, 2004, at 3:06 PM, Ben Rooney wrote:
> > i'm trying to understand the difference/effects between QueryFilter vs
> > CachingWrapperFilter and when you would use one vs the other and how
> > they work exactly.
> 
> QueryFilter caches the results (bit set of documents) of a query by 
> IndexReader.
> 
> CachingWrapperFilter does not actually do any filtering of its own, but 
> merely wraps the results of another non-caching filter, such as 
> DateFilter.  CachingWrapperFilter was added to disconnect caching from 
> filtering.  QueryFilter is the exception as it came first and already 
> does caching.  If you're using QueryFilter, there is no need to concern 
> yourself with CachingWrapperFilter.
> 
> > also, when exactly will the cache be cleared.  looking at the source
> > code, it appears when the IndexReader is released it would be cleared.
> > does this mean i should keep a reference to the SearchIndexer until i
> > want the results to be cleared?  for example, in a class file that
> > executes the search, i would keep a static reference to SearchIndexer
> > and then when i want to invalidate the cache, set it to null or create
> > a new instance of it?
> 
> How you keep a reference to the IndexSearcher instance is up to the 
> design of your system.  But, yes, you do need to keep a reference to it 
> for the cache to work properly.  If you use a new IndexSearcher 
> instance (I'm simplifying here, you could have an IndexReader instance 
> yourself too, but I'm ignoring that possibility) then the filtering 
> process occurs for each search rather than using the cache.
> 
>   Erik


QueryFilter vs CachingWrapperFilter vs RangeQuery

2004-12-07 Thread Ben Rooney
nts",
analyzer);
Query rangeQuery = new RangeQuery(new Term("publishDate", "20040101"),
    new Term("publishDate", "20041231"), true);

BooleanQuery query2004 = new BooleanQuery();
query2004.add(query, true, false);
query2004.add(rangeQuery, true, false);

start = new Date();
for (int i = 0; i < 100; i++) {
    hits = searcher.search(query);
    if (i == 0) logger.debug(hits.length() + " total matching documents");
}
end = new Date();
logger.info("query 1 - all docs - total time (ms): "
    + (end.getTime() - start.getTime()));

start = new Date();
for (int i = 0; i < 100; i++) {
    hits = searcher.search(query2004);
    if (i == 0) logger.debug(hits.length() + " total matching documents");
}
end = new Date();
logger.info("query 2 - 2004 range query - no cache - total time (ms): "
    + (end.getTime() - start.getTime()));

QueryFilter filter2004 = new QueryFilter(rangeQuery);
start = new Date();
for (int i = 0; i < 100; i++) {
    hits = searcher.search(query, filter2004);
    if (i == 0) logger.debug(hits.length() + " total matching documents");
}
end = new Date();
logger.info("query 3 - 2004 docs filter - no cache - total time (ms): "
    + (end.getTime() - start.getTime()));

CachingWrapperFilter cache2004 = new CachingWrapperFilter(filter2004);
start = new Date();
for (int i = 0; i < 100; i++) {
    hits = searcher.search(query, cache2004);
    if (i == 0) logger.debug(hits.length() + " total matching documents");
}
end = new Date();
logger.info("query 4 - 2004 docs filter - cached - total time (ms): "
    + (end.getTime() - start.getTime()));

} catch (Exception e) {
    logger.error("unexpected exception trying to execute search", e);
}

}
}



thanks in advance for any help
ben




.NET Version of Lucene

2004-12-06 Thread Ben Litchfield

I know there has been talk about a .NET version of Lucene.  I have been
looking into doing something similar for PDFBox and came across a project
called IKVM (http://www.ikvm.net/).  I don't believe it has been mentioned
on this list.

It is a little different approach than what people have been trying.
It uses the GNU classpath to bring all of the newer JDK classes into .NET
and you can run a command line app to create a DLL from a jar.  So for
example

ikvmc.exe -reference:ikvm.gnu.classpath.dll
-reference:IKVM.AWT.WinForms.dll -out:bin\lucene-1.4.2.dll
external\lucene-1.4.2.jar

The drawback is that you will need to include the ikvm.gnu.classpath.dll
in your project which is about 3 megs, but to be able to use lucene in
.NET and not have to use a manual process when a new version comes out is
pretty cool.  I have not gotten around to running the junit tests yet, but
that is next.

For PDFBox, which depends on ANT/junit/log4j/lucene, I was able to run the
jar->DLL process for each of those projects and run PDFBox in .NET without
a problem.

One licensing note, GNU Classpath is released as GPL "with an exception",
allowing it to be rereleased under a different license.  See
http://www.gnu.org/software/classpath/license.html for more details.

Ben




Re: PDF Indexing Error

2004-12-03 Thread Ben Litchfield

I don't think that is a good solution, as there are many bug fixes and
enhancements in the current version and you would never be able to
upgrade.

The message that you are seeing "You do not have permission to extract
text" is not a bug but intended functionality of PDFBox.  PDFBox honors
the security settings in a PDF, if you don't have permission to extract
the text then PDFBox won't allow you to do it, just as Acrobat will not
allow you to do it.

PDFBox supports *modification* of PDF documents as well as text
extraction.

Ben


On Fri, 3 Dec 2004, Luke Shannon wrote:

> Hi Ben;
>
> Actually I think I did update PDFBox. I will put it back to the version I
> previously had.
>
> Luke
>
> - Original Message -
> From: "Ben Litchfield" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Thursday, December 02, 2004 8:20 PM
> Subject: Re: PDF Indexing Error
>
>
> >
> > This error is because of security settings that have been applied to the
> > PDF document which disallow text extraction.
> >
> > Not sure why you would all of a sudden get this error, unless you upgraded
> > recently.  Older versions of PDFBox did not fully support PDF security.
> >
> > Ben
> >
> > On Thu, 2 Dec 2004, Luke Shannon wrote:
> >
> > > Hello All;
> > >
> > > Perhaps this should be on the PDFBox forum but I was curious if anyone
> > > has seen this error parsing PDF documents using packages other than
> > > PDFBox.
> > >
> > > /usr/tomcat/fb_hub/GM/Administration/Document/java/java_io.pdf
> > > java.io.IOException: You do not have permission to extract text
> > >
> > > The weird thing is it gave this error on a document I have indexed a
> > > million times over the last 3 weeks.
> > >
> > > Thanks,
> > >
> > > Luke



Re: PDF Indexing Error

2004-12-02 Thread Ben Litchfield

This error is because of security settings that have been applied to the
PDF document which disallow text extraction.

Not sure why you would all of a sudden get this error, unless you upgraded
recently.  Older versions of PDFBox did not fully support PDF security.

Ben

On Thu, 2 Dec 2004, Luke Shannon wrote:

> Hello All;
>
> Perhaps this should be on the PDFBox forum but I was curious if anyone has
> seen this error parsing PDF documents using packages other than PDFBox.
>
> /usr/tomcat/fb_hub/GM/Administration/Document/java/java_io.pdf
> java.io.IOException: You do not have permission to extract text
>
> The weird thing is it gave this error on a document I have indexed a million
> times over the last 3 weeks.
>
> Thanks,
>
> Luke



Re: PDF Index Time

2004-11-18 Thread Ben Litchfield

PDFBox is slow.  There is an open issue for it on the SourceForge site, I
am actively working on improving speed, and you should see significant
improvements in the next release.

I have not extensively tried the snowtide package but they have a trial
download and the docs show that it should be just as easy to integrate as
PDFBox is.  They list pricing on their site as well, which is nice; it
is not hidden as some software companies do.

Ben

On Thu, 18 Nov 2004, Luke Shannon wrote:

> Hi;
>
> I am using the PDFBox's getLuceneDocument method to parse my PDF
> documents. It returns good results and was very easy to integrate into
> the project. However it is slow.
>
> Does anyone know of a faster package? Someone mentioned snowtide on an
> earlier post. Anyone have experience with this package?
>
> Luke




Re: Need advice: what pdf lib to use?

2004-10-25 Thread Ben Litchfield

In order to write software that consumes PDF documents you must agree to a
list of conditions.  One of those conditions is that permissions specified
by the author of the PDF document are respected.

PDFBox complies with this statement.  If there is software that does not,
then they are in violation of copyright law.

That being said, PDFBox is open source so a user could make modifications
to the source code, or as a PDF library could change permissions on a
document.

Ben

On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote:

> Yes Ben, You are right.
>
> This would be correct functionality from a technical perspective. But look
> at it my way, through the eyes of an application programmer reporting to
> the big boss that c. 30% of the docs we deal with cannot be indexed
> because of this stupid limitation. Neither he nor I have any influence on
> the PDF owners, or any idea what made them create files with document
> security set.
>
> In short, if you could also implement the "incorrect functionality" that
> the "closed source" guys did, it would be really great!
>
> As far as sponsoring is concerned I would be ready to hack (or at least to
> try) it even for 1/3 of that fortune:)))
>
> J.
>
>
>
>
>
> Ben Litchfield <[EMAIL PROTECTED]>
> 25.10.2004 14:02
> Please respond to "Lucene Users List"
>
>
> To: Lucene Users List <[EMAIL PROTECTED]>
> cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
> Subject:Re: Need advice: what pdf lib to use?
> Category:
>
>
>
>
> PDFBox does not 'stumble' when it gives that message, that is correct
> functionality if that permission is not allowed.
>
> If your company is willing to pay a 'fortune' why not sponsor a change to
> an open source project for half a fortune.
>
> Ben
> http://www.pdfbox.org
>
> On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote:
>
> > PDFbox stumbles also with "class java.io.IOException with message:  - You
> > do not have permission to extract text" in case the doc is copy/print
> > protected.
> > I tested now the snowtide commercial product and it looks like it could
> > process these files as well. Performance was also not so bad. Unfortunately
> > the test result could not be considered as 100%, because the free version
> > processed just the first 8 pages.  After all this product costs a fortune
> > (as long as the company is ready to pay I don't really mind:))
> >
> > J.
> >
> >
> >
> >
> >
> > Robert Newson <[EMAIL PROTECTED]>
> > Sent by: news <[EMAIL PROTECTED]>
> > 24.10.2004 17:44
> > Please respond to "Lucene Users List"
> >
> >
> > To: [EMAIL PROTECTED]
> > cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
> > Subject:Re: Need advice: what pdf lib to use?
> > Category:
> >
> >
> >
> > [EMAIL PROTECTED] wrote:
> > > Hello all,
> > >
> > > I need a piece of advice/experience..
> > >
> > > What pdf parser (written in java) would you recommend?
> > >
> > > I played now with PDFBox-0.6.7a and would not say I was satisfied too much
> > > with it
> > >
> > > On certain pdf's (not well formatted but anyway readable with Acrobat) it
> > > ran into a dead loop (this I could fix in code),
> > > and on one file it produced "out of memory error" and killed jvm:( (this
> > > problem I could not identify yet)
> > >
> > > After all the performance was not too great as well: it took c. 19 h. to
> > > index 13000 files (c. 3.5Gb)
> > >
> > > Regards,
> > > J.
> > >
> > >
> > >
> >
> > On the specific problem of the "dead loop", I reported an instance of
> > this to Ben a week or so ago and he has fixed it in the latest
> > nightlies.  I expect an official release will include this bugfix soon.
> > The file in question was unreadable with any PDF software I have, but
> > someone managed to create it somehow...
> >
> > http://sourceforge.net/tracker/index.php?func=detail&aid=1037145&group_id=78314&atid=552832
> >
> > I've found pdfbox to be pretty good. The only time I get problems is
> > with corrupted or egregiously bad PDF files.
> >
> > B.




Re: Need advice: what pdf lib to use?

2004-10-25 Thread Ben Litchfield

PDFBox does not 'stumble' when it gives that message, that is correct
functionality if that permission is not allowed.

If your company is willing to pay a 'fortune' why not sponsor a change to
an open source project for half a fortune.

Ben
http://www.pdfbox.org

On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote:

> PDFbox stumbles also with "class java.io.IOException with message:  - You
> do not have permission to extract text" in case the doc is copy/print
> protected.
> I tested now the snowtide commercial product and it looks like it could
> process these files as well. Performance was also not so bad. Unfortunately
> the test result could not be considered as 100%, because the free version
> processed just the first 8 pages.  After all this product costs a fortune
> (as long as the company is ready to pay I don't really mind:))
>
> J.
>
>
>
>
>
> Robert Newson <[EMAIL PROTECTED]>
> Sent by: news <[EMAIL PROTECTED]>
> 24.10.2004 17:44
> Please respond to "Lucene Users List"
>
>
> To: [EMAIL PROTECTED]
> cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
> Subject:Re: Need advice: what pdf lib to use?
> Category:
>
>
>
> [EMAIL PROTECTED] wrote:
> > Hello all,
> >
> > I need a piece of advice/experience..
> >
> > What pdf parser (written in java) would you recommend?
> >
> > I played now with PDFBox-0.6.7a and would not say I was satisfied too much
> > with it
> >
> > On certain pdf's (not well formatted but anyway readable with Acrobat) it
> > ran into a dead loop (this I could fix in code),
> > and on one file it produced "out of memory error" and killed jvm:( (this
> > problem I could not identify yet)
> >
> > After all the performance was not too great as well: it took c. 19 h. to
> > index 13000 files (c. 3.5Gb)
> >
> > Regards,
> > J.
> >
> >
> >
>
> On the specific problem of the "dead loop", I reported an instance of
> this to Ben a week or so ago and he has fixed it in the latest
> nightlies.  I expect an official release will include this bugfix soon.
> The file in question was unreadable with any PDF software I have, but
> someone managed to create it somehow...
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=1037145&group_id=78314&atid=552832
>
> I've found pdfbox to be pretty good. The only time I get problems is
> with corrupted or egregiously bad PDF files.
>
> B.




Re: Need advice: what pdf lib to use?

2004-10-22 Thread Ben Litchfield

Please post any PDFBox issues you notice on the PDFBox sourceforge bug
list; if possible, attach/email any problem PDFs that you encounter.

There are some efforts underway to improve the speed of PDFBox; you can
monitor the progress at
http://sourceforge.net/tracker/index.php?func=detail&aid=1046300&group_id=78314&atid=552832

As for other suggestions, I know some people have utilized xpdf (open
source but non-Java) to extract the text.

For other Java solutions:
PDFTextStream (commercial) - "Fastest PDF-to-Text Solution for Java"
http://snowtide.com/home/PDFTextStream/

Etymon PJ (GPL)
http://www.etymon.com/

Ben
http://www.pdfbox.org



On Fri, 22 Oct 2004 [EMAIL PROTECTED] wrote:

> Hello all,
>
> I need a piece of advice/experience..
>
> What pdf parser (written in java) would you recommend?
>
> I played now with PDFBox-0.6.7a and would not say I was satisfied too much
> with it
>
> On certain pdf's (not well formatted but anyway readable with Acrobat) it
> ran into a dead loop (this I could fix in code),
> and on one file it produced "out of memory error" and killed jvm:( (this
> problem I could not identify yet)
>
> After all the performance was not too great as well: it took c. 19 h. to
> index 13000 files (c. 3.5Gb)
>
> Regards,
> J.
>
>
>




Re: Google Desktop Could be Better

2004-10-16 Thread Ben Litchfield

The latest PDFBox jar is 2179K, which, as you point out, is significantly
larger than the jar in Parsnips.  The majority of that space is used by cmap
mapping files used for proper text extraction so any classes that could be
removed would only result in a minor size reduction.  I would think that
the capability of indexing PDF documents would outweigh the extra time for
the download.
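
For anyone who wants to check where a jar's bytes go, here is a small sketch
using only the JDK.  It totals compressed entry sizes per directory; the
class name and the grouping by directory are my own choices for
illustration, nothing PDFBox-specific.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class JarSizeReport {
    /** Sums entry sizes per parent directory; top-level entries go under "(root)". */
    static Map<String, Long> sizeByDirectory(Map<String, Long> sizeByEntry) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> e : sizeByEntry.entrySet()) {
            String name = e.getKey();
            int slash = name.lastIndexOf('/');
            String dir = slash >= 0 ? name.substring(0, slash) : "(root)";
            totals.merge(dir, e.getValue(), Long::sum);
        }
        return totals;
    }

    /** Reads entry names and compressed sizes from a jar (a jar is a zip archive). */
    static Map<String, Long> readEntries(String jarPath) {
        Map<String, Long> sizes = new TreeMap<>();
        try (ZipFile zip = new ZipFile(jarPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory()) {
                    // getCompressedSize() returns -1 when unknown; count that as 0
                    sizes.put(entry.getName(), Math.max(0L, entry.getCompressedSize()));
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sizes;
    }

    public static void main(String[] args) {
        if (args.length == 1) {
            System.out.println(sizeByDirectory(readEntries(args[0])));
        } else {
            System.err.println("usage: java JarSizeReport <path-to-jar>");
        }
    }
}
```

Running it against a jar prints one total per directory, which makes it easy
to see whether resource files or classes dominate.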

Ben




On Sat, 16 Oct 2004, Bill Tschumy wrote:

>
> On Oct 16, 2004, at 9:47 PM, Ben Litchfield wrote:
>
> >
> >> types.  It uses Lucene underneath.  I'm thinking about extending it in
> >> the direction that Google Desktop is going and automatically index
> >> certain file types and directories in your system.
> >
> > And of course supporting PDF documents right!
> >
> > Ben
> > http://www.pdfbox.org
> >
>
> Ahem...  right...  My next version will do a better job with PDF and
> RTF files.  I've looked at pdfBox, but the jar file is so big that I
> hate to burden my users by incorporating it.  Any chance of getting a
> smaller version that just does the text extraction?  Your jar file is
> more than twice the size of my entire application including
> documentation.  I really would like to solve this problem.
> --
> Bill Tschumy
> Otherwise -- Austin, TX
> http://www.otherwise.com




Re: Google Desktop Could be Better

2004-10-16 Thread Ben Litchfield

> types.  It uses Lucene underneath.  I'm thinking about extending it in
> the direction that Google Desktop is going and automatically index
> certain file types and directories in your system.

And of course supporting PDF documents right!

Ben
http://www.pdfbox.org




RE: Highlighting PDF file after the search

2004-09-27 Thread Ben Litchfield

With some work this is possible with PDFBox.  PDFBox extracts text with
positioning and sizing.  Once the text is found, you could add the drawing
of a highlight box to the page content stream.

PDFBox has an open RFE for this functionality; please monitor it for
progress.

http://sourceforge.net/tracker/index.php?func=detail&aid=1035635&group_id=78314&atid=552835

Ben

On Mon, 27 Sep 2004 [EMAIL PROTECTED] wrote:

> Bruce,
> You are right. I tried this morning, and when I tried to stream the
> highlighter output as PDF, Acrobat was not able to read or open it!!
> Which project do you recommend that would do PDF highlighting?
>
> Thanks,
> Vijay Balasubramanian
> DPRA Inc.,
>
>
>
>
> Bruce Ritchie <[EMAIL PROTECTED]>
> 09/20/2004 05:35 PM
> Please respond to Lucene Users List
>
> To:       Lucene Users List <[EMAIL PROTECTED]>
> cc:
> Subject:  RE: Highlighting PDF file after the search
>
>
>
>
>
>
> > From: [EMAIL PROTECTED]
>
> > I can successfully index and search the PDF documents,
> > however i am not able to highlight the searched text in my
> > original PDF file (ie: like dtSearch highlights on original file)
> >
> > I took a look at the highlighter in sandbox, compiled it and
> > have it ready.  I am wondering if this highlighter is for
> > highlighting indexed documents or can it be used for PDF
> > Files as is !  Please enlighten !
>
> The highlighter code in sandbox can facilitate highlighting of text
> *extracted* from the PDF, however it does nothing for you to highlight
> search terms *inside* of the PDF. For that you will need some sort of tool
> that can modify the PDF on the fly as the user views it. I know of no quick
> and dirty tool that allows you to do this, though there are quite a few
> projects and products which allow you to manipulate PDF files which likely
> can be used to obtain the behavior you are looking for (with some effort on
> your part).
>
>
> Regards,
>
> Bruce Ritchie




Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Ben Litchfield
> I can say that gc is not collecting these objects since I  forced gc
> runs when indexing every now and then (when parsing pdf-type objects,
> that is): No effect.

What PDF parser are you using?  Is the problem within the parser and not
lucene?  Are you releasing all resources?
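
To illustrate the last question: whatever parser is used, each parsed
document should be closed even when extraction fails, otherwise its memory
and file handles stay reachable.  The sketch below uses a stand-in class
(ParsedDocument is hypothetical, not a PDFBox or Lucene type) and a
try-with-resources block.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

final class ReleaseResources {
    /** Stand-in for a parsed document that holds memory or file handles until closed. */
    static final class ParsedDocument implements Closeable {
        static final AtomicInteger OPEN = new AtomicInteger();

        ParsedDocument() {
            OPEN.incrementAndGet();
        }

        String text() {
            return "extracted text";
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    /** Extracts text and always closes the document, even if extraction throws. */
    static String extract(boolean fail) {
        try (ParsedDocument doc = new ParsedDocument()) {
            if (fail) {
                throw new IllegalStateException("simulated parse failure");
            }
            return doc.text();
        }
    }

    public static void main(String[] args) {
        extract(false);
        try {
            extract(true);
        } catch (IllegalStateException expected) {
            // the document was still closed by try-with-resources
        }
        System.out.println("documents still open: " + ParsedDocument.OPEN.get());
    }
}
```

On a JVM without try-with-resources the same effect comes from closing the
document in a finally block.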

Ben




Re: PDF->Text Performance comparison

2004-09-09 Thread Ben Litchfield
>  1) I tried to migrate to newer versions (0.6.4, 0.6.5, 0.6.6), but all the time I had
>  problems with parsing the same pdf documents, which worked well for
>  0.6.3. I mentioned my problems here:
>   https://sourceforge.net/tracker/?func=detail&atid=552832&aid=1021691&group_id=78314

I am waiting for a response from you on this issue.  Try to log in to SF
when posting bugs so you get a notification when it is updated.



>  2) When I started with 0.6.3 I experienced performance problems
>  too, especially with large pdf documents (I had several with more
>  than 20MB size). I changed the source a bit, wrapping the following line
>  of BaseParser class:

I will give that a try, thanks for letting me know.

Ben




Re: PDF->Text Performance comparison

2004-09-08 Thread Ben Litchfield

Yes, that and a few other adjectives, but I didn't want to get carried
away.

Ben


On Wed, 8 Sep 2004, Doug Cutting wrote:

> Ben Litchfield wrote:
> > PDFBox: slow PDF text extraction for Java applications
> > http://www.pdfbox.org
>
> Shouldn't that read, "PDFBox: *free* slow PDF text extraction for Java
> applications, with Lucene integration"?
>
> Doug




PDF->Text Performance comparison

2004-09-08 Thread Ben Litchfield

On Wed, 8 Sep 2004, Chas Emerick wrote:
> PDFTextStream: fast PDF text extraction for Java applications
> http://snowtide.com/home/PDFTextStream/


For those that have not seen, snowtide.com has done a performance
comparison against several Java PDF->Text libraries, including Snowtide's
PDFTextStream, PDFBox, Etymon PJ and JPedal.  It appears to be fairly well
done.

http://snowtide.com/home/PDFTextStream/Performance


PDFBox: slow PDF text extraction for Java applications
http://www.pdfbox.org

:)

Ben





Re: pdf in Chinese

2004-09-08 Thread Ben Litchfield

This appears to be more of a PDFBox issue than a Lucene issue; please post
an issue to the PDFBox site.

Also note that, because of certain encodings that a PDF writer can use, it
is not possible to extract text from every PDF document.

Ben

On Wed, 8 Sep 2004, [EMAIL PROTECTED] wrote:

> It is not about the analyzer; I need to read the text from the PDF file first.
>
> - Original Message -
> From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, September 08, 2004 4:15 PM
> Subject: Re: pdf in Chinese
>
>
> > Which analyzer are you using to index Chinese PDF documents?
> > I think you should use CJKAnalyzer.
> > - Original Message -
> > From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Wednesday, September 08, 2004 11:27 AM
> > Subject: pdf in Chinese
> >
> >
> > > Hi all,
> > > I use pdfbox to parse PDF files into Lucene documents. When I parse
> > > Chinese PDF files, pdfbox does not always succeed.
> > > Does anyone have some advice?




Moving from a single server to a cluster

2004-09-07 Thread Ben Sinclair
My application currently uses Lucene with an index living on the
filesystem, and it works fine. I'm moving to a clustered environment
soon and need to figure out how to keep my indexes together. Since the
index is on the filesystem, each machine in the cluster will end up
with a different index.

I looked into JDBC Directory, but it's not tested under Oracle and
doesn't seem like a very mature project.

What are other people doing to solve this problem?

-- 
Ben Sinclair
[EMAIL PROTECTED]




Re: PDF indexing

2004-08-24 Thread Ben Litchfield


You need to add the log4j.jar to your classpath.
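
A quick way to confirm whether a class is visible to your program is to ask
the class loader for it.  This is only a sketch using the JDK; the class
name ClasspathCheck is made up for the example, and
org.apache.log4j.Category is the class named in the NoClassDefFoundError in
this thread.

```java
final class ClasspathCheck {
    /** Returns true if the named class can be found by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("log4j visible: " + isOnClasspath("org.apache.log4j.Category"));
    }
}
```

Run it with the same -cp (or CLASSPATH) that you use for the indexer; if it
reports false, the log4j jar is not being picked up.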



On Tue, 24 Aug 2004, sivalingam T wrote:

>   Hi

I have written a class for PDF indexing. The code is as follows.

This is my IndexPDF file.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

import org.pdfbox.searchengine.lucene.LucenePDFDocument;

import java.io.File;
import java.util.Date;
import java.util.Arrays;

class IndexPDF {
  private static boolean deleting = false;  // true during deletion pass
  private static IndexReader reader;        // existing index
  private static IndexWriter writer;        // new index being built
  private static TermEnum uidIter;          // document id iterator

  public static void main(String[] argv) {
    try {
      String index = "index";
      boolean create = false;
      File root = null;

      String usage = "IndexHTML [-create] [-index <index>] <root_directory>";

      if (argv.length == 0) {
        System.err.println("Usage: " + usage);
        return;
      }

      for (int i = 0; i < argv.length; i++) {
        if (argv[i].equals("-index")) {           // parse -index option
          index = argv[++i];
        } else if (argv[i].equals("-create")) {   // parse -create option
          create = true;
        } else if (i != argv.length - 1) {
          System.err.println("Usage: " + usage);
          return;
        } else {
          root = new File(argv[i]);
        }
      }

      Date start = new Date();

      if (!create) {                              // delete stale docs
        deleting = true;
        indexDocs(root, index, create);
      }

      writer = new IndexWriter(index, new StandardAnalyzer(), create);
      writer.maxFieldLength = 100;

      indexDocs(root, index, create);             // add new docs

      System.out.println("Optimizing index...");
      writer.optimize();
      writer.close();

      Date end = new Date();

      System.out.print(end.getTime() - start.getTime());
      System.out.println(" total milliseconds");

    } catch (Exception e) {
      System.out.println(" caught a " + e.getClass() +
                         "\n with message: " + e.getMessage());
    }
  }

  /* Walk directory hierarchy in uid order, while keeping uid iterator from
   * existing index in sync.  Mismatches indicate one of: (a) old documents to
   * be deleted; (b) unchanged documents, to be left alone; or (c) new
   * documents, to be indexed.
   */
  private static void indexDocs(File file, String index, boolean create)
      throws Exception {
    if (!create) {                                // incrementally update

      reader = IndexReader.open(index);           // open existing index
      uidIter = reader.terms(new Term("uid", ""));  // init uid iterator

      indexDocs(file);

      if (deleting) {                             // delete rest of stale docs
        while (uidIter.term() != null && uidIter.term().field() == "uid") {
          System.out.println("deleting " +
                             HTMLDocument.uid2url(uidIter.term().text()));
          reader.delete(uidIter.term());
          uidIter.next();
        }
        deleting = false;
      }

      uidIter.close();                            // close uid iterator
      reader.close();                             // close existing index

    } else {                                      // don't have existing index
      indexDocs(file);
    }
  }

  private static void indexDocs(File file) throws Exception {
    if (file.isDirectory()) {                     // if a directory
      String[] files = file.list();               // list its files
      Arrays.sort(files);                         // sort the files
      for (int i = 0; i < files.length; i++) {    // recursively index them
        indexDocs(new File(file, files[i]));
      }
    }

    // No writer is open during the deletion pass, so only add documents
    // once the writer exists.
    if (!deleting
        && (file.getPath().endsWith(".pdf") || file.getPath().endsWith(".PDF"))) {
      System.out.println("Indexing PDF document: " + file);
      try {
        writer.addDocument(LucenePDFDocument.getDocument(file));
      } catch (Exception e) {
        // Report the failure instead of silently swallowing it.
        System.err.println("Failed to index " + file + ": " + e);
      }
    }
  }
}
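
The comment in the code above describes walking the files in uid order while
keeping the existing index's uid iterator in sync.  Stripped of Lucene, that
is a merge-style comparison of two sorted lists.  The sketch below shows the
idea with plain strings; UidMerge and Diff are hypothetical names, and it is
not the demo's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class UidMerge {
    /** Result of comparing the uids already in the index with those found on disk. */
    static final class Diff {
        final List<String> toDelete = new ArrayList<>();
        final List<String> toAdd = new ArrayList<>();
        final List<String> unchanged = new ArrayList<>();
    }

    /** Both lists must be sorted ascending; walks them in step, like a merge. */
    static Diff diff(List<String> indexed, List<String> onDisk) {
        Diff d = new Diff();
        int i = 0;
        int j = 0;
        while (i < indexed.size() && j < onDisk.size()) {
            int cmp = indexed.get(i).compareTo(onDisk.get(j));
            if (cmp < 0) {
                d.toDelete.add(indexed.get(i++));   // in the index, no longer on disk
            } else if (cmp > 0) {
                d.toAdd.add(onDisk.get(j++));       // on disk, not yet indexed
            } else {
                d.unchanged.add(indexed.get(i));    // same uid on both sides
                i++;
                j++;
            }
        }
        while (i < indexed.size()) {
            d.toDelete.add(indexed.get(i++));       // leftovers are stale
        }
        while (j < onDisk.size()) {
            d.toAdd.add(onDisk.get(j++));           // leftovers are new
        }
        return d;
    }

    public static void main(String[] args) {
        Diff d = diff(Arrays.asList("a", "b", "d"), Arrays.asList("b", "c", "d", "e"));
        System.out.println("delete=" + d.toDelete + " add=" + d.toAdd + " keep=" + d.unchanged);
    }
}
```

Because both sides are sorted, each uid is looked at once, which is why the
directory listing is sorted before recursing.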

When I use the following command, the exception below is thrown. If anybody
knows the cause, please let me know.


C:\>java org.apache.lucene.demo.IndexPDF -create -index c:\lucene\pdf c:\pdfs\Words.pdf

Indexing PDF document: c:\pdfs\Words.pdf
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Category
    at org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDocument.java:197)
    at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:118)
    at org.apache.lucene.demo.IndexPDF.indexDocs(Unknown Source)
    at org.apache.lucene.demo.IndexPDF.indexDocs(Unknown Source)
at org.apache.lucene.demo.Inde

Re: integration of lucene with pdfbox

2004-08-23 Thread Ben Litchfield


If you can use lucene on its own then you already know how to add a lucene
Document to the index.  So you need to be able to take a PDF and get a
lucene Document.

org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument()

does that for you.

Ben


On Mon, 23 Aug 2004, Santosh wrote:

> I have downloaded pdfbox and lucene and put the jar files on the classpath. I am able
> to work with both of them independently, but how can I integrate them?
>
> regards
> Santosh kumar
>
> ---SOFTPRO DISCLAIMER--
>
> Information contained in this E-MAIL and any attachments are
> confidential being  proprietary to SOFTPRO SYSTEMS  is 'privileged'
> and 'confidential'.
>
> If you are not an intended or authorised recipient of this E-MAIL or
> have received it in error, You are notified that any use, copying or
> dissemination  of the information contained in this E-MAIL in any
> manner whatsoever is strictly prohibited. Please delete it immediately
> and notify the sender by E-MAIL.
>
> In such a case reading, reproducing, printing or further dissemination
> of this E-MAIL is strictly prohibited and may be unlawful.
>
> SOFTPRO SYSYTEMS does not REPRESENT or WARRANT that an attachment
> hereto is free from computer viruses or other defects.
>
> The opinions expressed in this E-MAIL and any ATTACHEMENTS may be
> those of the author and are not necessarily those of SOFTPRO SYSTEMS.
> 
>




Re: Fw: pdf search

2004-08-20 Thread Ben Litchfield


In order to search through a PDF document the text must be extracted from
the PDF document.  There are several libraries to do that, including
http://www.pdfbox.org.  After you have the text from the PDF document you
just add it to the lucene index like any other text document.  You should
go through the intro tutorial to understand how to index/search text using
lucene.

Ben



On Fri, 20 Aug 2004, Santosh wrote:

> How can I search through PDF?
> - Original Message -
> From: Santosh
> To: Lucene Users List
> Sent: Friday, August 20, 2004 5:59 PM
> Subject: pdf search
>
>
> Hi,
>
> I am a newbie to Lucene.
>
> I have downloaded the zip file. Now how can I give my own list of words to Lucene?
> In the demo I saw that Lucene automatically creates an index when we run the Java
> program, but I want to give my own search words. How is that possible?
>
>
> regards
> Santosh kumar
> SoftPro Systems
> Hyderabad
>
>
> "The harder you train in peace, the lesser you bleed in war"




Re: PDFBox Issue

2004-08-17 Thread Ben Litchfield

PDFBox comes with log4j version 1.2.5 (according to MANIFEST.MF in the jar
file); I believe that 1.2.8 is the latest.  I will make sure that the next
version of PDFBox includes the latest log4j version, which I assume is
what everybody would like to use.

But, looking at the error message below, it appears that you might have
an older log4j in your classpath.

Logger.getLogger( Class ) is available in 1.2.5 and 1.2.8
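
To find out which jar a class is really being loaded from, something along
these lines can help.  It is a sketch using only the JDK; WhereLoaded is a
made-up name, and classes loaded by the JVM itself report no location.

```java
import java.security.CodeSource;

final class WhereLoaded {
    /** Describes where a class was loaded from, or notes that no location is recorded. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return (source == null || source.getLocation() == null)
                ? "(no code source, e.g. a JDK class)"
                : source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " <- " + locationOf(Class.forName(name)));
    }
}
```

For example, running it with org.apache.log4j.Logger as the argument, on the
same classpath as the failing program, shows whether an unexpected log4j jar
comes first.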


Ben


On Tue, 17 Aug 2004, Don Vaillancourt wrote:

> Wow, this is an old message.
>
> I managed to get my code to work by using the previous version of
> PDFBox.  I had used the version of log4j that had come with PDFBox.
>
> Someone had mentioned recompiling log4j, but I couldn't get the project
> to import the source into Eclipse, so I gave up.  But things work great
> with the version of PDFBox that I compiled with so I am fine with that.
>
> As for the version of log4j, I could not tell you, as I said above it
> came with PDFBox, so I'm guessing that it had probably not been tested
> with the version of log4j it was being distributed with.
>
> Paul Smith wrote:
>
> >What version of the log4j jar are you using?
> >
> >
> >
> >>-Original Message-
> >>From: Don Vaillancourt [mailto:[EMAIL PROTECTED]
> >>Sent: Tuesday, June 29, 2004 8:06 AM
> >>To: Lucene Users List
> >>Subject: PDFBox Issue
> >>
> >>Hi all,
> >>
> >>I know that this is a Lucene list but wanted to know if any of you have
> >>gotten this error before using PDFBox?
> >>
> >>I've gotten the latest version of PDFBox and it is giving me the following
> >>error:
> >>
> >>java.lang.VerifyError: (class: org/apache/log4j/LogManager, method:
> >> signature: ()V) Incompatible argument to function
> >>at org.apache.log4j.Logger.getLogger(Logger.java:94)
> >>at org.pdfbox.pdfparser.PDFParser.(PDFParser.java:57)
> >>at org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDocument.java:197)
> >>at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:118)
> >>at Index.indexFile(Index.java:287)
> >>at Index.indexDirectory(Index.java:265)
> >>at Index.update(Index.java:63)
> >>at Lucene.main(Lucene.java:26)
> >>Exception in thread "main"
> >>
> >>I am using all the jar files that came with PDFBox.
> >>
> >>Has anyone run into this problem?  I am using the following line of code:
> >>
> >>Document doc = LucenePDFDocument.getDocument(f);
> >>
> >>Thanks
> >>
> >>
> >>Don Vaillancourt
> >>Director of Software Development
> >>
> >>WEB IMPACT INC.
> >>416-815-2000 ext. 245
> >>email: [EMAIL PROTECTED]
> >>web: http://www.web-impact.com
> >>
> >>
> >>
> >>
> >>This email message is intended only for the addressee(s)
> >>and contains information that may be confidential and/or
> >>copyright.  If you are not the intended recipient please
> >>notify the sender by reply email and immediately delete
> >>this email. Use, disclosure or reproduction of this email
> >>by anyone other than the intended recipient(s) is strictly
> >>prohibited. No representation is made that this email or
> >>any attachments are free of viruses. Virus scanning is
> >>recommended and is the responsibility of the recipient.
>
> --
> *Don Vaillancourt
> Director of Software Development
> *
> *WEB IMPACT INC.*
> phone: 416-815-2000 ext. 245
> fax: 416-815-2001
> email: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
> web: http://www.web-impact.com




Re: pdfbox performance.

2004-07-28 Thread Ben Litchfield


Different PDFs will exhibit different extraction speeds because of the way
that PDF documents are structured.

I assume you are using the latest version, 0.6.6.  Could you give 0.6.5 a
try and see if you notice faster speeds?

Ben

On Thu, 29 Jul 2004, Miroslaw Milewski wrote:

> Paul Smith wrote:
>
>   > The first thing that I would do is wrap the FileInputStream with a
>   > BufferedInputStream.
>   > Change:
>   > > FileInputStream reader = new FileInputStream(file);
>   > To:
>   > InputStream reader = new BufferedInputStream(new
>   > FileInputStream(file));
>   > You get a significant boost reading in from a buffer, particularly as
>   > the size of the file grows. Try that first, and then rebenchmark.
>
>   I tested both, here is the code:
>
> File file = new File("test.pdf");
> InputStream reader = null;
>
> for(int i=1; i<=6; i++) {
>
>long step01 = Calendar.getInstance().getTimeInMillis();
>String stream = null;
>
>if(i%2 == 0) {
>  reader = new BufferedInputStream(new FileInputStream(file));
>  stream = "buffered";
>}
>else {
>  reader = new FileInputStream(file);
>  stream = "no buffer";
>}
>
>PDFParser parser = null;
>PDDocument pdDoc = null;
>
>parser = new PDFParser(reader);
>parser.parse();
>pdDoc = parser.getPDDocument();
>
>long step02 = Calendar.getInstance().getTimeInMillis();
>
>PDFTextStripper stripper = new PDFTextStripper();
>String pdftext = stripper.getText(pdDoc);
>
>long step03 = Calendar.getInstance().getTimeInMillis();
>
>pdDoc.close();
>
>long end = Calendar.getInstance().getTimeInMillis();
>
>System.out.println("iteration: " + i + " - " + stream);
>System.out.println("start: " + start);
>System.out.println("step01: " + (step01-start));
>System.out.println("step02: " + (step02-start));
>System.out.println("step03: " + (step03-start));
>System.out.println("end: " + (end-start));
> }
>
>   And below are the benchmarks for buffered and unbuffered readers. The
> difference is not stunning. It seems to get better with time, but this
> is probably due to some VM optimisation. And I'll extract the text only
> once :-).
>
> file: 9kB, text only;
>
> iteration: 1 - no buffer
> step01: 0; step02: 1492; step03: 13850; end: 13880
>
> iteration: 2 - buffered
> step01: 0; step02: 912; step03: 10245; end: 10265
>
> iteration: 3 - no buffer
> step01: 0; step02: 951; step03: 9924; end: 9944
>
> iteration: 4 - buffered
> step01: 0; step02: 842; step03: 10075; end: 10105
>
> iteration: 5 - no buffer
> step01: 0; step02: 831; step03: 9934; end: 9954
>
> iteration: 6 - buffered
> step01: 0; step02: 932; step03: 9944; end: 9965
>
>
> file: 74 kB; text only
>
> iteration: 1 - no buffer
> step01: 0; step02: 4918; step03: 33959; end: 33989
>
> iteration: 2 - buffered
> step01: 0; step02: 4367; step03: 32367; end: 32407
>
> iteration: 3 - no buffer
> step01: 0; step02: 4306; step03: 30995; end: 31025
>
> iteration: 4 - buffered
> step01: 0; step02: 4296; step03: 30734; end: 30764
>
> iteration: 5 - no buffer
> step01: 0; step02: 4266; step03: 30754; end: 30784
>
> iteration: 6 - buffered
> step01: 0; step02: 4256; step03: 30634; end: 30664
>
>
> file: 270 kB, text only
>
> iteration: 1 - no buffer
> step01: 0; step02: 30634; step03: 142225; end: 142265
>
> iteration: 2 - buffered
> step01: 0; step02: 29893; step03: 135354; end: 135394
>
> iteration: 3 - no buffer
> step01: 0; step02: 29553; step03: 134654; end: 134694
>
> iteration: 4 - buffered
> step01: 0; step02: 29613; step03: 134944; end: 134984
>
> iteration: 5 - no buffer
> step01: 0; step02: 29543; step03: 139070; end: 139110
>
> iteration: 6 - buffered
> step01: 0; step02: 32427; step03: 150457; end: 150487
>
>   Anyway, I suppose I made a wrong assumption while designing my app. I
> don't think I can get a performance boost of 90% or so. Thus the
> documents (at least the .pdfs) won't be extracted and indexed at the
> time of adding them to the knowledge base.
>   Since I also have a db involved, I can keep the basic data there, and
> extract and index in the meantime - most likely using a different thread.
>
>   thx,
> --
>   Miroslaw Milewski
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PDFBox problem.

2004-07-23 Thread Ben Litchfield


I usually use -Dlog4j.configuration=log4j.xml when invoking java from
the command line, but I believe this depends on your environment.

ex

java -Dlog4j.configuration=log4j.xml org.pdfbox.ExtractText input.pdf

Ben



On Fri, 23 Jul 2004, Christiaan Fluit wrote:

> We invoke the following code in a static initializer that simply
> disables log4j's output entirely.
>
>   static {
>   Properties props = new Properties();
>   props.put("log4j.threshold", "OFF");
>   org.apache.log4j.PropertyConfigurator.configure(props);
>   }
>
> Of course, when you make use of log4j in your own code, you have to be
> more specific.
>
>
> Regards,
>
> Chris.
> --
>
> Natarajan.T wrote:
>
> > FYI,
> >
> > I am using PDFBox.jar  to Convert PDF to Text.
> >
> > The problem is that at runtime it prints a lot of object messages.
> >
> > How can I avoid this?
> >
> > import java.io.InputStream;
> > import java.io.BufferedWriter;
> > import java.io.IOException;
> >
> > import org.pdfbox.util.PDFTextStripper;
> > import org.pdfbox.pdfparser.PDFParser;
> > import org.pdfbox.pdmodel.PDDocument;
> > import org.pdfbox.pdmodel.PDDocumentInformation;
> >
> >
> > /**
> >  * @author natarajant
> >  *
> >  * TODO To change the template for this generated type comment go to
> >  * Window - Preferences - Java - Code Generation - Code and Comments  */
> > public class PDFConverter extends DocumentConverter{
> >
> >   public PDFConverter() {
> >   }
> >
> >/**
> > * This method will construct the Lucene document object from the
> > * given information by extracting the text from PDF file.
> > *
> > * @param  reader and writer - InputStream
> > and BufferedWriter
> > * @return true or false i.e. extract the
> > text or not
> > */
> > public boolean extractText(InputStream  reader, BufferedWriter
> > writer) throws IOException{
> >
> >  PDFParser parser = null;
> >  PDDocument pdDoc = null;
> >  PDFTextStripper stripper = null;
> >  String pdftext = "";
> >  String pdftitle = "";
> >  try {
> >  parser = new PDFParser(reader);
> >parser.parse();
> >pdDoc = parser.getPDDocument();
> >
> >stripper = new PDFTextStripper();
> >pdftext = stripper.getText(pdDoc);
> >
> >writer.write(pdftext +" ");
> >
> >  PDDocumentInformation info =
> > pdDoc.getDocumentInformation();
> >pdftitle = info.getTitle();
> >
> >} catch(Exception err) {
> >
> >System.out.println(err.getMessage());
> > }
> > writer.close();
> > return true;
> >}
> >
> >
> > }
> >
> >
>
>
> --
> [EMAIL PROTECTED]
>
> Aduna
> Prinses Julianaplein 14-b
> 3817 CS Amersfoort
> The Netherlands
>
> +31 33 465 9987 phone
> +31 33 465 9987 fax
>
> http://aduna.biz
>
>



RE: Building query to match a sub-string of a field

2004-06-29 Thread Ben Pryor
If you are building a query using the API, the WildcardQuery class will
allow you to use a leading wildcard character. The QueryParser will not
allow this, however, so if you're getting queries using the QueryParser a
leading wildcard won't work.

I have successfully done substring queries through the API using code
previously posted to the list:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06388.html
I haven't run into any performance problems because of these classes.

There were a few minor changes that needed to be made to that code to make
it work with the latest Lucene 1.4RC3 - I think it was just a matter of
changing a constructor signature.
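
To make the cost of a substring match concrete, here is a minimal sketch in
plain Java. It is not Lucene code: it only assumes that an index keeps its
terms in sorted order, and the names SubstringTermScan and termsContaining
are made up for illustration. With the API, the equivalent query would
presumably be built as new WildcardQuery(new Term("contents", "*test*")),
where the field name "contents" is just an example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubstringTermScan {
    /** Returns every term that contains the substring, i.e. what "*sub*" asks for. */
    static List<String> termsContaining(List<String> sortedTerms, String sub) {
        List<String> hits = new ArrayList<>();
        // A trailing wildcard ("test*") could jump straight to the first term that
        // starts with "test" in a sorted list. A leading wildcard offers no such
        // starting point, so every term has to be visited.
        for (String term : sortedTerms) {
            if (term.contains(sub)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical term list, using the words from the original question.
        List<String> terms = Arrays.asList("apple", "contest", "contestable", "test", "testing");
        System.out.println(termsContaining(terms, "test"));
    }
}
```

The full scan is harmless on a small term list, which would fit with not
seeing performance problems, but it grows with the number of distinct terms
in the field.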

Ben

-Original Message-
From: Terence Lai [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 29, 2004 4:29 PM
To: [EMAIL PROTECTED]
Subject: Building query to match a sub-string of a field

Hi Everyone,

I am trying to construct a query which matches a sub-string of a field. As
an illustration, I would like to search the following words by using the
sub-string "test":

- test
- testing
- contest
- contestable

I realize that Lucene does support wildcard searches using "*" and "?" in
the custom query. Therefore, the query string "*test*" should give me the
right result. However, the Lucene query syntax
(http://jakarta.apache.org/lucene/docs/queryparsersyntax.html) does not
allow the wildcard "*" as the first character of the search. Therefore, the
query "*test*" is invalid. Does anyone have a solution to build the query to
achieve the same result?

Thanks,
Terence




queryparser: parsing boolean logic

2004-06-10 Thread Ben Pryor
Here is a follow-up to a previous message I posted, dealing with converting
user-entered boolean logic into a Query. Why does the QueryParser construct
the same query for the following two strings?

 

"apple AND orange OR pear AND grape"

"apple AND orange AND pear AND grape"

 

I think a user's expectation would be that the first query matches things
containing apple and orange, or containing pear and grape. And that the
second query would only match things containing all four items. However, the
same query is constructed both times (the constructed query requires all
four).

 

package collective.search.lucene.tests;

 

import org.apache.lucene.analysis.standard.StandardAnalyzer;

import org.apache.lucene.queryParser.ParseException;

import org.apache.lucene.queryParser.QueryParser;

import org.apache.lucene.search.Query;

 

import junit.framework.TestCase;

 

public class QPTest extends TestCase

{

public QPTest(String arg0)

{

super(arg0);

}

 

private void display(String s, Query q)

{

System.out.println("\"" + s + "\" = \"" +
q.toString() + "\"");

}

 

public void testBooleanConstruction() throws ParseException

{

String test1 = "apple AND orange OR pear AND grape";

String test2 = "apple AND orange AND pear AND grape";



QueryParser qp = new QueryParser("df", new
StandardAnalyzer());



Query query1 = qp.parse(test1);

Query query2 = qp.parse(test2);



display(test1, query1);

display(test2, query2);

}

}
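
One way to see how both strings can end up as the same query is a toy model
of a parser that reads the input left to right into one flat clause list,
with no grouping. The sketch below is plain Java and only an assumption
about the mechanism, not Lucene's actual code; FlatBooleanModel and
parseFlat are hypothetical names. In this model AND marks both of its
neighbours as required and OR changes nothing.

```java
import java.util.ArrayList;
import java.util.List;

public class FlatBooleanModel {
    /** One clause of the flat result: a term plus its "required" flag. */
    static final class Clause {
        final String term;
        boolean required;
        Clause(String term) { this.term = term; }
    }

    /** Reads "a AND b OR c" left to right into one flat list, with no grouping. */
    static String parseFlat(String input) {
        List<Clause> clauses = new ArrayList<>();
        String conjunction = null; // operator seen just before the current term
        for (String token : input.trim().split("\\s+")) {
            if (token.equals("AND") || token.equals("OR")) {
                conjunction = token;
                continue;
            }
            Clause current = new Clause(token);
            if ("AND".equals(conjunction)) {
                // AND makes the previous clause and the current clause required.
                if (!clauses.isEmpty()) {
                    clauses.get(clauses.size() - 1).required = true;
                }
                current.required = true;
            }
            // OR changes nothing: the clause simply stays optional.
            clauses.add(current);
            conjunction = null;
        }
        StringBuilder out = new StringBuilder();
        for (Clause c : clauses) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(c.required ? "+" : "").append(c.term);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(parseFlat("apple AND orange OR pear AND grape"));
        System.out.println(parseFlat("apple AND orange AND pear AND grape"));
    }
}
```

Both calls print "+apple +orange +pear +grape": by the time the OR is read,
the clauses on either side of it are already required by their own ANDs.
Grouping the input explicitly, as in "(apple AND orange) OR (pear AND
grape)", should avoid the problem, since the query syntax supports
parentheses.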



building a search query

2004-06-09 Thread Ben Pryor
I am working on a UI to allow a user to build a search query. The user
creates individual "clauses", each of which is basically a simple search
query. The user selects boolean operators (AND, OR, NOT), to connect these
clauses. When the user is finished constructing the search, there will be N
clauses and N-1 boolean connectors.

 

Each clause is backed by an object that knows how to generate a Lucene Query
from the clause. The objective is to combine the clauses and the boolean
operators into a BooleanQuery. 

 

What is the best way to programmatically make the final BooleanQuery object?
It seems there is a modeling mismatch: the user sees N clauses connected
with N-1 connectors, but the BooleanQuery will require N Querys with each
Query having its own required and prohibited flags set correctly.

 

I looked briefly at the QueryParser class - it appears to have logic to
bridge these two different ways of modeling complex queries (in the
addClause method). Is this the best approach? What have others done?
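
As a rough sketch of one possible mapping (an assumption, not an
established recipe): let OR split the clause list into groups, make every
clause inside a group required, or prohibited when it follows NOT, and
treat each group as a nested query that is optional in the outer one. The
plain-Java example below uses strings in place of Query objects;
ClauseCombiner, Connector and combine are hypothetical names. With real
queries, each "+" or "-" would correspond to the required and prohibited
flags passed when adding a clause to a BooleanQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClauseCombiner {
    enum Connector { AND, OR, NOT }

    /**
     * Combines N clauses joined by N-1 connectors. OR starts a new group;
     * inside a group a clause is required ("+"), or prohibited ("-") when it
     * follows NOT. Several groups are printed as parenthesised sub-queries.
     */
    static String combine(List<String> clauses, List<Connector> connectors) {
        if (clauses.isEmpty() || connectors.size() != clauses.size() - 1) {
            throw new IllegalArgumentException("need N clauses and N-1 connectors");
        }
        List<String> groups = new ArrayList<>();
        StringBuilder group = new StringBuilder("+" + clauses.get(0));
        for (int i = 0; i < connectors.size(); i++) {
            String next = clauses.get(i + 1);
            switch (connectors.get(i)) {
                case OR:
                    // Close the current group and start a new one.
                    groups.add(group.toString());
                    group = new StringBuilder("+" + next);
                    break;
                case AND:
                    group.append(" +").append(next);
                    break;
                case NOT:
                    group.append(" -").append(next);
                    break;
            }
        }
        groups.add(group.toString());
        if (groups.size() == 1) {
            return groups.get(0);
        }
        StringBuilder out = new StringBuilder();
        for (String g : groups) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append('(').append(g).append(')');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(combine(
                Arrays.asList("apple", "orange", "pear", "grape"),
                Arrays.asList(Connector.AND, Connector.OR, Connector.AND)));
    }
}
```

This gives AND and NOT a higher precedence than OR, which is a design
choice; a UI that lets the user group clauses explicitly would need a tree
instead of a flat list.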

 

Thanks,

 Ben



Re: too many files open error

2004-03-26 Thread Ben Litchfield

As PDFBox is an all Java solution there is no specific linux/unix version.
The source that is available with the downloaded package should suit your
needs.  What does the sourceforge site not provide for you?

Ben




On Fri, 26 Mar 2004, Charlie Smith wrote:

> Is there another source for the pdfbox than the sourceforge link from
> pdfbox.org?
>
> I'd like to get the linux/unix version, and wonder if the source there is ok to
> use?
> Couldn't this be made available to jakarta, or maybe it has?
>
>
> >> Otis wrote on 3/24/04
> >>Subject:Re: analyzer for word perfect?
> >
> >I just finished writing a chapter for Lucene in Action that deals with
> >that.
>
> >PDF: pdfbox.org
> >MS Word/Excel: jakarta.apache.org/poi
> >WP: http://www.google.com/search?q=java+word+perfect+parser
>
> >Note that what you need are parsers.  The term Analyzer has a special
> >meaning in Lucene realm.
>
> >Otis
>
>
> >--- Charlie Smith  wrote:
> >> Is there an analyzer for WordPerfect files?
> >>
> >> I have a need to be able to index WP files as well as MS files, pdfs,
> >> etc.
> >>
> >>
>
>
>



Re: Problem while Indexing Pdf files

2004-03-25 Thread Ben Litchfield

The latest release of PDFBox changed the way it dealt with fonts and
introduced this bug, please try the version in CVS and let me know if you
are still having a problem.

Ben


On Thu, 25 Mar 2004, Ankur Goel wrote:

>
> Hi,
>
> I have to index PDF files. For that I am using pdfbox. But when I try to
> extract text from pdf file using pdfbox I get the following error:
>
> java.io.IOException: Error: No 'ToUnicode' and no 'Encoding' for Font
>
>   at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:347)
>
>   at
> org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:169)
>
>   at
> org.pdfbox.util.PDFTextStripper.showString(PDFTextStripper.java:461)
>
>   at
> org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:692)
>
>   at
> org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:128)
>
>   at
> org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:268)
>
>   at
> org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:200)
>
>   at
> org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:172)
>
>   at
> org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:120)
>
>   at org.pdfbox.ExtractText.main(ExtractText.java:213)
>
>   at test.LuceneExampleIndexer.indexFile(LuceneExampleIndexer.java:67)
>
>   at
> test.LuceneExampleIndexer.indexDirectory(LuceneExampleIndexer.java:47)
>
>   at test.LuceneExampleIndexer.index(LuceneExampleIndexer.java:30)
>
>   at test.LuceneExampleIndexer.main(LuceneExampleIndexer.java:118)
>
>
> Please tell me how to go about it.
>
> Thanks,
> Ankur
>
>



Re: Indexing japanese PDF documents

2004-03-22 Thread Ben Litchfield

Yes he did, but I was away the past couple days.  As this is more of a
PDFBox issue I responded in the PDFBox forums, please follow the thread
there if you are interested.

Ben



On Mon, 22 Mar 2004, Otis Gospodnetic wrote:

> I have not tried these other tools yet.
> Have you asked Ben Litchfield, the PDFBox author, about handling of
> Japanese text?
>
> Otis
>
> --- Chandan Tamrakar <[EMAIL PROTECTED]> wrote:
> > I am using the latest PDFBox library for parsing. I can parse English
> > documents successfully, but when I parse a document containing English
> > and Japanese I do not get the results I expected.
> >
> > Has anyone tried using the PDFBox library for parsing Japanese
> > documents? Or do I need to use another parser like xPDF or Jpedal?
> >
> > Thanks in advance
> > Chandan
> >
> >
> >



RE: use Lucene LOCAL (looking for a frontend)

2004-01-28 Thread Ben Keeping
For an "out of the box" job, I found searchblox pretty impressive, and easy to install.

-Original Message-
From: Sebastian Fey [mailto:[EMAIL PROTECTED]
Sent: 28 January 2004 14:23
To: Lucene Users List
Subject: AW: use Lucene LOCAL (looking for a frontend)


>Not being funny, but if you have no experience in Java, then why are you using a Java 
>API >for index building/text searching ?

I'm just testing some possibilities.
Though I can't write a Java application, I can read one and, if someone gives me
something to start with, I'm sure I'll make it. If Lucene seems to be the best solution,
I'll spend some time learning something about Java.





This e-mail and any attachments may be confidential and/or legally
privileged. If you have received this e-mail and you are not a named
addressee, please inform Landmark Information Group on 01392 441700
and then delete the e-mail from your system. If you are not a named
addressee you must not use, disclose, distribute, copy, print or rely 
on this e-mail. This email and any attachments have been scanned for
viruses and to the best of our knowledge are clean. To ensure 
regulatory compliance and for the protection of our clients and 
business, we may monitor and read e-mails sent to and from our 
servers.





RE: use Lucene LOCAL (looking for a frontend)

2004-01-28 Thread Ben Keeping
Not being funny, but if you have no experience in Java, then why are you using a Java 
API for index building/text searching ?

-Original Message-
From: Sebastian Fey [mailto:[EMAIL PROTECTED]
Sent: 28 January 2004 14:01
To: Lucene Users List
Subject: RE: use Lucene LOCAL (looking for a frontend)


>To index local files leverage some of the 
>code I have put in my java.net articles, or use the Ant  task 
>that resides in the sandbox repository, or write your own. 

I'm satisfied with the index I have for now, but later on I'll take a look ...

>How you present the search results will be up to you and the needs of your 
>project.

I have NO experience with Java.
It would be nice to see an example of a web interface that uses Lucene, to have
something to start with.

thx,

Sebastian





RE: SearchBlox J2EE Search Component Version 1.1 released

2003-12-02 Thread Ben Keeping

I am seriously impressed with that - very smooth looking, and easy to use 

 it's a shame it's quite pricey ...

-Original Message-
From: Tate Avery [mailto:[EMAIL PROTECTED]
Sent: 02 December 2003 15:45
To: Lucene Users List
Subject: RE: SearchBlox J2EE Search Component Version 1.1 released



If you buy it, apparently:
http://www.searchblox.com/buy.html



-Original Message-
From: Tun Lin [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 02, 2003 10:43 AM
To: 'Lucene Users List'; [EMAIL PROTECTED]
Subject: RE: SearchBlox J2EE Search Component Version 1.1 released


Hi,

Just a feedback.

SearchBlox can only search for html files. Will Searchblox support pdf, xml and
word documents in future? It will be perfect if it can support all document
types mentioned above.

-Original Message-
From: Robert Selvaraj [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 02, 2003 10:42 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: SearchBlox J2EE Search Component Version 1.1 released

SearchBlox is a J2EE search component that enables you to add search
functionality to your applications, intranets or portals in a matter of minutes.
SearchBlox uses Lucene Search API and features integrated HTTP and File System
crawlers, support for different document formats, support for indexing and
searching content in 15 languages and customizable search results, all
controlled from a browser-based Admin Console.


Main features in this update:
=
- Asian language support. SearchBlox now supports Japanese, Chinese Simplified,
Chinese Traditional and Korean language content.
- Performance enhancements to search
- Improved Hit Highlighting

SearchBlox is available as a Web Archive (WAR) and is deployable on any Servlet
2.3/JSP 1.2 compliant server. SearchBlox Getting-Started Guides are available
for the following servers:

JBoss - http://www.searchblox.com/gettingstarted_jboss.html
Jetty - http://www.searchblox.com/gettingstarted_jetty.html
JRun - http://www.searchblox.com/gettingstarted_jrun.html
Pramati - http://www.searchblox.com/gettingstarted_pramati.html
Resin - http://www.searchblox.com/gettingstarted_resin.html
Tomcat - http://www.searchblox.com/gettingstarted_tomcat.html
Weblogic - http://www.searchblox.com/gettingstarted_weblogic.html
Websphere - http://www.searchblox.com/gettingstarted_websphere.html


The SearchBlox FREE Edition is available free of charge and can index up to 1000
HTML documents.

The software can be downloaded from http://www.searchblox.com






RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Ben Litchfield

Logging uses log4j and can be configured.  If you are having issues with
specific PDFs then you can post a bug on the sourceforge site or mail me
the PDFs directly and I will look at them.

Ben
http://www.pdfbox.org


On Tue, 25 Nov 2003, Zhou, Oliver wrote:

> I do have other problems with PDFBox-0.6.4.  For one, it has annoying debug
> information at very low level parsing process.  The other, I got infinite
> loop while indexing pdf files although they say the infinite loop bug has
> been fixed in their release notes.  Anybody knows what's going on?
>
> Thanks,
> Oliver
>
>
>
> -Original Message-
> From: Ben Litchfield [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 25, 2003 9:45 AM
> To: Lucene Users List
> Subject: RE: Lucene refresh index function (incremental indexing).
>
>
>
> Yes, just add the log4j configuration.  The easiest way to do that is as a
> system parameter like this
>
> java -Dlog4j.configuration=log4j.xml org.apache.lucene.demo.IndexHTML
> -create -index c:\\index ..
>
> Where log4j.xml is the path to your log4j config, PDFBox has an example
> one you can use.
>
> Ben
> http://www.pdfbox.org
>
> On Tue, 25 Nov 2003, Zhou, Oliver wrote:
>
> > Lucene doesn't have a PDF parser.  In order to index PDF files you have to
> > add one yourself.  PDFBox is a good choice.  You may just ignore the
> > warning for log4j or you can add log4j to your classpath.
> >
> > Oliver
> >
> >
> > -Original Message-
> > From: Tun Lin [mailto:[EMAIL PROTECTED]
> > Sent: Monday, November 24, 2003 10:07 PM
> > To: 'Lucene Users List'
> > Subject: RE: Lucene refresh index function (incremental indexing).
> >
> >
> > Does it support indexing the contents of pdf files? I have found one
> > project called PDFBox that can be integrated with Lucene to search inside
> > of the pdf files. Currently, Lucene can only search for the pdf filename.
> > I tried with PDFBox and I got the following message when I typed the
> > command: java org.apache.lucene.demo.IndexHTML -create -index c:\\index ..
> >
> > log4j:WARN No appenders could be found for logger
> > (org.pdfbox.pdfparser.PDFParser).
> > log4j:WARN Please initialize the log4j system properly.
> >
> > Can anyone advise?
> >
> > -Original Message-
> > From: Doug Cutting [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, November 25, 2003 5:01 AM
> > To: Lucene Users List
> > Subject: Re: Lucene refresh index function (incremental indexing).
> >
> > Tun Lin wrote:
> > > These are the steps I took:
> > >
> > > 1) I compile all the files in a particular directory using the command:
> > > java org.apache.lucene.demo.IndexHTML -create -index c:\\index ..
> > > , putting all the indexed files in c:\\index.
> > > 2) Everytime, I added an additional file in that directory. I need to
> > > reindex/recompile that directory to generate the indexes again. As the
> > > directory gets larger, the indexing takes a longer time.
> > >
> > > My question is how do I generate the indexes automatically everytime a
> > > new document is added in that directory without me recompiling everytime
> > > manually?
> >
> > To update, try removing the '-create' from the command line.  The demo
> > code supports incremental updates.  It will re-scan the directory and
> > figure out which files have changed, what new files have appeared and
> > which previously existing files have been removed.
> >
> > Doug
> >
> >



RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Ben Litchfield

Yes, just add the log4j configuration.  The easiest way to do that is as a
system parameter like this

java -Dlog4j.configuration=log4j.xml org.apache.lucene.demo.IndexHTML
-create -index c:\\index ..

Where log4j.xml is the path to your log4j config, PDFBox has an example
one you can use.

Ben
http://www.pdfbox.org

On Tue, 25 Nov 2003, Zhou, Oliver wrote:

> Lucene doesn't have a PDF parser.  In order to index PDF files you have to add
> one yourself.  PDFBox is a good choice.  You may just ignore the warning
> for log4j or you can add log4j to your classpath.
>
> Oliver
>
>
> -Original Message-
> From: Tun Lin [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 24, 2003 10:07 PM
> To: 'Lucene Users List'
> Subject: RE: Lucene refresh index function (incremental indexing).
>
>
> Does it support indexing the contents of pdf files? I have found one project
> called PDFBox that can be integrated with Lucene to search inside of the pdf
> files. Currently, Lucene can only search for the pdf filename. I tried with
> PDFBox and I got the following message when I typed the command: java
> org.apache.lucene.demo.IndexHTML -create -index c:\\index ..
>
> log4j:WARN No appenders could be found for logger
> (org.pdfbox.pdfparser.PDFParser).
> log4j:WARN Please initialize the log4j system properly.
>
> Can anyone advise?
>
> -Original Message-
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 25, 2003 5:01 AM
> To: Lucene Users List
> Subject: Re: Lucene refresh index function (incremental indexing).
>
> Tun Lin wrote:
> > These are the steps I took:
> >
> > 1) I compile all the files in a particular directory using the command:
> > java org.apache.lucene.demo.IndexHTML -create -index c:\\index ..
> > , putting all the indexed files in c:\\index.
> > 2) Everytime, I added an additional file in that directory. I need to
> > reindex/recompile that directory to generate the indexes again. As the
> > directory gets larger, the indexing takes a longer time.
> >
> > My question is how do I generate the indexes automatically everytime a
> > new document is added in that directory without me recompiling everytime
> > manually?
>
> To update, try removing the '-create' from the command line.  The demo code
> supports incremental updates.  It will re-scan the directory and figure out
> which files have changed, what new files have appeared and which previously
> existing files have been removed.
>
> Doug
>
>



Re: Missing pdf document title

2003-11-10 Thread Ben Litchfield

I would try two things.

1)Is PDFBox getting the title from the document?
You can run this example to find out

java org.pdfbox.examples.pdmodel.PrintDocumentMetaData 

2)Is the lucene field getting properly set in the lucene database.  I
would use luke(http://www.getopt.org/luke/) to verify that lucene is
getting the field.

Other than that I would double check your code that gets the "Title" field
correctly.

Ben

On Mon, 10 Nov 2003, Zhou, Oliver wrote:

> Hi,
>
> I'm using lucene demo IndexHTML.java with pdfbox-0.6.4 to index pdf files.
> It created the index files.  However, the pdf document title was empty when
> I did search.  Any idea on why?
>
> Thanks
> Oliver
>
>
>
>
>
>



Re: Exotic format indexing?

2003-10-30 Thread Ben Litchfield
Unfortunately, it is not quite so easy.  I am not sure about Word
documents but PDFs usually have their contents compressed so a raw
"fishing" around for text would be pointless.  Your best bet is to use a
package like the one from textmining.org that handles various formats for
you.
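
A small illustration of why fishing fails on compressed content, as a
sketch in plain Java. It assumes Deflate compression, which is what PDF's
FlateDecode filter uses; the class name CompressedTextDemo and its helper
methods are made up for this example, and the sample sentence is arbitrary.
The compressed bytes are searched for a word that is plainly present in the
original text.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressedTextDemo {
    /** Compresses the input with Deflate (zlib format). */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Reverses deflate(); invalid input is reported as an unchecked exception. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input: stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid Deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** "Fishing": treat the bytes as single-byte characters and look for the word. */
    static boolean containsText(byte[] data, String needle) {
        return new String(data, StandardCharsets.ISO_8859_1).contains(needle);
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            text.append("Lucene indexes the text of each document. ");
        }
        byte[] packed = deflate(text.toString().getBytes(StandardCharsets.ISO_8859_1));
        System.out.println("found in compressed bytes:   " + containsText(packed, "Lucene"));
        System.out.println("found after decompressing:   " + containsText(inflate(packed), "Lucene"));
    }
}
```

The word only becomes visible again once the stream has been decompressed,
which is the job a format-aware parser does before any text can be indexed.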

Ben


On Thu, 30 Oct 2003, petite_abeille wrote:

> Hello,
>
> Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a
> popular question on this list...
>
> The traditional approach seems to be to try to find some kind of format
> specific reader to properly extract the textual part of such documents
> for indexing. The drawback of such an approach is that it's complicated
> and cumbersome: many different formats, not that many Java libraries to
> understand them all.
>
> An alternative to such a mess could be perhaps to convert those
> multitude of formats into something more or less standard and then
> extract the text from that. But again, this doesn't seem to be such a
> straightforward proposition. For example, one could imagine "printing"
> every document to PDF and then convert the resulting PDF to text. Not a
> piece of cake in Java.
>
> Finally, a while back, somebody on this list mentioned quite a
> different approach: simply read the raw binary document and go fishing
> for what looks like text. I would like to try that :)
>
> Does anyone remember this proposal? Has anyone tried such an approach?
>
> Thanks for any pointers.
>
> Cheers,
>
> PA.
>
>



Re: Does the Lucene search engine work with PDF's?

2003-10-17 Thread Ben Litchfield


You need to be able to extract the text from them and feed that to lucene.
http://www.pdfbox.org can extract text from pdf documents.

Ben


On Fri, 17 Oct 2003, Andre Hughes wrote:

> Hello,
> Can the Lucene search engine index and search through PDF documents?
> What are the file format limits for the Lucene search engine?
>
> Thanks in Advance,
>
> Andre'
>




Re: Lucene demo ideas?

2003-09-17 Thread Ben Litchfield

> - Index text and HTML files.  Any others?


What, no PDF files!!

Ben

--
http://www.pdfbox.org




Re: Question on Lucene when indexing big pdf files

2003-08-20 Thread Ben Litchfield


> "cisco". I use Luke and my searcher program as the searching client,
> it seems no problem. Can anyone help me? Or any comments on this

When you use luke to look at your index does it show the correct contents
for those documents?

Ben




Re: about PDF / HTML index

2003-07-16 Thread Ben Litchfield

PDFBox comes with the class
org.pdfbox.searchengine.lucene.LucenePDFDocument which shows how to
parse/index a pdf document.

Ben


On Tue, 15 Jul 2003, alvaro z wrote:

>
> I'm using Lucene with TXT and HTML files, and it's working.
>
> The only problem with HTML files is that I have to index them as TXT first,
> before indexing them as HTML.
>
> Has anyone tried to index PDF files?
>
> I'm trying PDFBox. Are there any samples for indexing PDF files with any of
> the parsers (pdfbox, jpedal, etc.)? I can't find any.
>
> thanks for helping,
>
> Alvaro. from Lima - Peru
>
>



RE: out of memory

2003-04-02 Thread Ben Litchfield
It is possible that it is one single PDF that is having an issue.  Can you
track it down to that one and let me know which it is.  It would be very
helpful if you could send it to me as well.

Ben
http://www.pdfbox.org




On Wed, 2 Apr 2003, Eoghan S wrote:

> i have tried every memory setting using the -X options, up as far as
> 512M actually, no effect. i also tried increasing the thread stack in
> case this could have caused it, still no difference.
>
> thanks all the same
>
>
> On Wed, 2003-04-02 at 20:44, Lichtner, Guglielmo wrote:
>
> OutOfMemory errors sometimes are not errors. You may need to use -mx to
> reset the maximum memory allocated to the jvm.
>
> -Original Message-
> From: Eoghan S [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 02, 2003 2:23 PM
> To: [EMAIL PROTECTED]
> Subject: out of memory
>
>
> hi!
> i am using lucene1.2 in a file sharing system, my average file amount
> is about 400 totalling about 50megs (small), when run on linux it is
> fine using jdk1.4.1, however using jdk1.4.1 on windows i get an outof
> memory error. i am using pdfbox 0.6.1, i have also tried 0.5.6, however
> same problem. i am not sure where the problem lies,whether pdfbox or
> lucene or something in jdk, but was wondering if anyone else had the
> same experience.. or a solution
> thanks
>
> --
> Eoghans Fortune For Wed Apr 2 17:43:01 IST 2003
> All the world's a stage and most of us are desperately unrehearsed.
>   -- Sean O'Casey
>
>




Re: getting PDFBox O/P into a stream

2003-03-25 Thread Ben Litchfield

I am not sure what you mean by O/P.  You can call into the
org.pdfbox.searchengine.lucene.LucenePDFDocument to create a Lucene
Document, which then can be added to the index.  PDFBox also comes with a
version of the IndexFiles that is basically the same as the demo one from
lucene.  This class can be called from the command line to create an
index.

Ben Litchfield


-- 

On Tue, 25 Mar 2003, Ramrakhiani, Vikas wrote:

> Can some one please help me with the command to get O/P from PDFBox on
> command line or into streams rather that dumping it into a text file.
>
> thanks,
> vikas.
>
>





Re: [ANN] PDFBox 0.6.0

2003-03-09 Thread Ben Litchfield

I believe this problem has been fixed with 0.6.1.  Please give it a try.

Ben Litchfield

-- 

On Thu, 6 Mar 2003, Eric Anderson wrote:

> When it throws the exception, the indexer fails, so I cannot continue the index.
>
> It appears that it's only related to some files, as I have been able to remove
> some of the files, and it will continue past that point, but if it encounters
> one of these files, the index fails.
>
> Eric Anderson
> LanRx Network Solutions
> 815-505-6132
>
>
> Quoting Ben Litchfield <[EMAIL PROTECTED]>:
>
> > In this release I have changed how I parsed the document, which may have
> > introduced this bug.  I have received another report of this and will have
> > it fixed for the next point release.
> >
> > You said you tried with reasonably sized PDF repository.  Did you stop
> > indexing at this error or did you continue?  If you continued, is this the
> > only error that you got?
> >
> > -Ben
> >
> >
> >
> >
> > --
> >
> > On Thu, 6 Mar 2003, Eric Anderson wrote:
> >
> > > Ben-
> > > In attempting to use the PDFBox-0.6.0, I rec'd the following error when
> > > attempting to scan a reasonably sized PDF repository.
> > >
> > > Any thoughts?
> > >
> > >
> > >  caught a class java.io.EOFException
> > >  with message: Unexpected end of ZLIB input stream
> > >
> > >
> > > Eric Anderson
> > > LanRx Network Solutions
> > >
> > >
> > > Quoting Ben Litchfield <[EMAIL PROTECTED]>:
> > >
> > > > I would like to announce the next release of PDFBox.  PDFBox allows for
> > > > PDF documents to be indexed using lucene through a simple interface.
> > > > Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument,
> > > > which will extract all text and PDF document summary properties as lucene
> > > > fields.
> > > >
> > > > You can obtain the latest release from http://www.pdfbox.org
> > > >
> > > > Please send all bug reports to me and attach the PDF document when
> > > > possible.
> > > >
> > > > RELEASE 0.6.0
> > > > -Massive improvements to memory footprint.
> > > > -Must call close() on the COSDocument(LucenePDFDocument does this for you)
> > > > -Really fixed the bug where small documents were not being indexed.
> > > > -Fixed bug where no whitespace existed between obj and start of object.
> > > > Exception in thread "main" java.io.IOException: expected='obj'
> > > > actual='obj<
> > > > -Fixed issue with spacing where textLineMatrix was not being copied
> > > >  properly
> > > > -Fixed 'bug' where parsing would fail with some pdfs with double endobj
> > > >  definitions
> > > > -Added PDF document summary fields to the lucene document
> > > >
> > > >
> > > > Thank you,
> > > > Ben Litchfield
> > > > http://www.pdfbox.org
> > > >
> > > >
> > > >
> > > >
> > >
> > > LanRx Network Solutions, Inc.
> > > Providing Enterprise Level Solutions...On A Small Business Budget
> > >
> > >
> >
> >
> >
>
> LanRx Network Solutions, Inc.
> Providing Enterprise Level Solutions...On A Small Business Budget
>
>





Re: [ANN] PDFBox 0.6.0

2003-03-06 Thread Ben Litchfield
In this release I have changed how I parsed the document, which may have
introduced this bug.  I have received another report of this and will have
it fixed for the next point release.

You said you tried with reasonably sized PDF repository.  Did you stop
indexing at this error or did you continue?  If you continued, is this the
only error that you got?

-Ben




-- 

On Thu, 6 Mar 2003, Eric Anderson wrote:

> Ben-
> In attempting to use the PDFBox-0.6.0, I rec'd the following error when
> attempting to scan a reasonably sized PDF repository.
>
> Any thoughts?
>
>
>  caught a class java.io.EOFException
>  with message: Unexpected end of ZLIB input stream
>
>
> Eric Anderson
> LanRx Network Solutions
>
>
> Quoting Ben Litchfield <[EMAIL PROTECTED]>:
>
> > I would like to announce the next release of PDFBox.  PDFBox allows for
> > PDF documents to be indexed using lucene through a simple interface.
> > Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument,
> > which will extract all text and PDF document summary properties as lucene
> > fields.
> >
> > You can obtain the latest release from http://www.pdfbox.org
> >
> > Please send all bug reports to me and attach the PDF document when
> > possible.
> >
> > RELEASE 0.6.0
> > -Massive improvements to memory footprint.
> > -Must call close() on the COSDocument(LucenePDFDocument does this for you)
> > -Really fixed the bug where small documents were not being indexed.
> > -Fixed bug where no whitespace existed between obj and start of object.
> > Exception in thread "main" java.io.IOException: expected='obj'
> > actual='obj<
> > -Fixed issue with spacing where textLineMatrix was not being copied
> >  properly
> > -Fixed 'bug' where parsing would fail with some pdfs with double endobj
> >  definitions
> > -Added PDF document summary fields to the lucene document
> >
> >
> > Thank you,
> > Ben Litchfield
> > http://www.pdfbox.org
> >
> >
> >
> >
>
> LanRx Network Solutions, Inc.
> Providing Enterprise Level Solutions...On A Small Business Budget
>
>





[ANN] PDFBox 0.6.0

2003-03-05 Thread Ben Litchfield
I would like to announce the next release of PDFBox.  PDFBox allows for
PDF documents to be indexed using lucene through a simple interface.
Please take a look at org.pdfbox.searchengine.lucene.LucenePDFDocument,
which will extract all text and PDF document summary properties as lucene
fields.

You can obtain the latest release from http://www.pdfbox.org

Please send all bug reports to me and attach the PDF document when
possible.

RELEASE 0.6.0
-Massive improvements to memory footprint.
-Must call close() on the COSDocument(LucenePDFDocument does this for you)
-Really fixed the bug where small documents were not being indexed.
-Fixed bug where no whitespace existed between obj and start of object.
Exception in thread "main" java.io.IOException: expected='obj'
actual='obj

RE: OutOfMemoryException while Indexing an XML file/PdfParser

2003-02-18 Thread Ben Litchfield

I am aware of the issues with parsing certain PDF documents.  I am
currently working on refactoring PDFBox to deal with large documents.  You
will see this in the next release.  I would like to thank people for
feedback and sending problem documents.

Ben Litchfield
http://www.pdfbox.org


On Tue, 18 Feb 2003, Pinky Iyer wrote:

>
> I am having a similar problem, but when indexing PDF documents using the pdfbox
> parser (available at www.pdfbox.org). I get an exception saying
> 'Exception in thread "main" java.lang.OutOfMemoryError'. Has anybody
> implemented the above code? Any help appreciated.
> Thanks!
> PI
> Rob Outar <[EMAIL PROTECTED]> wrote:
> We are aware of DOM limitations/memory problems, but I am using SAX to parse
> the file and index elements and attributes in my content handler.
>
> Thanks,
>
> Rob
>
> -Original Message-
> From: Tatu Saloranta [mailto:[EMAIL PROTECTED]]
> Sent: Friday, February 14, 2003 8:18 PM
> To: Lucene Users List
> Subject: Re: OutOfMemoryException while Indexing an XML file
>
>
> On Friday 14 February 2003 07:27, Aaron Galea wrote:
> > I had this problem when using xerces to parse xml documents. The problem I
> > think lies in the Java garbage collector. The way I solved it was to
> create
>
> It's unlikely that GC is the culprit. Current ones are good at purging
> objects
> that are unreachable, and only throw OutOfMem exception when they really
> have
> no other choice.
> Usually it's the app that has some dangling references to objects that
> prevent
> GC from collecting objects not useful any more.
>
> However, it's good to note that Xerces (and DOM parsers in general)
> generally
> use more memory than the input XML files they process; this because they
> usually have to keep the whole document struct in memory, and there is
> overhead on top of text segments. So it's likely to be at least 2 * input
> file size (files usually use UTF-8 which most of the time uses 1 byte per
> char; in memory 16-bit unicode-2 chars are used for performance), plus some
> additional overhead for storing element structure information and all that.
>
> And since default max. java heap size is 64 megs, big XML files can cause
> problems.
>
> More likely however is that references to already processed DOM trees are
> not
> nulled in a loop or something like that? Especially if doing one JVM process
> for item solves the problem.
>
> > a shell script that invokes a java program for each xml file that adds it
> > to the index.
>
> -+ Tatu +-
>
>

-- 






Re: PDF Text extraction

2002-12-27 Thread Ben Litchfield

You need to do something like

//first get the document field
Field contentsField = doc.getField( "contents" );

//Then get the reader from the field
BufferedReader contentsReader =
    new BufferedReader( contentsField.readerValue() );

//finally dump the contents of the reader to System.out
String line = null;
while( (line = contentsReader.readLine() ) != null )
{
    System.out.println( line );
}

I have not tested if this compiles but it should be pretty close.
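
The reader-draining part can be tried on its own, without Lucene or PDFBox on
the classpath.  This is a minimal sketch assuming a modern JDK.  ReaderDump
and readAll are made-up names, and the StringReader in main only stands in for
the Reader that the field would return.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: collects everything a Reader produces into a String. */
final class ReaderDump {

    /** Reads the reader line by line, ending each line with '\n', then closes it. */
    static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(reader)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in input; in the real program the Reader would come from the field.
        System.out.print(readAll(new StringReader("first line\nsecond line")));
    }
}
```

Note that a Reader can only be consumed once, so the text should be kept if it
is needed again later.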

Ben Litchfield


On Fri, 27 Dec 2002, Suhas Indra wrote:

> Hello List
>
> I am using PDFBox to index some of the PDF documents. The parser works fine
> and I can read the summary. But the contents are displayed as
> java.io.InputStream.
>
> When I try the following:
> System.out.println(doc.getField("contents")) (where doc is the Document
> object)
>
> The result will be:
>
> Text
>
> I want to print the extracted data.
>
> Can anyone please let me know how to extract the contents?
>
> Regards
>
> Suhas
>
>
>
> --
> Robosoft Technologies - Partners in Product Development
>

-- 






PDFBox 0.5.6

2002-11-28 Thread Ben Litchfield

PDFBox version 0.5.6 is now available at http://www.pdfbox.org

PDFBox makes it easy to add PDF Documents to a lucene index.

Fixes over the last version

-Fixed bug in LucenePDFDocument where stream was not being closed and
small documents were not being indexed.
-Fixed a spacing issue for some PDF documents.
-Fixed error while parsing the version number
-Fixed NullPointer in persistence example.
-Create example lucene IndexFiles class which models the demo from lucene.
-Fixed bug where garbage at the end of file caused an infinite loop
-Fixed bug in parsing boolean values with stuff at the end like "true>>"


Ben Litchfield







IOException not a directory

2002-10-28 Thread Ben Litchfield

Has anybody seen this type of error before?  This used to work and all of
a sudden broke.  That path is a folder.

Ben Litchfield



2002-10-28 12:51:31,109 [Default] java.io.IOException:
\\Finsrv04\JBoss-2.4.1_Tomcat-3.2.3\fast_generated_output\lucene\website\index
not a directory
2002-10-28 12:51:31,109 [Default]   at
org.apache.lucene.store.FSDirectory.<init>(Unknown Source)
2002-10-28 12:51:31,109 [Default]
2002-10-28 12:51:31,109 [Default]   at
org.apache.lucene.store.FSDirectory.getDirectory(Unknown Source)
2002-10-28 12:51:31,109 [Default]
2002-10-28 12:51:31,109 [Default]   at
org.apache.lucene.store.FSDirectory.getDirectory(Unknown Source)
2002-10-28 12:51:31,109 [Default]
2002-10-28 12:51:31,109 [Default]   at
org.apache.lucene.index.IndexReader.open(Unknown Source)
2002-10-28 12:51:31,109 [Default]
2002-10-28 12:51:31,109 [Default]   at
_0002fwebsite_0002dresults_0002ejspwebsite_0002dresults_jsp_1._jspService(_0002fwebsite_0002dresults_0002ejspwebsite_0002dresults_jsp_1.java:98)
2002-10-28 12:51:31,109 [Default]
2002-10-28 12:51:31,109 [Default]   at
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:119)
2002-10-28 12:51:31,109 [Default]
2002-10-28 12:51:31,109 [Default]   at
javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
2002-10-28 12:51:31,109 [Default]
2002-10-28 12:51:31,109 [Default]   at
org.apache.jasper.servlet.JspServlet$JspCountedServlet.service(JspServlet.java:130)
2002-10-28 12:51:31,109 [Default]
2002-10-28 12:51:31,109 [Default]   at
javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
2002-10-28 12:51:31,109 [Default]
2002-10-28 12:51:31,109 [Default]   at
org.apache.jasper.servlet.JspServlet$JspServletWrapper.service(JspServlet.java:282)
2002-10-28 12:51:31,109 [Default]
2002-10-28 12:51:31,109 [Default]   at
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:429)
2002-10-28 12:51:31,109 [Default]
2002-10-28 12:51:31,109 [Default]   at
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:500)
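
The message says the path did not pass as a directory at the moment the index
was opened.  The same kind of check can be run by hand with plain
java.io.File, ideally on the same machine and under the same account as the
servlet container, since a network share can look different from there.  This
is a minimal sketch, not Lucene's own code.  PathCheck and describe are
made-up names.

```java
import java.io.File;

/** Hypothetical diagnostic: reports how the JVM sees a given path. */
final class PathCheck {

    /** Classifies the path as missing, not a directory, or a (un)readable directory. */
    static String describe(File path) {
        if (!path.exists()) {
            return "missing";
        }
        if (!path.isDirectory()) {
            return "not a directory";
        }
        return path.canRead() ? "readable directory" : "unreadable directory";
    }

    public static void main(String[] args) {
        // Pass the index path (or several paths) on the command line.
        for (String arg : args) {
            System.out.println(arg + ": " + describe(new File(arg)));
        }
    }
}
```

If this reports "missing" for a path that exists when viewed from a desktop
session, the account running the container may not be able to reach the share.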






Re: pdfbox on solaris

2002-08-28 Thread Ben Litchfield


I know that there are some memory issues with some documents.  The next
release of pdfbox fixes some of these, although I am not sure why it
would run differently under Windows than Solaris.  Off the top of my head,
maybe the Solaris JVM uses more memory per object than the Windows JVM.
The easiest workaround is to increase the maximum heap size (mhs) of the
JVM using the -Xmx option.


Example:

java -Xmx128m 

The default mhs of Java has been 64m since JDK 1.2, so maybe try 128m or 256m.
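
To confirm that the option actually took effect, the JVM can be asked for its
own limit through Runtime.maxMemory().  This is a minimal sketch.  HeapInfo
and its method names are made up for illustration.

```java
/** Hypothetical helper: reports the maximum heap the running JVM will use. */
final class HeapInfo {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Maximum heap size of this JVM, in megabytes, as reported by the runtime. */
    static long maxHeapMegabytes() {
        return toMegabytes(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        System.out.println("max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Started with -Xmx128m, this should report a figure close to 128.  The exact
number can come out slightly lower depending on the JVM.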

-Ben

http://www.pdfbox.org



On Wed, 28 Aug 2002, Deenesh wrote:

> Hi,
> I am using pdfbox on Solaris 8 and am trying to index a PDF file which is around
> 1 MB.
>
> I am getting a java.lang.OutOfMemoryError,
> though the same code works fine under Windows.
>
> Has anyone had the same problem? Any suggestions?
>
> Thanks
> Deenesh
>

-- 






Re: problems with HTML Parser

2002-08-14 Thread Ben Litchfield

Maurits,

You can get a PDF parser from http://www.pdfbox.org

-Ben


On Wed, 14 Aug 2002, Maurits van Wijland wrote:

> Keith,
>
> I haven't noticed the problem with the Parser...but you trigger me
> by saying that you have a PDFParser!!!
>
> Are you able to contribute this PDFParser??
>
> Maurits.






Re: PDF Text Stripper

2002-07-09 Thread Ben Litchfield

Can you send me the PDF document that you are having problems with and I
will look into it.

There are still some issues that I am working out with the spacing of
characters.
-Ben



On Tue, 9 Jul 2002, Keith Gunn wrote:

> On Tue, 9 Jul 2002, Ben Litchfield wrote:
>
> > Hi,
> >
> > I have written a PDF library that can be used to strip text from PDF
> > documents.  It is released under LGPL so have fun.
> >
> > There is one class which can be used to easily index PDF documents.
> > pdfparser.searchengine.lucene.LucenePDFDocument  has a getDocument
> > method which will take a PDF file and return a Lucene Document which you
> > can add to an index.
> >
> > If you would like to see the quality of the text extraction you can run
> > pdfparser.Main from the command line which will take a PDF document and
> > write a txt file.
> >
> > I am looking for any input that you might have.  Please mail me if you
> > have any bugs or feature requests.
> >
> > The library can be retrieved from
> > http://www.csh.rit.edu/~ben/projects/pdfparser/
> >
> > -Ben Litchfield
>
> hi,
>
> I downloaded the zip and quickly ran the demo on a few files, it displays
> .notdef between words and there are spaces between every letter for words,
> is there code in your dist. to remove these so that just terms remain?
>
> Keith Gunn
> University Of Aberdeen
>
>
>
>

-- 






PDF Text Stripper

2002-07-09 Thread Ben Litchfield

Hi,

I have written a PDF library that can be used to strip text from PDF
documents.  It is released under LGPL so have fun.

There is one class which can be used to easily index PDF documents.
pdfparser.searchengine.lucene.LucenePDFDocument  has a getDocument
method which will take a PDF file and return a Lucene Document which you
can add to an index.

If you would like to see the quality of the text extraction you can run
pdfparser.Main from the command line which will take a PDF document and
write a txt file.

I am looking for any input that you might have.  Please mail me if you
have any bugs or feature requests.

The library can be retrieved from
http://www.csh.rit.edu/~ben/projects/pdfparser/

-Ben Litchfield

