Hi Ankit,
you've already received good, well-informed advice.

Just a few links to other Apache projects you might find useful:

 - http://commons.apache.org/fileupload/
 - http://tika.apache.org/1.0/formats.html#Portable_Document_Format
 - http://pdfbox.apache.org/
 - http://lucene.apache.org/ (as others pointed out)

And maybe (but this is probably too much in your case):

 - http://chemistry.apache.org/
 - http://jackrabbit.apache.org/

I would keep things very simple:

 - store the PDF files in the file system
 - extract metadata (when/if available) out of PDFs using Tika and
   store it as RDF in Jena TDB
 - extract text out of PDFs using PDFBox and index it using Lucene|Solr
 - provide free text search capabilities using Lucene|Solr

It is often the case that people need to deal with metadata and content/blobs.
Storing content/blobs in the file system or a remote content store (such as
Amazon S3) is quite common.
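
To make the file system option concrete, here is a minimal sketch that uses
only the JDK. It stores each PDF under the hex SHA-256 of its bytes, so the
file name doubles as a stable identifier you can refer to from the metadata.
The class name PdfStore and the hash-based naming scheme are my assumptions
for illustration, not something any of the projects above prescribe.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: content-addressed storage of PDF bytes on disk.
class PdfStore {

    /** Hex-encoded SHA-256 of the content, used as a stable file name. */
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Writes the bytes to root/<sha256>.pdf and returns the resulting path. */
    static Path store(Path root, byte[] pdfBytes) {
        try {
            Files.createDirectories(root);
            Path target = root.resolve(sha256Hex(pdfBytes) + ".pdf");
            // Overwrites an existing file, which holds the same bytes anyway.
            Files.write(target, pdfBytes);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("pdf-store");
        // Placeholder bytes, not a valid PDF document.
        byte[] fakePdf = "%PDF-1.4 placeholder".getBytes(StandardCharsets.US_ASCII);
        System.out.println("stored at " + store(root, fakePdf));
    }
}
```

Storing the same document twice lands on the same path, so duplicates cost
nothing extra. A remote content store could use the same hash as its key.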

Jena helps you only for the metadata bit (and only if you model the metadata
in RDF).
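
As an illustration of what "metadata in RDF" can look like, the sketch below
writes one N-Triples statement per metadata value, using only the JDK.
N-Triples is one of the syntaxes Jena can read, so a file of such lines could
be loaded into a store afterwards. The class name MetadataTriples, the
example.org document URI and the choice of Dublin Core terms are my
assumptions. In a real application you would more likely build the
statements through Jena's own API than format strings by hand.

```java
// Hypothetical helper: formats metadata values as N-Triples statements.
class MetadataTriples {

    // Dublin Core terms, a common vocabulary for document metadata.
    static final String DC_TITLE = "http://purl.org/dc/terms/title";
    static final String DC_CREATOR = "http://purl.org/dc/terms/creator";

    /** Escapes the characters N-Triples forbids inside a string literal. */
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    /** Builds one statement of the form: <subject> <predicate> "value" . */
    static String literalTriple(String subjectUri, String predicateUri, String value) {
        return "<" + subjectUri + "> <" + predicateUri + "> \"" + escape(value) + "\" .";
    }

    public static void main(String[] args) {
        // Assumed URI scheme; any stable identifier for the PDF would do.
        String doc = "http://example.org/doc/report-2011";
        System.out.println(literalTriple(doc, DC_TITLE, "Annual \"Report\""));
        System.out.println(literalTriple(doc, DC_CREATOR, "Ankit"));
    }
}
```

The subject URI is the link between the two worlds: it identifies the PDF
stored elsewhere, while the triples carry what you know about it.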

My 2 cents,
Paolo

Ankit Verma wrote:
> Hi all,
> 
>            Can we persist any other document like .pdf rather than .rdf file 
> using jena .
> Thanks in advance for the reply.
> 
> 
> Thanks
> Ankit
