Thanks for your reply. I am calling Apache Tika in Java code like this:

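    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;

    public String extractPDFText(String faInputFileName) throws IOException, TikaException {

        // Handler for the body text of the PDF article
        BodyContentHandler handler = new BodyContentHandler();

        // Metadata of the article
        Metadata metadata = new Metadata();

        // Input file path
        FileInputStream inputstream = new FileInputStream(new File(faInputFileName));

        // Parser context, passed to the parser along with the InputStream
        ParseContext pcontext = new ParseContext();

        try {
            // Parse the document using the PDF parser from Tika. A case statement
            // will be added for handling other file types.
            PDFParser pdfparser = new PDFParser();

            // Do the parsing by calling the parse function of pdfparser
            pdfparser.parse(inputstream, handler, metadata, pcontext);
        } catch (Exception e) {
            // Print the exception itself so the cause is not hidden
            System.out.println("Exception caught: " + e);
        } finally {
            // Release the file handle whether or not parsing succeeded
            inputstream.close();
        }

        // Convert the body handler to a string and return it to the calling function
        return handler.toString();
    }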
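
One thing that may explain the partial output, offered as a guess rather than a confirmed diagnosis: as far as I know, the no-argument BodyContentHandler constructor caps the collected text at 100,000 characters and reports the overflow with a SAXException, which the catch block above would swallow. Passing -1, as in new BodyContentHandler(-1), should disable that cap. The sketch below uses only the JDK's SAX classes to mimic the behaviour; LimitedTextHandler, extract and the 1000-character "pages" are hypothetical stand-ins for illustration, not Tika code.

```java
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitDemo {

    // Hypothetical stand-in for a content handler with a character write limit.
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int writeLimit; // a negative value means "no limit"

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit >= 0 && text.length() + length > writeLimit) {
                // Keep only what still fits, then signal that the limit was hit.
                text.append(ch, start, writeLimit - text.length());
                throw new SAXException("write limit of " + writeLimit + " characters reached");
            }
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Feeds `pages` chunks of 1000 characters to the handler, roughly as a parser would.
    static String extract(int pages, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        char[] page = new char[1000];
        Arrays.fill(page, 'x');
        try {
            for (int i = 0; i < pages; i++) {
                handler.characters(page, 0, page.length);
            }
        } catch (SAXException e) {
            // Report the cause; silently ignoring it would hide the truncation.
            System.err.println("Extraction stopped early: " + e.getMessage());
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(200, 100_000).length()); // prints 100000
        System.out.println(extract(200, -1).length());      // prints 200000
    }
}
```

With the limit in place only the first 100,000 characters survive, which would look like a partially extracted document if the exception is never printed.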

Regards,


On Thu, Jun 8, 2017 at 4:29 PM, Nick Burch <apa...@gagravarr.org> wrote:

> On Thu, 8 Jun 2017, tesm...@gmail.com wrote:
>
>> My Tika code is not extracting the full body text of larger PDF files.
>>
>> Files of more than 1 MB in size and around 20 pages are only partially
>> extracted. Is there any limit on input PDF file size in Tika?
>>
>
> How are you calling Apache Tika? Direct java calls to TikaConfig +
> AutoDetectParser? Using the Tika facade class? Using the Tika App on the
> command line? Tika Server? Other?
>
> Nick
>
