Re: Limit on input PDF file size in Tika?

2017-06-08 Thread Nick Burch

On Thu, 8 Jun 2017, tesm...@gmail.com wrote:

> Thanks for your reply. I am calling Apache Tika in Java code like this:
>
> public String extractPDFText(String faInputFileName)
>         throws IOException, TikaException {
>
>     // Handler for body text of the PDF article
>     BodyContentHandler handler = new BodyContentHandler();


Change this to "new BodyContentHandler(-1)" to remove the write limit.
More details are in the javadocs:

https://tika.apache.org/1.15/api/org/apache/tika/sax/BodyContentHandler.html#BodyContentHandler-int-

Nick
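
A note on why the no-argument constructor truncates output: according to the javadoc linked above, BodyContentHandler() buffers at most 100,000 characters and raises a SAXException once that limit is reached, while a limit of -1 disables the check. Because the calling code in this thread catches every exception, the method would then return whatever had been buffered so far. The sketch below reproduces that behaviour using only the JDK's SAX parser. WriteLimitDemo and LimitedTextHandler are hypothetical stand-ins written for illustration, not Tika classes, and the real implementation may differ in detail.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitDemo {

    /** Hypothetical text-collecting handler with a write limit (not Tika code). */
    static class LimitedTextHandler extends DefaultHandler {
        private final int writeLimit;                    // -1 means "no limit"
        private final StringBuilder text = new StringBuilder();

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit >= 0 && text.length() + length > writeLimit) {
                // Keep the characters that still fit, then abort the parse.
                text.append(ch, start, writeLimit - text.length());
                throw new SAXException("write limit of " + writeLimit + " characters reached");
            }
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Mirrors the method in the thread: the exception is swallowed, so partial text comes back. */
    static String extractText(String xml, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            System.out.println("Exception caught: " + e.getMessage());
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String xml = "<doc><p>0123456789</p><p>abcdefghij</p></doc>";
        System.out.println(extractText(xml, 15));   // returns "0123456789abcde"
        System.out.println(extractText(xml, -1));   // returns "0123456789abcdefghij"
    }
}
```

With a limit of 15 the call returns only the first 15 characters; with -1 it returns all 20. In the method from the thread, the corresponding change is the one Nick describes: new BodyContentHandler(-1).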


Re: Limit on input PDF file size in Tika?

2017-06-08 Thread tesm...@gmail.com
Thanks for your reply. I am calling Apache Tika in Java code like this:

public String extractPDFText(String faInputFileName)
        throws IOException, TikaException {

    // Handler for body text of the PDF article
    BodyContentHandler handler = new BodyContentHandler();

    // Metadata of the article
    Metadata metadata = new Metadata();

    // Input file path
    FileInputStream inputstream = new FileInputStream(new File(faInputFileName));

    // Parser context. It is used to parse the InputStream
    ParseContext pcontext = new ParseContext();

    try {
        // Parse the document using the PDF parser from Tika. A case
        // statement will be added for handling other file types.
        PDFParser pdfparser = new PDFParser();

        // Do the parsing by calling the parse function of pdfparser
        pdfparser.parse(inputstream, handler, metadata, pcontext);
    } catch (Exception e) {
        System.out.println("Exception caught:");
    }

    // Convert the body handler to a string and return it to the
    // calling function
    return handler.toString();
}

Regards,


On Thu, Jun 8, 2017 at 4:29 PM, Nick Burch wrote:

> On Thu, 8 Jun 2017, tesm...@gmail.com wrote:
>
>> My Tika code is not extracting the full body text of larger PDF files.
>>
>> Files more than 1 MB in size and around 20 pages are partially extracted.
>> Is there any limit on input PDF file size in Tika?
>>
>
> How are you calling Apache Tika? Direct Java calls to TikaConfig +
> AutoDetectParser? Using the Tika facade class? Using the Tika App on the
> command line? Tika Server? Other?
>
> Nick
>


Re: Limit on input PDF file size in Tika?

2017-06-08 Thread Nick Burch

On Thu, 8 Jun 2017, tesm...@gmail.com wrote:

> My Tika code is not extracting the full body text of larger PDF files.
>
> Files more than 1 MB in size and around 20 pages are partially extracted.
> Is there any limit on input PDF file size in Tika?

How are you calling Apache Tika? Direct Java calls to TikaConfig +
AutoDetectParser? Using the Tika facade class? Using the Tika App on the
command line? Tika Server? Other?


Nick


Grobid with TXT and HTML files

2017-06-08 Thread tesm...@gmail.com
Dear Thamme,


https://grobid.readthedocs.io/en/latest/grobid-04-2015.pdf

The above presentation says that Grobid supports raw text. My input files
are in TXT and HTML formats. Do you have any idea how these can be supported
as raw text?



Regards,




On Wed, May 3, 2017 at 6:16 PM, Thamme Gowda wrote:

> Hello,
>
> There is a nice project called Grobid [1] that does most of what you are
> describing.
> Tika has a Grobid parser built in (it calls Grobid over its REST API). Check
> out [2] for details.
>
> I have a project that makes use of Tika with Grobid and NER support. It
> also builds a search index using Solr.
> Check out [3] for setting up and [4] for parsing and indexing to Solr if
> you would like to try out my Python project.
> Here I am able to extract the title, author names, affiliations, and the
> whole text of articles.
> I did not extract sections within the main body of research articles. I
> assume there should be a way to configure it in Grobid.
>
> Alternatively, if Grobid can't detect sections, you can try the XHTML content
> handler, which preserves the basic structure of the PDF file using <p> and
> heading tags. So technically it should be possible to write a wrapper to
> break the XHTML output from Tika into sections.
>
> To get it:
>
> # In bash, run `pip install tika` if tika isn't already installed
> import tika
> tika.initVM()
> from tika import parser
>
>
> file_path = "/2538.pdf"
> data = parser.from_file(file_path, xmlContent=True)
> print(data['content'])
>
>
>
>
> Best,
> Thamme
>
> [1] http://grobid.readthedocs.io/en/latest/Introduction/
> [2] https://wiki.apache.org/tika/GrobidJournalParser
> [3] https://github.com/USCDataScience/parser-indexer-py/tree/master/parser-server
> [4] https://github.com/USCDataScience/parser-indexer-py/blob/master/docs/parser-index-journals.md
>
> *--*
> *Thamme Gowda*
> TG | @thammegowda 
> ~Sent via somebody's Webmail server!
>
> On Wed, May 3, 2017 at 9:34 AM, tesm...@gmail.com wrote:
>
>> Hi,
>>
>> I am working with published research articles using Apache Tika. These
>> articles have distinct sections like abstract, introduction, literature
>> review, methodology, experimental setup, discussion and conclusions. Is
>> there some way to extract document sections with Apache Tika?
>>
>> Regards,
>>
>
>
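
Thamme's suggestion above, a wrapper that breaks Tika's XHTML output into sections, might look roughly like the sketch below. It uses only the JDK's SAX parser and assumes the XHTML is well-formed and that section titles arrive as h1 to h6 elements; whether Tika emits heading tags for a given PDF depends on the document. XhtmlSectionSplitter, its split method, and the sample input are hypothetical and written for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XhtmlSectionSplitter {

    /** Returns a map of heading text to section text, in document order. */
    static Map<String, String> split(String xhtml) {
        Map<String, String> sections = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder heading = new StringBuilder();
            private final StringBuilder body = new StringBuilder();
            private String current = "(preamble)";   // text before the first heading
            private boolean inHeading = false;

            private boolean isHeading(String qName) {
                return qName.matches("[hH][1-6]");
            }

            // Store the text collected so far under the current heading.
            // Sections with no text of their own are skipped.
            private void flush() {
                String text = body.toString().trim();
                if (!text.isEmpty()) {
                    sections.merge(current, text, (a, b) -> a + "\n" + b);
                }
                body.setLength(0);
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (isHeading(qName)) {
                    flush();                 // close the previous section
                    heading.setLength(0);
                    inHeading = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (isHeading(qName)) {
                    current = heading.toString().trim();
                    inHeading = false;
                } else if (qName.equalsIgnoreCase("p")) {
                    body.append('\n');       // keep paragraph breaks
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                (inHeading ? heading : body).append(ch, start, length);
            }

            @Override
            public void endDocument() {
                flush();                     // store the last section
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        return sections;
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<h1>Abstract</h1><p>We study X.</p>"
                + "<h1>Introduction</h1><p>X matters.</p><p>Prior work exists.</p>"
                + "</body></html>";
        split(xhtml).forEach((title, text) -> System.out.println("[" + title + "] " + text));
    }
}
```

The map keeps sections in document order, so a caller can look up a section by its heading or iterate over all of them. Input without any heading tags ends up entirely under the "(preamble)" key.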