Grobid with TXT and HTML files

[email protected] Thu, 08 Jun 2017 07:45:33 -0700

Dear Thamme,


https://grobid.readthedocs.io/en/latest/grobid-04-2015.pdf

The above presentation says that Grobid supports raw text. My input files
are in TXT and HTML formats. Do you have any idea how can this be supported
as raw text?



Regards,




On Wed, May 3, 2017 at 6:16 PM, Thamme Gowda <[email protected]> wrote:

> Hello,
>
> There is a nice project called Grobid [1] that does most of what you are
> describing.
> Tika has Grobid parser built in (it calls grobid over REST API) . checkout
> [2] for details
>
> I have a project that makes use of Tika with Grobid and NER support. It
> also builds a search index using solr.
> Check out [3] for setting up and [4] for parsing and indexing to solr if
> you like to try out my python project.
> Here I am able to extract title, author names, affiliations, and the whole
> text of articles.
> I did not extract sections within the main body of research articles.  I
> assume there should be a way to configure it in Grobid.
>
> Alternatively, if Grobid can't detect sections, you can try XHTML content
> handler which preserves the basic structure of PDF file using <p>  <br> and
> heading tags. So technically it should be possible to write a wrapper to
> break XHTML output from tika into sections
>
> To get it:
>
> # In bash do `pip install tika’ if tika isn’t already installed
> import tika
> tika.initVM()
> from tika import parser
>
>
> file_path = "<pdf_dir>/2538.pdf"
> data = parser.from_file(file_path, xmlContent=True)
> print(data['content'])
>
>
>
>
> Best,
> Thamme
>
> [1] http://grobid.readthedocs.io/en/latest/Introduction/
> [2] https://wiki.apache.org/tika/GrobidJournalParser
> [3] https://github.com/USCDataScience/parser-indexer-
> py/tree/master/parser-server
> [4] https://github.com/USCDataScience/parser-indexer-
> py/blob/master/docs/parser-index-journals.md
>
> *--*
> *Thamme Gowda*
> TG | @thammegowda <https://twitter.com/thammegowda>
> ~Sent via somebody's Webmail server!
>
> On Wed, May 3, 2017 at 9:34 AM, [email protected] <[email protected]>
> wrote:
>
>> Hi,
>>
>> I am working with published research articles using Apache Tika. These
>> articles have distinct sections like abstract, introduction, literature
>> review, methodology, experimental setup, discussion and conclusions. Is
>> there some way to extract document sections with Apache Tika
>>
>> Regards,
>>
>
>

Grobid with TXT and HTML files

Reply via email to