A while back I had a similar problem while extracting text from HTML with
Tika. I ended up hacking the Tika HTML parser to extract the text the way I
needed it. I can't remember the details, but as far as I remember Tika
raises events when it finds markup (at least HTML markup) that is not
handled by default. If you know the structure of the documents you are
reading, you can decide what to do with each piece of markup and adjust the
output, for example by adding a space or a line break.
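
A minimal sketch of that idea, not the original hack. Tika parsers report
what they find as SAX events to an org.xml.sax.ContentHandler, so a small
handler can insert a separator whenever an element ends. The class name
SpacingTextHandler, the element lists, and the extract() helper are
assumptions made for illustration; extract() drives the handler with the
JDK's own SAX parser so the example runs without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects text from SAX events and adds a separator where selected elements end. */
final class SpacingTextHandler extends DefaultHandler {
    // Assumption: which elements mark a boundary depends on your documents.
    private static final Set<String> LINE_BREAK =
            Set.of("p", "div", "li", "tr", "br", "h1", "h2", "h3", "h4", "h5", "h6");
    private static final Set<String> SPACE = Set.of("td", "th");

    private final StringBuilder out = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // localName is empty when the parser is not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        name = name.toLowerCase(Locale.ROOT);
        if (LINE_BREAK.contains(name)) {
            out.append('\n');
        } else if (SPACE.contains(name)) {
            out.append(' ');
        }
    }

    String text() {
        return out.toString();
    }

    /** Hypothetical helper: runs the handler over a well-formed XHTML string. */
    static String extract(String xhtml) {
        SpacingTextHandler handler = new SpacingTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
        return handler.text();
    }
}
```

With Tika, the same handler instance would be passed as the ContentHandler
argument of the parser's parse() call instead of going through extract().
Whether this helps for PDFs depends on which elements the PDF parser
actually emits, which I have not checked.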



2014-07-15 5:00 GMT-03:00 Jörn Kottmann <kottm...@gmail.com>:

> Text extracted from PDFs must often be cleaned up first, e.g.
> fix tokenization, remove page header/footer, fix hyphenation, detect
> headlines/titles, etc.
>
> If there are fundamental issues with the plain text, the OpenNLP components
> trained on cleaned text will not work very well.
>
> Jörn
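
Two of the cleanup steps listed above can be sketched with plain string
handling. This is an illustration under stated assumptions, not an OpenNLP
or Tika feature: the class PdfTextCleaner and its methods are hypothetical,
the text is assumed to be already split into pages and lines, and a line is
treated as a header or footer when it recurs on at least minPages pages
once digits are masked.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PdfTextCleaner {

    /** Joins a word split by a hyphen at a line end: "offer-\nings" becomes "offerings". */
    static String fixHyphenation(String text) {
        // Only joins when a lowercase letter follows the break.
        return text.replaceAll("(\\p{L})-\\r?\\n(\\p{Ll})", "$1$2");
    }

    /** Drops lines that recur on at least minPages pages; blank lines are kept. */
    static List<String> dropRepeatedLines(List<List<String>> pages, int minPages) {
        // Count on how many pages each normalized line appears.
        Map<String, Integer> pageCount = new HashMap<>();
        for (List<String> page : pages) {
            page.stream()
                .map(PdfTextCleaner::key)
                .distinct()
                .forEach(k -> pageCount.merge(k, 1, Integer::sum));
        }
        List<String> kept = new ArrayList<>();
        for (List<String> page : pages) {
            for (String line : page) {
                String k = key(line);
                if (k.isEmpty() || pageCount.get(k) < minPages) {
                    kept.add(line);
                }
            }
        }
        return kept;
    }

    /** Masks digits so "Page 2" and "Page 3" count as the same line. */
    private static String key(String line) {
        return line.replaceAll("\\d+", "#").trim();
    }
}
```

Note that fixHyphenation also joins genuine compounds that happen to break
at a line end ("well-\nknown" becomes "wellknown"), so a dictionary check
would be needed to do this properly.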
>
>
> On 07/15/2014 05:38 AM, Carlos Scheidecker wrote:
>
>> Hello all,
>>
>> I have an interesting problem here. More of a challenge.
>>
>> I have been doing text cleansing for bad characters and all.
>>
>> Then I have another interesting problem.
>>
>> Extracting a public PDF with Tika does not necessarily mean you will get
>> clean text, because the original PDF might use different fonts within a
>> section, which can cause weird behavior.
>>
>> If you then divide it into sentences with OpenNLP, you get some
>> interesting sentences.
>>
>> Trying to parse those sentences makes it worse.
>>
>> I am showing an example below, and I would like to ask about solutions to
>> it, considering that the text can be noisy.
>>
>> I do not think that it will be easy to fix the Sentence Parser. Here is
>> how I am thinking of approaching it instead:
>>
>> Take the sentences that were poorly split, parse them, and extract the
>> inner (S) nodes from the parse as separate sentences.
>>
>> What would you suggest?
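
The "extract the inner (S)" idea can be sketched as follows. This is only an
illustration: it reads the bracketed string form of a parse and returns the
words under every S node that has no other S below it. The class name
InnerClauses and its methods are hypothetical. If you hold an OpenNLP Parse
object, the same walk should be possible directly on its children, which I
am assuming rather than showing here.

```java
import java.util.ArrayList;
import java.util.List;

final class InnerClauses {

    /** A node of a bracketed parse: a label plus child nodes or a single word. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        String word; // set only on part-of-speech (leaf) nodes

        Node(String label) {
            this.label = label;
        }
    }

    private final String src;
    private int pos = 0;

    private InnerClauses(String src) {
        this.src = src;
    }

    /** Parses a string such as "(S (NP (PRP They)) (VP (VBP stack)))". */
    static Node parse(String bracketed) {
        return new InnerClauses(bracketed.trim()).node();
    }

    private Node node() {
        expect('(');
        Node n = new Node(token());
        skipSpace();
        while (pos < src.length() && src.charAt(pos) != ')') {
            if (src.charAt(pos) == '(') {
                n.children.add(node());
            } else {
                n.word = token();
            }
            skipSpace();
        }
        expect(')');
        return n;
    }

    private String token() {
        skipSpace();
        int start = pos;
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (Character.isWhitespace(c) || c == '(' || c == ')') break;
            pos++;
        }
        return src.substring(start, pos);
    }

    private void skipSpace() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    private void expect(char c) {
        skipSpace();
        if (pos >= src.length() || src.charAt(pos) != c) {
            throw new IllegalArgumentException("expected '" + c + "' at offset " + pos);
        }
        pos++;
    }

    /** Returns the text of every S node that contains no other S node. */
    static List<String> innermost(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    // Returns true if this subtree contains an S node (including n itself).
    private static boolean collect(Node n, List<String> out) {
        boolean sBelow = false;
        for (Node c : n.children) {
            sBelow |= collect(c, out);
        }
        if (!n.label.equals("S")) return sBelow;
        if (!sBelow) out.add(String.join(" ", words(n, new ArrayList<>())));
        return true;
    }

    private static List<String> words(Node n, List<String> acc) {
        if (n.word != null) acc.add(n.word);
        for (Node c : n.children) words(c, acc);
        return acc;
    }
}
```

Two caveats: words that sit outside every inner S are dropped, and the
innermost S can be a small fragment of a longer clause, so choosing S nodes
at some fixed depth below the root may work better on text this noisy.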
>>
>> Here is an example of a piece of text extracted with Tika from a public
>> PDF. This part is what OpenNLP considered to be a single sentence:
>>
>> ----
>>
>> related research DocumentsBrief: Your Next Portal should Be An Engagement
>> WorkplaceFebruary 3, 2014Microsoft Aims sharePoint To The CloudJanuary 27,
>> 2014setting The Technology Foundation For Your social Business And
>> Collaboration strategyJuly 29, 2013The Forrester wave : enterprise social
>> Platforms, Q2 2014The 13 Providers That Matter Most And How They stack
>> Upby
>> rob Koplowitzwith Peter Burris and Nancy Wang2257913JUNE 5, 2014For CIos
>> The Forrester Wave : Enterprise social Platforms, Q2 2014 2  2014,
>> Forrester Research, Inc. Reproduction Prohibited June 5, 2014 eNTeRPRIse
>> sOCIaL PLaTFORM MaRKeT MaTuRes aMID CONsOLIDaTION aND INTegRaTIONThe
>> enterprise social platform is no longer in its infancy as offerings become
>> increasingly functional.
>>
>> ----
>>
>> It is now parsed as follows:
>>
>>
>> (S (S (S (NP (VBN related) (NN research) (NNP DocumentsBrief:) (NNP Your)
>> (NNP Next) (NNP Portal)) (VP (MD should) (VP (VB Be) (NP (NP (DT An) (NNP
>> Engagement) (NNP WorkplaceFebruary) (CD 3,) (JJ 2014Microsoft) (NNP Aims)
>> (NN sharePoint)) (PP (TO To) (NP (DT The) (NNP CloudJanuary))))))) (VP
>> (VBD
>> 27,) (S (VP (VBG 2014setting) (NP (NP (NP (DT The) (NNP Technology) (NNP
>> Foundation)) (PP (IN For) (NP (PRP$ Your) (JJ social) (NNP Business))))
>> (CC
>> And) (NP (NNP Collaboration))) (PP (RB strategyJuly) (NP (CD 29,) (CD
>> 2013The) (NNP Forrester) (NN wave))))))) (: :) (S (VP (VB enterprise) (NP
>> (JJ social) (NN Platforms,)) (PP (IN Q2) (NP (NP (DT 2014The) (CD 13) (NNS
>> Providers)) (NP (NP (DT That) (NNP Matter) (JJS Most)) (SBAR (S (CC And)
>> (SBAR (WHADVP (WRB How)) (S (NP (PRP They)) (VP (VBP stack) (PP (IN Upby)
>> (NP (NP (NN rob)) (PP (IN Koplowitzwith) (NP (NP (NNP Peter) (NNP Burris))
>> (CC and) (NP (NP (NNP Nancy) (NNP Wang2257913JUNE) (CD 5,)) (PP (IN
>> 2014For) (NP (NP (NNP CIos)) (NP (DT The) (NNP Forrester) (NNP Wave)) (:
>> :)
>> (S (NP (NP (NP (NP (NN Enterprise) (JJ social) (NN Platforms,)) (PP (IN
>> Q2)
>> (NP (CD 2014) (CD 2) (JJ 2014,) (NNP Forrester) (NNP Research,) (NNP Inc.)
>> (NNP Reproduction) (NNP Prohibited) (NNP June) (CD 5,) (CD 2014) (NN
>> eNTeRPRIse) (NN sOCIaL)))) (NP (NNP PLaTFORM) (NNP MaRKeT) (NNP MaTuRes)
>> (NN aMID) (NN CONsOLIDaTION))) (PP (IN aND) (NP (DT INTegRaTIONThe) (NN
>> enterprise) (JJ social) (NN platform)))) (VP (VP (VBZ is) (ADVP (RB no)
>> (RB
>> longer)) (PP (IN in) (NP (PRP$ its) (NN infancy))) (SBAR (IN as) (S (NP
>> (NNS offerings)) (VP (VBP become) (ADVP (RB increasingly)))))) (VBG
>> functional.)))))))))))))))))))))
>>
>>
>> Notice that I have more than one (S (S (S
>>
>> And then I have the first correct structure as (S (NP ..... (VP.....
>>
>> What is the best way to deal with it?
>>
>>
>
