Re: How to sanitize and parse noisy text

2014-07-16 Thread Carlos Scheidecker
Guys.

Thanks for the replies. That is exactly the question: how to segment better
before using the sentence parser. If I cannot segment it better, then
another choice is to use the parser for segmentation, as I have stated.

Tika is crappy and, no, I do not know the PDFs' structures
beforehand. Also, Tika does not handle PDF by itself; instead it uses another
Java library for it, as I have checked.

So I guess I still need to think about how to fix the text beforehand as well.
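
One possible way to fix the text beforehand, given as a rough sketch only: the sample in this thread has words fused where a line break was apparently lost ("WorkplaceFebruary", "2014Microsoft"), so a regex can re-insert a break at those boundaries. The function name `unfuse` and the three boundary rules are assumptions made for illustration, not anything from Tika or OpenNLP.

```python
import re

# Boundaries where a lost line break is likely (heuristic, assumed):
#   lowercase letter followed by a capitalized word: "WorkplaceFebruary"
#   digit followed by a letter:                      "2014Microsoft"
#   lowercase letter followed by a digit:            "Wang2257913"
_FUSED = re.compile(
    r"(?<=[a-z])(?=[A-Z][a-z])"
    r"|(?<=\d)(?=[A-Za-z])"
    r"|(?<=[a-z])(?=\d)"
)

def unfuse(text, sep="\n"):
    """Insert `sep` at every position that looks like a fused boundary."""
    return _FUSED.sub(sep, text)

if __name__ == "__main__":
    print(unfuse("An Engagement WorkplaceFebruary 3, 2014Microsoft Aims"))
```

The rules are deliberately crude. They also split legitimate camel-case names such as "sharePoint" and ordinals such as "1st", so a whitelist or a dictionary check would be needed before relying on this.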




Re: How to sanitize and parse noisy text

2014-07-15 Thread Rodrigo Agerri
Hi Carlos,

In my opinion, you would need to properly segment that sentence. It
is virtually impossible for the parser to get anything right if you pass
it such sentences. Perhaps you can use the newlines in your cleaned
text to create shorter, more grammatical sentences. Also, if I had to
deal with such text, and depending on what the aim is, I would ask myself
whether I actually need to use constituent parsing :)
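
A minimal sketch of the newline idea, assuming the extracted text still contains its line breaks (the sample quoted below has already lost them). Every line break is treated as a hard boundary, and a naive regex stands in for the sentence detector that would normally run on each line; `segment_on_newlines` is a made-up name.

```python
import re

# Naive stand-in for a real sentence detector: split after ., ! or ?
# when followed by whitespace and an uppercase letter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def segment_on_newlines(text):
    """Return candidate sentences, never letting one span a line break."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        segments.extend(part for part in _SENTENCE_END.split(line) if part)
    return segments

if __name__ == "__main__":
    sample = (
        "Related Research Documents\n"
        "February 3, 2014\n"
        "The platform is mature. Offerings are functional."
    )
    for segment in segment_on_newlines(sample):
        print(segment)
```

This keeps titles, dates and bylines as separate short segments, which can then be filtered out or parsed on their own instead of being glued onto the next real sentence.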

HTH,

Rodrigo

On Tue, Jul 15, 2014 at 4:38 AM, Carlos Scheidecker nando@gmail.com wrote:
 Hello all,

 I have an interesting problem here. More of a challenge.

 I have been doing text cleansing for bad characters and all.

 Then I have another interesting problem.

 Extracting a public PDF with Tika does not necessarily mean you will get clean
 text, because the original PDF might have different fonts within a section
 that will cause weird behaviors.

 If you then divide it into sentences via OpenNLP, you will get some
 interesting sentences.

 Trying to parse those sentences makes it even worse.

 I am showing an example below and I would like to ask about solutions to
 it, considering the text can be noisy.

 I do not think that it will be easy to fix the sentence parser. Here is
 how I am thinking of approaching it:

 Instead, the best way is to look at the poorly parsed sentences,
 parse them, and extract the inner (S) nodes from the parse as separate sentences.

 What would you suggest?

 Here is an example of a piece of text extracted with Tika from a public
 PDF. This part is what OpenNLP considered to be a sentence:

 

 related research DocumentsBrief: Your Next Portal should Be An Engagement
 WorkplaceFebruary 3, 2014Microsoft Aims sharePoint To The CloudJanuary 27,
 2014setting The Technology Foundation For Your social Business And
 Collaboration strategyJuly 29, 2013The Forrester wave : enterprise social
 Platforms, Q2 2014The 13 Providers That Matter Most And How They stack Upby
 rob Koplowitzwith Peter Burris and Nancy Wang2257913JUNE 5, 2014For CIos
 The Forrester Wave : Enterprise social Platforms, Q2 2014 2  2014,
 Forrester Research, Inc. Reproduction Prohibited June 5, 2014 eNTeRPRIse
 sOCIaL PLaTFORM MaRKeT MaTuRes aMID CONsOLIDaTION aND INTegRaTIONThe
 enterprise social platform is no longer in its infancy as offerings become
 increasingly functional.

 

 It is now parsed as follows:


 (S (S (S (NP (VBN related) (NN research) (NNP DocumentsBrief:) (NNP Your)
 (NNP Next) (NNP Portal)) (VP (MD should) (VP (VB Be) (NP (NP (DT An) (NNP
 Engagement) (NNP WorkplaceFebruary) (CD 3,) (JJ 2014Microsoft) (NNP Aims)
 (NN sharePoint)) (PP (TO To) (NP (DT The) (NNP CloudJanuary))) (VP (VBD
 27,) (S (VP (VBG 2014setting) (NP (NP (NP (DT The) (NNP Technology) (NNP
 Foundation)) (PP (IN For) (NP (PRP$ Your) (JJ social) (NNP Business (CC
 And) (NP (NNP Collaboration))) (PP (RB strategyJuly) (NP (CD 29,) (CD
 2013The) (NNP Forrester) (NN wave))) (: :) (S (VP (VB enterprise) (NP
 (JJ social) (NN Platforms,)) (PP (IN Q2) (NP (NP (DT 2014The) (CD 13) (NNS
 Providers)) (NP (NP (DT That) (NNP Matter) (JJS Most)) (SBAR (S (CC And)
 (SBAR (WHADVP (WRB How)) (S (NP (PRP They)) (VP (VBP stack) (PP (IN Upby)
 (NP (NP (NN rob)) (PP (IN Koplowitzwith) (NP (NP (NNP Peter) (NNP Burris))
 (CC and) (NP (NP (NNP Nancy) (NNP Wang2257913JUNE) (CD 5,)) (PP (IN
 2014For) (NP (NP (NNP CIos)) (NP (DT The) (NNP Forrester) (NNP Wave)) (: :)
 (S (NP (NP (NP (NP (NN Enterprise) (JJ social) (NN Platforms,)) (PP (IN Q2)
 (NP (CD 2014) (CD 2) (JJ 2014,) (NNP Forrester) (NNP Research,) (NNP Inc.)
 (NNP Reproduction) (NNP Prohibited) (NNP June) (CD 5,) (CD 2014) (NN
 eNTeRPRIse) (NN sOCIaL (NP (NNP PLaTFORM) (NNP MaRKeT) (NNP MaTuRes)
 (NN aMID) (NN CONsOLIDaTION))) (PP (IN aND) (NP (DT INTegRaTIONThe) (NN
 enterprise) (JJ social) (NN platform (VP (VP (VBZ is) (ADVP (RB no) (RB
 longer)) (PP (IN in) (NP (PRP$ its) (NN infancy))) (SBAR (IN as) (S (NP
 (NNS offerings)) (VP (VBP become) (ADVP (RB increasingly)) (VBG
 functional.)


 Notice that I have more than one (S (S (S

 And then I have the first correct structure as (S (NP . (VP.

 What is the best way to deal with it?
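
The idea of pulling inner (S ...) nodes out of a parse could be sketched roughly as follows. This is only an illustration under stated assumptions: it reads a bracketed parse string with balanced brackets (the parse pasted above appears to have lost some closing brackets in the archive, so it would not load as-is), and it treats an S node with both an NP and a VP child as a usable clause. All function names are made up for the example.

```python
def parse_brackets(s):
    """Parse a bracketed tree into (label, children); leaves are strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def words(tree):
    """Collect the leaf words of a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in words(child)]

def is_clause(tree):
    """True for an S node that has both an NP and a VP as direct children."""
    if isinstance(tree, str) or tree[0] != "S":
        return False
    labels = [child[0] for child in tree[1] if not isinstance(child, str)]
    return "NP" in labels and "VP" in labels

def inner_clauses(tree):
    """Return the text of the outermost clause-like S nodes."""
    if isinstance(tree, str):
        return []
    if is_clause(tree):
        return [" ".join(words(tree))]
    return [c for child in tree[1] for c in inner_clauses(child)]

if __name__ == "__main__":
    parse = ("(S (S (NP (DT The) (NN platform)) (VP (VBZ is) (ADJP (JJ mature))))"
             " (S (NP (NNS Offerings)) (VP (VBP become) (ADJP (JJ functional)))))")
    for clause in inner_clauses(parse_brackets(parse)):
        print(clause)
```

On a parse as noisy as the one above, the first (S (NP ...) (VP ...)) node may still span most of the input, so this kind of splitting is only as good as the parse it starts from.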


Re: How to sanitize and parse noisy text

2014-07-15 Thread William Colen
A while back I had a similar problem while extracting text from HTML using
Tika. What I did was to hack the Tika HTML parser to extract the text as I
needed. I can't remember exactly how it was, but as far as I remember Tika
raises events when it finds markup (at least HTML markup) that is not
handled by default. If you know the structure of the document you are
reading, you can decide what to do with the markup and maybe change the
output (adding a space, a line break, etc.).



2014-07-15 5:00 GMT-03:00 Jörn Kottmann kottm...@gmail.com:

 Text extracted from PDFs must often be cleaned up first, e.g.
 fix tokenization, remove page header/footer, fix hyphenation, detect
 headlines/titles, etc.

 If there are fundamental issues with the plain text, the OpenNLP components
 trained on cleaned text will not work very well.

 Jörn
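
Two of the cleanup steps listed above, fixing hyphenation and removing page headers/footers, could be sketched like this. It is a rough illustration that assumes the text is available page by page as a list of strings (how to get that from the extractor is not covered here); the function names and the `min_fraction` threshold are made up for the example.

```python
import re
from collections import Counter

def fix_hyphenation(text):
    """Rejoin words that were split with a hyphen at a line break."""
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)

def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that occur on at least `min_fraction` of the pages."""
    if len(pages) < 2:
        return list(pages)  # nothing to compare against
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    threshold = min_fraction * len(pages)
    repeated = {ln for ln, n in counts.items() if n >= threshold}
    return [
        "\n".join(ln for ln in page.splitlines() if ln.strip() not in repeated)
        for page in pages
    ]

if __name__ == "__main__":
    pages = [
        "Forrester Research, Inc.\nThe platform is no longer in its in-\nfancy.",
        "Forrester Research, Inc.\nOfferings become functional.",
    ]
    for page in strip_repeated_lines(pages):
        print(fix_hyphenation(page))
```

Headers that change from page to page, such as page numbers, are not caught by exact matching, and genuine hyphenated compounds that happen to break at a line end lose their hyphen, so both steps need checking against real output.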

