Re: Matching REPLACEd text

Peter Klügl Tue, 21 May 2019 04:06:04 -0700

Hi Dominik,


the REPLACE action is probably not what you are looking for. It belongs to
a use case where you want to set replacements for specific parts of
the document, e.g. for de-identification. However, as the document text is
static, these replacements only take effect later on, when a
different analysis engine applies them and essentially creates a new CAS.
Thus, applying the REPLACE action without that additional analysis engine
and another pipeline has no effect, as it only sets the value of a feature
in RutaBasic.


Fixing the OCR error and producing a new document text for further
processing is in general a good idea, but that action is probably not
the best solution for implementing it.


If the initial document text should not be changed in an additional step
(multiple pipelines or a multi-view CAS), then you can still implement the
use case by storing the corrected word in a feature, e.g., in a feature
"normalized" of a type "Token". This depends of course on your pipeline
and type system. Then, you would need some dictionary lookup on the
feature values. This is currently not supported by the wordlists in
Ruta; there is still an open feature request for this. Other dictionary
lookup components most likely support such functionality. The same
applies to stems instead of corrected OCR errors. Regarding Ruta rules,
you can simply match on the feature values as usual:
t:Token{t.normalized == "bla"} or t:Token{t.stem == "bla"}
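
As a rough, untested sketch of how this could look for the example
sentences: the type "Token" with its features "normalized" and "stem", the
type "WhatIWant" and the hand-written corrections are only assumptions for
illustration. In a real pipeline the feature values would rather be set by
an OCR correction or stemming component than by one rule per word.

```ruta
// Hypothetical types for illustration; adapt them to your own type system.
DECLARE Annotation Token (STRING normalized, STRING stem);
DECLARE WhatIWant;

// Store the corrected form and the stem in features instead of
// changing the document text.
"warking"{-> CREATE(Token, "normalized" = "working", "stem" = "work")};
"vvorks"{-> CREATE(Token, "normalized" = "works", "stem" = "work")};
"worked"{-> CREATE(Token, "normalized" = "worked", "stem" = "work")};
"work"{-> CREATE(Token, "normalized" = "work", "stem" = "work")};

// Match on the feature value; the new annotation still covers the
// original (uncorrected) text.
t:Token{t.stem == "work"} "hard"{-> MARK(WhatIWant, 1, 2)};
```

Since the annotations keep their offsets in the original text, the
WhatIWant annotation would cover "vvorks hard" exactly as it appears in the
scanned document.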


Some more comments to your specific question below...


Am 20.05.2019 um 09:19 schrieb Dominik Terweh:
>
> Dear All,
>
>  
>
> I am using uima to detect certain parts of contracts. Unfortunately
> the documents are not originals but scanned and due to the recognition
> of OCR I have a rather high percentage of errors. Furthermore I have
> some situations, where I would like to get the root or lemma of a word
> and match on their basis, so I thought the best solution for both of
> these problems would be the REPLACE() action, but unfortunately I seem
> not to get it working.
>
>  
>
> What I would like to achieve, given the sentences:
>
> “They worked hard”,
>
> “They were warking hard”,
>
> “He vvorks hard”,
>
> “I work hard”
>
> I would want to perform some OCR correction (“warking” -> “working”,
> “vvorks” -> “works”), like:
>
> WrongWord{-> REPLACE(CorrectWord)};
>
> And some stemming/lemmatizing (“working”,”works”,”worked” -> “work”),
> like:
>
>                 Word{-> REPLACE(Stem)};
>
> After that I would like to match on the replaced text, by simply using
> the stems, like:
>
>                 ANY “work” “hard”{-> MARK(WhatIWant, 2, 3)};
>
>  
>
> Now my main questions are:
>
>   * Is it possible to match on replaced text?
>
Yes, but it depends on the representation and the component. It is
possible to match on any feature values using Ruta rules, but not using
wordlists.

>   * If so, can I highlight it in the original text?
>

The highlighting depends on the CAS viewer and thus normally on the
type. For a different highlighting, you would need an additional type.


>   * Can I see the changed text in the Annotation Browser View?
>

You cannot see the replaced text in this view since it is represented in
a feature of RutaBasic which is hidden in that view.


>   * Do I first need to write the outcome to a file and then reread and
>     process it?
>

It depends on your overall use case and your pipeline setup. No, if you
only rely on matching on feature values.



Best,


Peter


>  
>
> I hope you can help me with my request,
>
> Dominik
>
> Dominik Terweh
> Praktikant
>
> *Drooms GmbH*
> Eschersheimer Landstraße 6
> 60322 Frankfurt, Germany
> www.drooms.com <http://www.drooms.com>
>
> Phone:        
> Mail:         d.ter...@drooms.com <mailto:d.ter...@drooms.com>
>
> <https://drooms.com/en/newsletter?utm_source=newslettersignup&utm_medium=emailsignature>
>
> *Drooms GmbH*; Sitz der Gesellschaft / Registered Office:
> Eschersheimer Landstr. 6, D-60322 Frankfurt am Main; Geschäftsführung
> / Management Board: Alexandre Grellier;
> Registergericht / Court of Registration: Amtsgericht Frankfurt am
> Main, HRB 76454; Finanzamt / Tax Office: Finanzamt Frankfurt am Main,
> USt-IdNr.: DE 224007190
>
-- 
Dr. Peter Klügl
R&D Text Mining/Machine Learning

Averbis GmbH
Salzstr. 15
79098 Freiburg
Germany

Fon: +49 761 708 394 0
Fax: +49 761 708 394 10
Email: peter.klu...@averbis.com
Web: https://averbis.com

Headquarters: Freiburg im Breisgau
Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
