Hi,
here are some quick rules. The task could also be solved with fewer rules,
or with better or faster ones. Essentially, you need one rule for detecting
the structure and one rule for assigning the semantics. The rules also
work for a plain-text table with more rows.
Let me know if you have questions about any part.
Best,
Peter
TYPESYSTEM utils.PlainTextTypeSystem;
ENGINE utils.PlainTextAnnotator;
DECLARE Header;
DECLARE ColumnDelimiter;
DECLARE Cell(INT column);
DECLARE Keyword (STRING label);
DECLARE Keyword UnderWriterNameKeyword, AppraiserNameLicenseKeyword, AppraisalCompanyNameKeyword;
"Underwriter's Name" -> UnderWriterNameKeyword ( "label" = "UnderWriter Name");
"Appraiser's Name/License" -> AppraiserNameLicenseKeyword ( "label" = "Appraiser Name");
"Appraisal Company Name" -> AppraisalCompanyNameKeyword ( "label" = "Appraisal Company Name");
DECLARE Entry(Keyword keyword);
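// Annotate lines and paragraphs, then trim surrounding whitespace.
EXEC(PlainTextAnnotator, {Line,Paragraph});
ADDRETAINTYPE(WS);
Line{->TRIM(WS)};
Paragraph{->TRIM(WS)};
// A run of 3 to 100 spaces is a column delimiter.
SPACE[3,100]{-PARTOF(ColumnDelimiter) -> ColumnDelimiter};
// Everything in a line that is not a delimiter becomes a cell.
Line -> {ANY+{-PARTOF(Cell),-PARTOF(ColumnDelimiter) -> Cell};};
REMOVERETAINTYPE(WS);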
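INT index = 0;
// Number the cells of each line from left to right.
// The first line of a paragraph is the header.
BLOCK(structure) Line{}{
    ASSIGN(index, 0);
    Line{STARTSWITH(Paragraph) -> Header};
    c:Cell{-> c.column = index, index = index + 1};
}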
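// Note: "#" is the Ruta wildcard here, not a comment. Each non-header cell becomes
// an Entry that points to the keyword of the header cell with the same column index.
Header<-{hc:Cell{hc.column == c.column}<-{k:Keyword;};}
    # c:@Cell{-PARTOF(Header) -> e:Entry, e.keyword = k};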
DECLARE Entity (STRING label, STRING value);
DECLARE Entity UnderWriterName, AppraiserNameLicense, AppraisalCompanyName;
FOREACH(entry) Entry{}{
    entry{ -> CREATE(UnderWriterName, "label" = k.label, "value" = entry.ct)}<-{k:entry.keyword{PARTOF(UnderWriterNameKeyword)};};
    entry{ -> CREATE(AppraiserNameLicense, "label" = k.label, "value" = entry.ct)}<-{k:entry.keyword{PARTOF(AppraiserNameLicenseKeyword)};};
    entry{ -> CREATE(AppraisalCompanyName, "label" = k.label, "value" = entry.ct)}<-{k:entry.keyword{PARTOF(AppraisalCompanyNameKeyword)};};
}
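
For readers who want to check the logic outside Ruta, the two steps above
(detect the structure, then assign the semantics) can be sketched in plain
Python. This is only an illustration under stated assumptions, not a
replacement for the rules: the names KEYWORDS, split_cells and
extract_entities are hypothetical, the first non-blank line is treated as
the header (the rules use the first line of each paragraph), and the regular
expression has no upper bound on the delimiter length, unlike SPACE[3,100].

```python
import re

# Hypothetical keyword table: header text -> label (mirrors the Keyword rules above).
KEYWORDS = {
    "Underwriter's Name": "UnderWriter Name",
    "Appraiser's Name/License": "Appraiser Name",
    "Appraisal Company Name": "Appraisal Company Name",
}

# Assumption: three or more spaces separate columns, similar to SPACE[3,100].
COLUMN_DELIMITER = re.compile(r" {3,}")


def split_cells(line):
    """Step 1 (structure): split one line into cells; list position = column index."""
    return [cell for cell in COLUMN_DELIMITER.split(line.strip()) if cell]


def extract_entities(text):
    """Step 2 (semantics): pair each non-header cell with the keyword of its column."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = split_cells(lines[0])
    entities = []
    for line in lines[1:]:
        for column, value in enumerate(split_cells(line)):
            # Only cells under a known header keyword produce an entity.
            if column < len(header) and header[column] in KEYWORDS:
                entities.append({"label": KEYWORDS[header[column]], "value": value})
    return entities


sample = (
    "Underwriter's Name      Appraiser's Name/License      Appraisal Company Name\n"
    "Alice Wheaton           Bruce Banner                  Stark Industries\n"
)
for entity in extract_entities(sample):
    print(entity["label"], "=", entity["value"])
```

As with the rules, the column index is simply the position of a cell within
its line, so an empty cell in the middle of a row would shift the following
values one column to the left.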
On 06.11.2019 at 12:45, Shashank Pathak wrote:
> Hi Peter,
>
> I am trying to get information from an indented text file.
>
> Input file text:
> Underwriter's Name      Appraiser's Name/License      Appraisal Company Name
> Alice Wheaton           Bruce Banner                  Stark Industries
>
> Approach:
> I am trying to annotate fixed keywords like "Underwriter's Name" and
> then go to the line after the annotated keyword.
> But I am not able to fetch the Underwriter's Name. The rule returns all
> matching instances (Alice Wheaton Bruce, Wheaton Bruce Banner,
> etc).
>
>
> Code :
>
> TYPESYSTEM utils.PlainTextTypeSystem;
> ENGINE utils.PlainTextAnnotator;
>
> EXEC(PlainTextAnnotator, {Line});
> ADDRETAINTYPE(WS);
> Line{->TRIM(WS)};
> REMOVERETAINTYPE(WS);
> Document{->FILTERTYPE(SPECIAL)};
>
> DECLARE UnderWriterKeyword, NameKeyword, UnderWriterNameKeyword;
> DECLARE UnderWriterName(String label, String value);
>
> CW{REGEXP("\\bUnderwriter") -> UnderWriterKeyword};
> CW{REGEXP("Name")->NameKeyword};
> (UnderWriterKeyword SW NameKeyword){->UnderWriterNameKeyword};
> Line{CONTAINS(UnderWriterNameKeyword)} Line -> {
>     n:CW[1,3]{-> CREATE(UnderWriterName, "label"="UnderWriter Name", "value"=n.ct)};
> };
>
> Please tell me whether this is possible with RUTA.
> Please also share the steps to get the Underwriter's Name, Appraiser's
> Name/License, and Appraisal Company Name.
> I have already posted a similar question on Stack Overflow:
> https://stackoverflow.com/questions/58726610/using-ruta-get-a-data-present-in-next-line-of-annotated-keyword/58728364#58728364
>
> Thanks,
>
> Shashank Pathak
>
--
Dr. Peter Klügl
R&D Text Mining/Machine Learning
Averbis GmbH