Re: Dictionary Matching using Concept Mapper for single word entry.

Jens Grivolla Thu, 23 Jul 2015 08:17:10 -0700

Hi Khirod,

Could it be that your single-word document doesn't get marked as a
sentence? You have SpanFeatureStructure set to com.naukri.parse.type.Sentence,
so ConceptMapper only looks at tokens that lie within a Sentence
annotation. Tokens that are not part of any Sentence are never seen.


This has happened to us when working on malformed text where some sentence
segmenters would leave parts of the text unmarked.
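
One way to check for this is to list the tokens that no Sentence covers.
The following is only a minimal sketch in plain Java, not ConceptMapper or
UIMA code: Span is a hypothetical stand-in for an annotation's begin/end
offsets, and uncoveredTokens is an illustrative helper name. In a real
pipeline you would read the offsets from your own TokenAnnotation and
Sentence annotations in the CAS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpanCoverageCheck {

    /** Hypothetical stand-in for an annotation: begin/end character offsets only. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        /** True if 'other' lies entirely inside this span. */
        boolean covers(Span other) {
            return begin <= other.begin && other.end <= end;
        }

        @Override
        public String toString() {
            return "[" + begin + "," + end + ")";
        }
    }

    /** Returns the tokens that are not inside any sentence span. */
    static List<Span> uncoveredTokens(List<Span> tokens, List<Span> sentences) {
        List<Span> uncovered = new ArrayList<>();
        for (Span token : tokens) {
            boolean covered = false;
            for (Span sentence : sentences) {
                if (sentence.covers(token)) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                uncovered.add(token);
            }
        }
        return uncovered;
    }

    public static void main(String[] args) {
        // Document text "Education": a single token at offsets 0..9.
        List<Span> tokens = Arrays.asList(new Span(0, 9));

        // Case 1: the segmenter produced no Sentence for the one-word text.
        List<Span> noSentences = new ArrayList<>();
        System.out.println("uncovered without sentence: "
                + uncoveredTokens(tokens, noSentences));

        // Case 2: a Sentence spans the whole text.
        List<Span> oneSentence = Arrays.asList(new Span(0, 9));
        System.out.println("uncovered with sentence: "
                + uncoveredTokens(tokens, oneSentence));
    }
}
```

If the first case matches what you see for the one-word input, the
sentence segmenter is the place to look rather than ConceptMapper.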

Best,
Jens

On Sun, Jul 19, 2015 at 4:00 PM, Khirod Kant Naik <kkantn...@gmail.com>
wrote:

> Hi everyone,
>
> I am unable to match text from dictionary if the enclosing span contains
> only a single token.
>
> For example, I am trying to match the word "education" from my dictionary,
> using a sentence as the enclosing span. If the sentence contains only a
> single token, I am not able to match it against the dictionary.
>
> Here is what I have tried,
>
> When I have a sentence like "Education <**something else**>",
> ConceptMapper matches "education".
> But when the sentence is just "Education", ConceptMapper does not
> pick it up from the dictionary.
>
> So my question is: *does ConceptMapper require more than one
> TokenAnnotation within the specified SpanFeatureStructure?*
>
> P.S.: This is the descriptor I am using:
>
> <?xml version="1.0" encoding="UTF-8"?>
> > <taeDescription xmlns="http://uima.apache.org/resourceSpecifier">
> >   <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
> >   <primitive>true</primitive>
> >   <annotatorImplementationName>org.apache.uima.conceptMapper.ConceptMapper</annotatorImplementationName>
> >   <analysisEngineMetaData>
> >     <name>Segment Heading Annotator</name>
> >     <description/>
> >     <version>1</version>
> >     <vendor/>
> >     <configurationParameters>
> >       <configurationParameter>
> >         <name>caseMatch</name>
> >         <description>this parameter specifies the case folding mode:
> >                     ignoreall - fold everything to lowercase for
> >                     matching insensitive - fold only tokens with initial
> >                     caps to lowercase digitfold - fold all (and only)
> >                     tokens with a digit sensitive - perform no case
> >                     folding</description>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>true</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>Stemmer</name>
> >         <description>Name of stemmer class to use before matching. MUST
> >                     have a zero-parameter constructor! If not specified,
> >                     no stemming will be performed.</description>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>ResultingAnnotationName</name>
> >         <description>Name of the annotation type created by this TAE,
> >                     must match the typeSystemDescription entry</description>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>true</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>ResultingEnclosingSpanName</name>
> >         <description>Name of the feature in the resultingAnnotation to
> >                     contain the span that encloses it (i.e. its
> >                     sentence)</description>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>AttributeList</name>
> >         <description>List of attribute names for XML dictionary entry
> >                     record - must correspond to FeatureList</description>
> >         <type>String</type>
> >         <multiValued>true</multiValued>
> >         <mandatory>true</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>FeatureList</name>
> >         <description>List of feature names for CAS annotation - must
> >                     correspond to AttributeList</description>
> >         <type>String</type>
> >         <multiValued>true</multiValued>
> >         <mandatory>true</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>TokenAnnotation</name>
> >         <description/>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>true</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>TokenClassFeatureName</name>
> >         <description>Name of feature used when doing lookups against
> >                     IncludedTokenClasses and ExcludedTokenClasses</description>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>TokenTextFeatureName</name>
> >         <description/>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>SpanFeatureStructure</name>
> >         <description>Type of annotation which corresponds to spans of
> >                     data for processing (e.g. a Sentence)</description>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>true</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>OrderIndependentLookup</name>
> >         <description>True if should ignore element order during lookup
> >                     (i.e., "top box" would equal "box top"). Default is
> >                     False.</description>
> >         <type>Boolean</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>TokenTypeFeatureName</name>
> >         <description>Name of feature used when doing lookups against
> >                     IncludedTokenTypes and ExcludedTokenTypes</description>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>IncludedTokenTypes</name>
> >         <description>Type of tokens to include in lookups (if not
> >                     supplied, then all types are included except those
> >                     specifically mentioned in ExcludedTokenTypes)</description>
> >         <type>Integer</type>
> >         <multiValued>true</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>ExcludedTokenTypes</name>
> >         <description/>
> >         <type>Integer</type>
> >         <multiValued>true</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>ExcludedTokenClasses</name>
> >         <description>Class of tokens to exclude from lookups (if not
> >                     supplied, then all classes are excluded except those
> >                     specifically mentioned in IncludedTokenClasses,
> >                     unless IncludedTokenClasses is not supplied, in
> >                     which case none are excluded)</description>
> >         <type>String</type>
> >         <multiValued>true</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>IncludedTokenClasses</name>
> >         <description>Class of tokens to include in lookups (if not
> >                     supplied, then all classes are included except those
> >                     specifically mentioned in ExcludedTokenClasses)</description>
> >         <type>String</type>
> >         <multiValued>true</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>TokenClassWriteBackFeatureNames</name>
> >         <description>names of features that should be written back to a
> >                     token, such as a POS tag</description>
> >         <type>String</type>
> >         <multiValued>true</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>ResultingAnnotationMatchedTextFeature</name>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>PrintDictionary</name>
> >         <type>Boolean</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>SearchStrategy</name>
> >         <description>Can be either "SkipAnyMatch",
> >                     "SkipAnyMatchAllowOverlap" or
> >                     "ContiguousMatch"&#13;&#13;ContiguousMatch: longest
> >                     match of contiguous tokens within enclosing
> >                     span(taking into account included/excluded items).
> >                     DEFAULT strategy &#13;SkipAnyMatch: longest match of
> >                     not-necessarily contiguous tokens within enclosing
> >                     span (taking into account included/excluded items).
> >                     Subsequent lookups begin in span after complete
> >                     match. IMPLIES order-independent lookup
> >                     &#13;SkipAnyMatchAllowOverlap: longest match of
> >                     not-necessarily contiguous tokens within enclosing
> >                     span (taking into account included/excluded items).
> >                     Subsequent lookups begin in span after next token.
> >                     IMPLIES order-independent lookup</description>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>StopWords</name>
> >         <type>String</type>
> >         <multiValued>true</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>FindAllMatches</name>
> >         <type>Boolean</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>MatchedTokensFeatureName</name>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>ReplaceCommaWithAND</name>
> >         <type>Boolean</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>TokenizerDescriptorPath</name>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>true</mandatory>
> >       </configurationParameter>
> >       <configurationParameter>
> >         <name>LanguageID</name>
> >         <type>String</type>
> >         <multiValued>false</multiValued>
> >         <mandatory>false</mandatory>
> >       </configurationParameter>
> >     </configurationParameters>
> >     <configurationParameterSettings>
> >       <nameValuePair>
> >         <name>caseMatch</name>
> >         <value>
> >           <string>ignoreall</string>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>AttributeList</name>
> >         <value>
> >           <array>
> >             <string>canonical</string>
> >             <string>group</string>
> >             <string>class</string>
> >           </array>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>FeatureList</name>
> >         <value>
> >           <array>
> >             <string>DictCanon</string>
> >             <string>group</string>
> >             <string>segmentClass</string>
> >           </array>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>TokenAnnotation</name>
> >         <value>
> >           <string>com.naukri.parse.type.TokenAnnotation</string>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>ResultingAnnotationName</name>
> >         <value>
> >           <string>com.naukri.parse.type.resume.SegmentHeading</string>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>SpanFeatureStructure</name>
> >         <value>
> >           <string>com.naukri.parse.type.Sentence</string>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>OrderIndependentLookup</name>
> >         <value>
> >           <boolean>false</boolean>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>TokenClassWriteBackFeatureNames</name>
> >         <value>
> >           <array/>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>IncludedTokenClasses</name>
> >         <value>
> >           <array/>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>PrintDictionary</name>
> >         <value>
> >           <boolean>false</boolean>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>FindAllMatches</name>
> >         <value>
> >           <boolean>true</boolean>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>StopWords</name>
> >         <value>
> >           <array/>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>ReplaceCommaWithAND</name>
> >         <value>
> >           <boolean>false</boolean>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>TokenizerDescriptorPath</name>
> >         <value>
> >           <string>desc/tokenizerAE.xml</string>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>ResultingEnclosingSpanName</name>
> >         <value>
> >           <string>enclosingSpan</string>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>MatchedTokensFeatureName</name>
> >         <value>
> >           <string>matchedTokens</string>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>ResultingAnnotationMatchedTextFeature</name>
> >         <value>
> >           <string>matchedText</string>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>SearchStrategy</name>
> >         <value>
> >           <string>ContiguousMatch</string>
> >         </value>
> >       </nameValuePair>
> >       <nameValuePair>
> >         <name>LanguageID</name>
> >         <value>
> >           <string>en</string>
> >         </value>
> >       </nameValuePair>
> >     </configurationParameterSettings>
> >     <typeSystemDescription>
> >       <imports>
> >         <import location="typesystem.xml"/>
> >       </imports>
> >     </typeSystemDescription>
> >     <typePriorities>
> >       <priorityList>
> >         <type>com.naukri.parse.type.TokenAnnotation</type>
> >       </priorityList>
> >     </typePriorities>
> >     <fsIndexCollection/>
> >     <capabilities>
> >       <capability>
> >         <inputs>
> >           <type allAnnotatorFeatures="true">com.naukri.parse.type.TokenAnnotation</type>
> >         </inputs>
> >         <outputs>
> >           <type allAnnotatorFeatures="true">com.naukri.parse.type.DictTerm</type>
> >           <type allAnnotatorFeatures="true">com.naukri.parse.type.TokenAnnotation</type>
> >           <type allAnnotatorFeatures="true">uima.tcas.DocumentAnnotation</type>
> >         </outputs>
> >         <languagesSupported/>
> >       </capability>
> >     </capabilities>
> >     <operationalProperties>
> >       <modifiesCas>true</modifiesCas>
> >       <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
> >       <outputsNewCASes>false</outputsNewCASes>
> >     </operationalProperties>
> >   </analysisEngineMetaData>
> >   <externalResourceDependencies>
> >     <externalResourceDependency>
> >       <key>DictionaryFile</key>
> >       <description>dictionary file loader.</description>
> >       <interfaceName>org.apache.uima.conceptMapper.support.dictionaryResource.DictionaryResource</interfaceName>
> >       <optional>false</optional>
> >     </externalResourceDependency>
> >   </externalResourceDependencies>
> >   <resourceManagerConfiguration>
> >     <externalResources>
> >       <externalResource>
> >         <name>segment_heading_dict</name>
> >         <description>A file containing the dictionary. Modify this URL to
> >                     use a different dictionary.</description>
> >         <fileResourceSpecifier>
> >           <fileUrl>dict/segment.heading.xml</fileUrl>
> >         </fileResourceSpecifier>
> >         <implementationName>org.apache.uima.conceptMapper.support.dictionaryResource.DictionaryResource_impl</implementationName>
> >       </externalResource>
> >     </externalResources>
> >     <externalResourceBindings>
> >       <externalResourceBinding>
> >         <key>DictionaryFile</key>
> >         <resourceName>segment_heading_dict</resourceName>
> >       </externalResourceBinding>
> >     </externalResourceBindings>
> >   </resourceManagerConfiguration>
> > </taeDescription>
> >
>
