[Devicemap Wiki] Update of "DataSpec2" by rezan

Apache Wiki Sun, 18 Jan 2015 00:02:47 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Devicemap Wiki" for 
change notification.


The "DataSpec2" page has been changed by rezan:
https://wiki.apache.org/devicemap/DataSpec2?action=diff&rev1=39&rev2=40

   domain::
   :: a versioned pattern and attribute file
  
+ 
+ 
+ = Input Parsing =
+ 
+ This step parses the input string and creates the token stream.
+ 
+ Each pattern file defines these input parsing rules:
+ 
+  InputTransformers::
+  :: Type: list of transformers
+  :: Optional. Default: none
+ 
+  TokenSeparators::
+  :: Type: list of token separator strings
+  :: Optional. Default: none
+ 
+  NgramConcatSize::
+  :: Type: integer, greater than zero
+  :: Optional. Default: 1
+ 
+ The input string is first processed through the transformers.
+ Then it is tokenized using the configured separators. Then ngram
+ concatenation happens. The final result of these 3 steps is the token stream.
+ 
+ 
+ === Notes ===
+ 
+ Empty tokens are removed during the tokenization step.
+ 
+ When a token is added to the token stream, it can be processed by the
+ pattern matching step before moving on to the next token. This algorithm is pipeline safe
+ and thread safe.
+ 
+ If the Ngram``Concat``Size is greater than 1, ngrams must be added to the token stream ordered largest to smallest.
+ 
+ 
+ === Example ===
+ 
+ {{{
+ InputTransformers: Lowercase(), ReplaceAll(find: '-', replaceWith: '')
+ TokenSeparators:   [space]
+ NgramConcatSize:   2
+ 
+ Input string:  'A 12 x-yZ'
+ 
+ Transform:     'a 12 xyz'
+ 
+ Tokenization:  a, 12, xyz
+ 
+ Ngram:         a12, a, 12xyz, 12, xyz
+ }}}
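
The three input parsing steps can be sketched in Python as follows. This is a minimal illustration, not part of the specification: the function names (`lowercase`, `replace_all`, `tokenize`, `ngrams`, `parse_input`) are hypothetical, transformers are modeled as plain callables, and separators are assumed to be non-empty strings.

```python
from typing import Callable, List

def lowercase(s: str) -> str:
    return s.lower()

def replace_all(find: str, replace_with: str) -> Callable[[str], str]:
    # Returns a transformer that replaces every occurrence of 'find'.
    def apply(s: str) -> str:
        return s.replace(find, replace_with)
    return apply

def tokenize(s: str, separators: List[str]) -> List[str]:
    # Split on each configured separator in turn, then drop empty tokens.
    tokens = [s]
    for sep in separators:
        tokens = [part for tok in tokens for part in tok.split(sep)]
    return [t for t in tokens if t]

def ngrams(tokens: List[str], size: int) -> List[str]:
    # At each token position, emit the largest ngram first, then smaller ones.
    stream = []
    for i in range(len(tokens)):
        for n in range(min(size, len(tokens) - i), 0, -1):
            stream.append("".join(tokens[i:i + n]))
    return stream

def parse_input(s: str, transformers, separators, ngram_size: int = 1) -> List[str]:
    for transform in transformers:
        s = transform(s)
    return ngrams(tokenize(s, separators), ngram_size)

# Reproduces the example above:
# parse_input('A 12 x-yZ', [lowercase, replace_all('-', '')], [' '], 2)
# returns ['a12', 'a', '12xyz', '12', 'xyz']
```

A real classifier could hand each token to the pattern matching step as soon as it is produced; the sketch builds the whole list only for readability.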
+ 
+ 
+ 
+ = Pattern Matching =
+ 
+ This step processes the token stream and returns the highest ranking candidate pattern.
+ 
+ The pattern file defines a pattern set. All patterns in the pattern set are evaluated to find the candidates.
+ 
+ Each pattern has 2 main attributes:
+ its pattern type and its pattern rank. The pattern
+ type defines how the pattern is matched against the token stream.
+ The pattern rank defines how the pattern ranks against other patterns.
+ 
+ All the pattern types in 2.0 are prefixed with 'Simple'. This means that each pattern token is matched
+ using a plain byte or string comparison. No regex or other syntax is allowed in Simple patterns.
+ This allows the algorithm to use simple byte or string hashing for matching, which gives performance and scaling characteristics comparable to a hashtable implementation. A Simple``Hash``Count attribute can optionally be defined to hint to the classifier how many unique hashes it would need to generate to support the pattern set.
+ 
+ Pattern attributes:
+ 
+  PatternId::
+  :: Type: string
+  :: Required.
+ 
+  RankType::
+  :: Type: string
+  :: Required.
+ 
+  RankValue::
+  :: Type: integer, -1000 to 1000
+  :: Optional. Default: 0.
+ 
+  PatternType::
+  :: Type: string
+  :: Required.
+ 
+  PatternTokens::
+  :: Type: list of pattern token strings
+  :: Required.
+ 
+ Pattern set attributes:
+ 
+  DefaultId::
+  :: Type: string
+  :: Optional. Default: none
+ 
+  SimpleHashCount::
+  :: Type: integer, greater than zero
+  :: Optional. Default: none. Must be defined before the pattern set.
+ 
+ 
+ == PatternType ==
+ 
+ The following pattern types are defined:
+ 
+  SimpleOrderedAnd::
+  :: Each pattern token must appear in the token stream in index order, as defined in the Pattern``Tokens list. Unmatched tokens may appear in between matched tokens as long as the matched tokens are still in order.
+ 
+  SimpleAnd::
+  :: Each pattern token must appear in the token stream. Order does not matter.
+ 
+  Simple::
+  :: At least one pattern token must appear in the token stream.
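
The three pattern types can be sketched as a single Python predicate. This is an illustration under assumptions: the function name `matches` is hypothetical, the token stream is assumed to be fully available as a list, and Simple is read as "at least one pattern token appears".

```python
from typing import List

def matches(pattern_type: str, pattern_tokens: List[str], stream: List[str]) -> bool:
    if pattern_type == "Simple":
        # Any single pattern token found in the stream is enough.
        return any(tok in stream for tok in pattern_tokens)
    if pattern_type == "SimpleAnd":
        # Every pattern token must be present; order is irrelevant.
        return all(tok in stream for tok in pattern_tokens)
    if pattern_type == "SimpleOrderedAnd":
        # Subsequence check: 'tok in remaining' consumes the iterator up to
        # the match, so later tokens must appear after earlier ones.
        remaining = iter(stream)
        return all(tok in remaining for tok in pattern_tokens)
    # The specification asks for an initialization error on unsupported
    # definitions; raising here is a simplification of that rule.
    raise ValueError("unsupported PatternType: " + pattern_type)
```

A production classifier would more likely hash the pattern tokens once and test each incoming token against the hash table, rather than scanning the stream per pattern as this sketch does.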
+ 
+ 
+ == RankType ==
+ 
+ The following rank types are defined:
+ 
+  Strong::
+  :: Strong patterns are ranked higher than Weak and None. The Rank``Value is ignored and they are ranked by their position in the token stream, specifically the position of the last matched token. The lower the position, the higher the rank. When a Strong pattern is found, the pattern matching step can stop and this pattern can be returned without analyzing the rest of the stream, because it is impossible for another pattern to rank higher.
+ 
+  Weak::
+  :: Weak patterns are ranked below Strong but above None. A Weak candidate can only be returned in the absence of a Strong candidate. Weak candidates always rank higher than None candidates, regardless of their Rank``Value. The Rank``Value is used to rank Weak patterns against each other.
+ 
+  None::
+  :: None patterns are ranked below Strong and Weak. A None candidate can only be returned in the absence of Strong and Weak candidates. The Rank``Value is used to rank None patterns against each other.
+ 
+ If 2 or more candidates have the same Rank``Type and Rank``Value, resulting in a tie,
+ the candidate with the longest concatenated matched pattern length is used. If that results in
+ another tie, the candidate with the first matched token found is returned.
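
The ranking and tie-breaking rules above can be sketched as a sort key. This is only one possible reading: the `Candidate` record, its fields, and the `best` function are hypothetical, and "first matched token found" is interpreted here as the candidate whose first matched token has the lowest stream position.

```python
from dataclasses import dataclass
from typing import List, Optional

RANK_ORDER = {"None": 0, "Weak": 1, "Strong": 2}

@dataclass
class Candidate:
    pattern_id: str
    rank_type: str
    rank_value: int
    positions: List[int]  # stream positions of the matched tokens, ascending
    matched: List[str]    # the pattern tokens that matched

def rank_key(c: Candidate):
    if c.rank_type == "Strong":
        # RankValue is ignored; a lower last matched position ranks higher.
        primary = -c.positions[-1]
    else:
        primary = c.rank_value
    matched_length = len("".join(c.matched))  # first tie-breaker
    earliest = -c.positions[0]                # second tie-breaker (assumed reading)
    return (RANK_ORDER[c.rank_type], primary, matched_length, earliest)

def best(candidates: List[Candidate], default_id: Optional[str] = None) -> Optional[str]:
    # With no candidates, fall back to the DefaultId (None models a null pattern).
    if not candidates:
        return default_id
    return max(candidates, key=rank_key).pattern_id
```

Because Python compares tuples element by element, the key encodes the precedence directly: rank type first, then position or Rank``Value, then the two tie-breakers.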
+ 
+ 
+ === Notes ===
+ 
+ If no candidate patterns are found, the Default``Id is returned. If no
+ Default``Id is defined, a null pattern is returned.
+ 
+ 2 or more patterns may share the same Pattern``Id. These patterns function completely independently of each other.
+ 
+ 2 or more patterns must not have identical Rank``Type, Rank``Value, and pattern tokens. This results in undefined behavior when the patterns are candidates, since they have identical rank. The classifier is free to choose any one candidate in this situation.
+ 
+ New pattern types and ranks can be introduced in future specifications. If a classifier encounters a definition it cannot support, it must immediately return an initialization error.
+ 
+ 
+ === Examples ===
+ 
+ {{{
+ Pattern:
+   PatternId: p1
+   RankType: Strong
+   PatternType: Simple
+   PatternTokens: bingo, jackpot
+ 
+ Pattern:
+   PatternId: p2
+   RankType: Weak
+   RankValue: 100
+   PatternType: SimpleOrderedAnd
+   PatternTokens: two, four, six
+ 
+ Pattern:
+   PatternId: p3
+   RankType: None
+   RankValue: 1000
+   PatternType: Simple
+   PatternTokens: two, four, six
+ 
+ Token stream: one, two, three, four, five, six, seven
+ Pattern: p2
+ 
+ Token stream: one, two, three, six, five, four, seven
+ Pattern: p3
+ 
+ Token stream: one, two, three, four, five, six, bingo, seven
+ Pattern: p1
+ }}}
+ 
+ 
+ 
+ = Attribute Retrieval =
+ 
+ This step processes the result of the Pattern Matching step. The Pattern``Id is used
+ to look up the corresponding attribute map. The Pattern``Id and the attribute map
+ are returned.
+ 
+ 
+ === Attribute Parsing ===
+ 
+ An attribute map can contain attribute values which are parsed out of the input string.
+ This is done by configuring the attribute as a set of transformers. The attribute can also
+ have a default value, which is used if the transformers return an error.
+ 
+ 
+ === Notes ===
+ 
+ If no attribute map is found, an empty map is used.
+ 
+ If a null pattern is returned from the previous step, this must be properly returned to the user.
+ A null pattern must be discernible from a user defined pattern.
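
Attribute retrieval, including parsed attributes and their error handling, might look like the Python sketch below. The names `TransformError` and `retrieve_attributes` and the shape of the two lookup tables are assumptions for illustration; the specification does not define them. A null pattern is modeled as `None`, which cannot collide with a user defined string Pattern``Id.

```python
from typing import Callable, Dict, List, Optional, Tuple

class TransformError(Exception):
    """Raised by a transformer that cannot process its input (hypothetical)."""

Transformer = Callable[[str], str]
# Parsed attribute configuration: name -> (transformer chain, optional default)
ParsedConfig = Dict[str, Tuple[List[Transformer], Optional[str]]]

def retrieve_attributes(
    pattern_id: Optional[str],
    attribute_maps: Dict[str, Dict[str, str]],
    parsed: Dict[str, ParsedConfig],
    input_string: str,
) -> Tuple[Optional[str], Dict[str, str]]:
    # If no attribute map is found, an empty map is used.
    attrs = dict(attribute_maps.get(pattern_id, {}))
    for name, (transformers, default) in parsed.get(pattern_id, {}).items():
        value = input_string
        try:
            for transform in transformers:
                value = transform(value)
            attrs[name] = value
        except TransformError as err:
            # Non-fatal: record the error and fall back to the default or blank.
            attrs[name] = default if default is not None else ""
            attrs[name + "_error"] = str(err)
    return pattern_id, attrs
```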
+ 
+ 
+ 
+ = Transformers =
+ 
+ Transformers accept a string, apply an action, and then return a string.
+ If multiple transformers are defined in a set, they are chained: the output of each
+ transformer is the input of the next.
+ Transformers are used in the input parsing phase and the attribute retrieval phase.
+ 
+ Transformers can cause errors. Errors in input parsing are fatal: input parsing
+ is immediately stopped and an error is returned to the user. Errors in attribute retrieval
+ are not fatal. The error is written to [attribute]_error and the attribute is set to the default value,
+ if configured, or a blank value. [attribute]_error is a reserved attribute name.
+ 
+ The following transformer functions are supported:
+ 
+  Lowercase::
+  :: Description: converts the input to all lowercase
+  :: Return: the input in lowercase
+ 
+  Uppercase::
+  :: Description: converts the input to all uppercase
+  :: Return: the input in uppercase
+ 
+  ReplaceFirst::
+  :: Description: replace the first occurrence of a string with another string
+  :: Parameter - find: the substring to replace
+  :: Parameter - replaceWith: the string to replace 'find' with
+  :: Return: the string with the replacement made
+ 
+  ReplaceAll::
+  :: Description: replace all occurrences of a string with another string
+  :: Parameter - find: the substring to replace
+  :: Parameter - replaceWith: the string to replace 'find' with
+  :: Return: the string with the replacements made
+ 
+  Substring::
+  :: Description: return a substring of the input
+  :: Parameter - start: the starting index, 0 based
+  :: Parameter - maxLength: optional. If defined, the maximum number of characters to return.
+  :: Return: the specified substring
+  :: Error: if 'start' is out of bounds
+ 
+  SplitAndGet::
+  :: Description: split the input and return a part of the split
+  :: Parameter - delimiter: the delimiter to use for splitting. If not found, the entire string is part 0. Empty parts are ignored.
+  :: Parameter - get: the part of the split string to return, 0 based index. -1 is the last part.
+  :: Return: the specified part of the split string
+  :: Error: if the 'get' index does not exist
+ 
+  IsNumber::
+  :: Description: checks if the input is a number
+  :: Return: the input string
+  :: Error: if the input is not a number
+ 
+ 
+ === Notes ===
+ 
+ New transformers can be introduced in future specifications. If a classifier encounters a definition it cannot support, it must immediately return an initialization error.
+ 
+ 
+ === Examples ===
+ 
+ {{{
+ Input string: 'aaa bbb 123 ccc'
+ 
+ Transformers:
+ 
+ SplitAndGet(delimiter: 'ccc', get: 0)
+ Result: 'aaa bbb 123 '
+ 
+ SplitAndGet(delimiter: ' ', get: -1)
+ Result: '123'
+ 
+ IsNumber()
+ Result: '123'
+ }}}
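
The chain in the example above can be reproduced with a short Python sketch. The names `TransformError`, `split_and_get`, and `is_number` are hypothetical. Two details are assumptions, since the specification leaves them open: a "number" is taken to mean a non-empty run of ASCII digits, and any negative `get` other than -1 is treated as a missing index.

```python
class TransformError(Exception):
    """Signals a transformer error (hypothetical name)."""

def split_and_get(s: str, delimiter: str, get: int) -> str:
    # Empty parts are ignored; if the delimiter is absent, the whole
    # string is part 0.
    parts = [p for p in s.split(delimiter) if p]
    index = len(parts) - 1 if get == -1 else get
    if index < 0 or index >= len(parts):
        raise TransformError("part %d does not exist" % get)
    return parts[index]

def is_number(s: str) -> str:
    # Assumption: only non-empty ASCII digit strings count as numbers.
    if not (s.isascii() and s.isdigit()):
        raise TransformError("input is not a number")
    return s

def apply_chain(s: str, transformers) -> str:
    # The output of each transformer is the input of the next.
    for transform in transformers:
        s = transform(s)
    return s

chain = [
    lambda s: split_and_get(s, "ccc", 0),
    lambda s: split_and_get(s, " ", -1),
    is_number,
]
# apply_chain('aaa bbb 123 ccc', chain) returns '123'
```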
+ 
+ 
+ 
+ = Patch Files =
+ 
+ The pattern and attribute files can be patched with a user created pattern and
+ attribute file. In this case, parsing configurations are overridden, the pattern sets are appended (patterns can be overridden using pattern ranking), and attributes
+ are overridden using the Pattern``Id.
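
The three patching rules could be sketched in Python as below. Since the JSON format is still marked TODO, the dictionary keys (`parsing`, `patterns`, `attributes`) and the function name `apply_patch` are purely hypothetical. The sketch also assumes that a patched attribute map replaces the base map for the same Pattern``Id as a whole, rather than merging attribute by attribute.

```python
from typing import Any, Dict

def apply_patch(base: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    return {
        # Parsing configurations: values in the patch override the base.
        "parsing": {**base.get("parsing", {}), **patch.get("parsing", {})},
        # Pattern sets: patch patterns are appended after the base patterns.
        "patterns": list(base.get("patterns", [])) + list(patch.get("patterns", [])),
        # Attributes: keyed by PatternId, the patch entry replaces the base entry.
        "attributes": {**base.get("attributes", {}), **patch.get("attributes", {})},
    }
```

Because patterns are only appended, a patch that needs to supersede a base pattern would give its own pattern a higher rank, for example a Strong Rank``Type or a larger Rank``Value.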
+ 
+ 
+ 
- === Format ===
+ = Format =
  
  The pattern and attribute files are JSON objects. These files will contain:
  
-  * Format version
   * Specification version
   * Type (pattern, attribute, etc)
   * Domain name
@@ -67, +356 @@

   * Description
   * Publish date
  
- The files will also contain the attributes defined below in this 
specification.
+ TODO: define the JSON and example 
  
- TODO: define the 1.0 JSON format spec
- 
- 
- 
- = Input Parsing =
- 
- This step parses the input string and creates the token stream.
- 
- Each pattern file defines these input parsing rules:
- 
-  InputTransformers::
-  :: Type: list of transformers
-  :: Optional. Default: none
- 
-  TokenSeparators::
-  :: Type: list of token seperator strings
-  :: Optional. Default: none
- 
-  NgramConcatSize::
-  :: Type: integer, greater than zero
-  :: Optional. Default: 1
- 
- The input string first gets processed thru the transformers.
- Then it gets tokenized using the configured seperators. Then ngram
- concatenation happens. The final result of these 3 steps is the token stream.
- 
- 
- === Notes ===
- 
- Empty tokens are removed from the tokenization step.
- 
- When a token is added to the token stream, it can be processed by the
- pattern matching step before moving on to the next token. This algorithm is 
pipeline
- and thread safe.
- 
- If the Ngram``Concat``Size is greater than 1, ngrams must be added to the 
token stream ordered largest to smallest.
- 
- 
- === Example ===
- 
- {{{
- InputTransformers: Lowercase(), ReplaceAll(find: '-', replaceWith: '')
- TokenSeparators:   [space]
- NgramConcatSize:   2
- 
- Input string:  'A 12 x-yZ'
- 
- Transform:     'a 12 xyz'
- 
- Tokenization:  a, 12, xyz
- 
- Ngram:         a12, a, 12xyz, 12, xyz
- }}}
- 
- 
- 
- = Pattern Matching =
- 
- This step processes the token stream and returns the highest ranking 
candidate pattern.
- 
- The pattern file defines a pattern set. All patterns in the pattern set are 
evaluated to find the candidates.
- 
- Each pattern has 2 main attributes,
- its pattern type and its pattern rank. The pattern
- type defines how the pattern is supposed to be matched against the token 
stream.
- The pattern rank defines how the pattern ranks against other patterns.
- 
- All the pattern types in 2.0 are prefixed with 'Simple'. This means that each 
pattern token is matched
- using a plain byte or string comparison. No regex or other syntax is allowed 
in Simple patterns.
- This allows the algorithm to use simple byte or string hashing for matching. 
This gives maximum performance and scaling complexity equal to a hashtable 
implementation. A Simple``Hash``Count attribute can be optionally defined which 
hints the classifier as to how many unique hashes it would need to generate to 
support the pattern set.
- 
- Pattern attributes:
- 
-  PatternId::
-  :: Type: string
-  :: Required.
- 
-  RankType::
-  :: Type: string
-  :: Required.
- 
-  RankValue::
-  :: Type: integer, -1000 to 1000
-  :: Optional. Default: 0.
- 
-  PatternType::
-  :: Type: string
-  :: Required.
- 
-  PatternTokens::
-  :: Type: list of pattern token strings
-  :: Required.
- 
- Pattern set attributes:
- 
-  DefaultId::
-  :: Type: string
-  :: Optional. Default: none
- 
-  SimpleHashCount::
-  :: Type: integer, greater than zero
-  :: Optional. Default: none. Must be defined before the pattern set.
- 
- 
- == PatternType ==
- 
- The following pattern types are defined:
- 
-  SimpleOrderedAnd::
-  :: Each pattern token must appear in the token stream in index order, as 
defined in the Pattern``Tokens list. Its okay for non matched tokens to appear 
inbetween matched tokens as long as the matched tokens are still in order.
- 
-  SimpleAnd::
-  :: Each pattern token must appear in the token stream. Order does not matter.
- 
-  Simple::
-  :: Only one pattern token must appear in the token stream.
- 
- 
- == RankType ==
- 
- The following rank types are defined:
- 
-  Strong::
-  :: Strong patterns are ranked higher than Weak and None. The Rank``Value is 
ignored and they are ranked by their position in the pattern stream. 
Specifically, the last matched token position. The lower the position, the 
higher the rank. When a Strong pattern is found, the pattern matching step can 
stop and this pattern can be returned without analyzing the rest of the stream. 
This is because its impossible for another pattern to rank higher.
- 
-  Weak::
-  :: Weak patterns are ranked below Strong but above None. A Weak candidate 
can only be returned in the absence of a Strong candidate. Weak candidates 
always rank higher than None candidates, regardless of their Rank``Value. The 
Rank``Value is used to rank between other Weak patterns.
- 
-  None::
-  :: None patterns are ranked below Strong and Weak. A None candidate can only 
be returned in the absence of Strong and Weak candidates. The Rank``Value is 
used to rank between other None patterns.
- 
- In the case where 2 or more candidates have the same Rank``Type and 
Rank``Value resulting in a tie,
- the candidate with the longest concatenated matched pattern length is used. 
If that results in
- another tie, the candidate with the first matched token found is returned.
- 
- 
- === Notes ===
- 
- If no candidate patterns are found, the Default``Id is returned. If no
- Default``Id is defined, a null pattern is returned.
- 
- 2 or more patterns may share the same Pattern``Id. These patterns function 
completely independent of each other.
- 
- 2 or more patterns cannot have identical Rank``Type, Rank``Value, and pattern 
tokens. This results in undefined behavior when the patterns are candidates 
since they have identical rank. The classifier is free to choose any one 
candidate in this situation.
- 
- New pattern types and ranks can be introduced in future specifications. If a 
classifier encounters a definition it cannot support, it must immediately 
return an initialization error.
- 
- 
- === Examples ===
- 
- {{{
- Pattern:
-   PatternId: p1
-   RankType: Strong
-   PatternType: Simple
-   PatternTokens: bingo, jackpot
- 
- Pattern:
-   PatternId: p2
-   RankType: Weak
-   RankValue: 100
-   PatternType: SimpleOrderedAnd
-   PatternTokens: two, four, six
- 
- Pattern:
-   PatternId: p3
-   RankType: None
-   RankValue: 1000
-   PatternType: Simple
-   PatternTokens: two, four, six
- 
- Token stream: one, two, three, four, five, six, seven
- Pattern: p2
- 
- Token stream: one, two, three, six, five, four, seven
- Pattern: p3
- 
- Token stream: one, two, three, four, five, six, bingo, seven
- Pattern: p1
- }}}
- 
- 
- 
- = Attribute Retrieval =
- 
- This step processes the result of the Pattern Matching step. The Pattern``Id 
is used
- to look up the corresponding attribute map. The Pattern``Id and the attribute 
map
- are returned.
- 
- 
- === Attribute Parsing ===
- 
- An attribute map can contain attributes values which are parsed out of the 
input string.
- This is done by configuring the attribute as a set of transformers. The 
attribute can also
- have a default value if the transformers return an error.
- 
- 
- === Notes ===
- 
- If no attribute map is found, an empty map is used.
- 
- If a null pattern is returned from the previous step, this must be properly 
returned to the user.
- A null pattern must be discernible from a user defined pattern.
- 
- 
- 
- = Transformers =
- 
- Transformers accept a string, apply an action, and then return a string.
- If multiple transformers are defined in a set, the outputs and inputs are
- linked together.
- Transformers are used in the input parsing phase and the attribute retrieval 
phase.
- 
- Transformers can cause errors. Errors in input parsing are fatal, input 
parsing
- is immediately stopped and an error is returned to the user. Errors in 
attribute retrieval
- are okay. The error is written to [attribute]_error and the attribute is set 
to the default value,
- if configured, or a blank value. [attribute]_error is a reserved attribute 
name.
- 
- The following transformer functions are supported:
- 
-  Lowercase::
-  :: Description: converts the input to all lowercase
-  :: Return: the input in lowercase
- 
-  Uppercase::
-  :: Description: converts the input to all uppercase
-  :: Return: the input in uppercase
- 
-  ReplaceFirst::
-  :: Description: replace the first occurrence of a string with another string
-  :: Parameter - find: the substring to replace
-  :: Parameter - replaceWith: the string to replace 'find' with
-  :: Return: the string with the replacement made
- 
-  ReplaceAll::
-  :: Description: replace all occurrences of a string with another string
-  :: Parameter - find: the substring to replace
-  :: Parameter - replaceWith: the string to replace 'find' with
-  :: Return: the string with the replacements made
- 
-  Substring::
-  :: Description: return a substring of the input
-  :: Parameter - start: the starting index, 0 based
-  :: Parameter - maxLength: optional. If defined, the maximum amount of 
characters to return.
-  :: Return: the specified substring
-  :: Error: if 'start' is out of bounds
- 
-  SplitAndGet::
-  :: Description: split the input and return a part of the split
-  :: Parameter - delimiter: the delimiter to use for splitting. If not found, 
the entire string is part 0. Empty parts are ignored.
-  :: Parameter - get: the part of the split string to return, 0 based index. 
-1 is the last part.
-  :: Return: the specified part of the split string
-  :: Error: if the 'get' index does not exist
- 
-  IsNumber::
-  :: Description: checks if the input is a number
-  :: Return: the input string
-  :: Error: if the input is not a number
- 
- 
- === Notes ===
- 
- New transformers can be introduced in future specifications. If a classifier 
encounters a definition it cannot support, it must immediately return an 
initialization error.
- 
- 
- === Examples ===
- 
- {{{
- Input string: 'I am 47 years old.'
- 
- Transformers:
- 
- SplitAndGet(delimiter: 'years old', get: 0)
- Result: 'I am 47 '
- 
- SplitAndGet(delimiter: ' ', get: -1)
- Result: '47'
- 
- IsNumber()
- Result: '47'
- }}}
- 
- 
- 
- = Patch Files =
- 
- The pattern and attribute files can be patched with a user created pattern and
- attribute file. In this case, parsing configurations override, the pattern 
sets get appended (you can override using pattern ranking), and attributes
- override using the Pattern``Id.
-
