This is an automated email from the ASF dual-hosted git repository.
aradzinski pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-nlpcraft.git
The following commit(s) were added to refs/heads/master by this push:
new 74b7997 Update NCPipeline.java
74b7997 is described below
commit 74b7997e51172d171593dcfde90ca0dfdc90fe6f
Author: Aaron Radzinski <[email protected]>
AuthorDate: Tue Mar 29 15:27:40 2022 -0700
Update NCPipeline.java
---
.../main/scala/org/apache/nlpcraft/NCPipeline.java | 115 ++++++++++++++++-----
1 file changed, 92 insertions(+), 23 deletions(-)
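Note for readers of this patch: the Javadoc added below describes a pipeline with two mandatory parts (one token parser, at least one entity parser) and several optional parts that default to empty. The following is only a minimal sketch of that shape, not NLPCraft code. All types in it (TokenParser, EntityParser, VariantFilter, Pipeline, PipelineSketch) are hypothetical stand-ins that use plain strings; the real NCTokenParser, NCEntityParser and NCVariantFilter interfaces have their own signatures, and enrichers and validators are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical stand-ins for the NLPCraft types; the real interfaces differ.
final class PipelineSketch {
    interface TokenParser { List<String> tokenize(String text); }
    interface EntityParser { List<String> parse(List<String> tokens); }
    interface VariantFilter { List<List<String>> filter(List<List<String>> variants); }

    // Mirrors the shape of NCPipeline: mandatory getters plus an optional part with a default.
    interface Pipeline {
        TokenParser getTokenParser();
        List<EntityParser> getEntityParsers();
        default Optional<VariantFilter> getVariantFilter() { return Optional.empty(); }
    }

    // A minimal pipeline: whitespace tokenizer and one toy entity parser.
    static Pipeline minimal() {
        return new Pipeline() {
            @Override public TokenParser getTokenParser() {
                return text -> Arrays.asList(text.trim().split("\\s+"));
            }
            @Override public List<EntityParser> getEntityParsers() {
                // Toy rule (an assumption for illustration): capitalized tokens are entities.
                EntityParser capitalized = tokens -> tokens.stream()
                    .filter(t -> !t.isEmpty() && Character.isUpperCase(t.charAt(0)))
                    .collect(Collectors.toList());
                return Collections.singletonList(capitalized);
            }
        };
    }

    // Runs the stages in the documented order: tokenize, parse entities, filter variants.
    static List<List<String>> run(Pipeline p, String text) {
        List<String> tokens = p.getTokenParser().tokenize(text);
        List<EntityParser> parsers = p.getEntityParsers();
        if (parsers.isEmpty())
            throw new IllegalStateException("At least one entity parser is required.");
        // The collective output of all entity parsers is combined.
        List<String> entities = new ArrayList<>();
        for (EntityParser ep : parsers)
            entities.addAll(ep.parse(tokens));
        // This sketch produces a single variant; real parsing may yield several.
        List<List<String>> variants = Collections.singletonList(entities);
        return p.getVariantFilter().map(f -> f.filter(variants)).orElse(variants);
    }

    public static void main(String[] args) {
        System.out.println(run(minimal(), "Alice flew to Paris")); // prints [[Alice, Paris]]
    }
}
```

Because the optional getter has a default implementation, the anonymous class only has to supply the two mandatory parts, which is the design choice the patch documents.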
diff --git a/nlpcraft/src/main/scala/org/apache/nlpcraft/NCPipeline.java b/nlpcraft/src/main/scala/org/apache/nlpcraft/NCPipeline.java
index 867d20e..f25dd95 100644
--- a/nlpcraft/src/main/scala/org/apache/nlpcraft/NCPipeline.java
+++ b/nlpcraft/src/main/scala/org/apache/nlpcraft/NCPipeline.java
@@ -28,81 +28,150 @@ import java.util.Optional;
* pipeline and produce the list of {@link NCEntity entities} at the end of the pipeline.
* Schematically the pipeline looks like this:
* <pre>
- * +----------+ +-----------+
- * *=========* +---------+ +---+-------+ | +---+-------+ |
- * : Text : -> | Token | -> | Token | | -> | Token | | ----.
- * : Input : | Parser | | Enrichers |--+ | Validators |--+ \
- * *=========* +---------+ +-----------+ +------------+ \
- * }
- * +-----------+ +----------+ +--------+ /
- * *=========* +---+--------+ | +---+-------+ | +---+-----+ | /
- * : Entity : <- | Entity | | <- | Entity | | <- | Entity | | <-
- * : List : | Validators |--+ | Enrichers |--+ | Parsers |--+
- * *=========* +------------+ +-----------+ +---------+
+ *                                                ,---------.        ,----------.        ,-------.
+ *        o/    *=========*    ,---------.    ,---'-------. |    ,---'--------. |    ,---'-----. |
+ *       /|  -> :  Text   : -> |  Token  | -> |   Token   | | -> |   Token    | | -> | Entity  | |
+ *       / \    :  Input  :    | Parser  |    | Enrichers |-'    | Validators |-'    | Parsers |-'
+ *              *=========*    `---------'    `-----------'      `------------'      `---------'
+ *                                                                                        |
+ *                                                  ,----------.        ,---------.       |
+ *             *============*    ,---------.    ,---'--------. |    ,---'-------. |       |
+ *  Intent  <- :   Entity   : <- | Variant | <- |   Entity   | | <- |  Entity   | | <-----'
+ * Matching    :  Variants  :    | Filter  |    | Validators |-'    | Enrichers |-'
+ *             *============*    `---------'    `------------'      `-----------'
* </pre>
* <p>
* Pipeline has the following components:
* <ul>
* <li>
- * {@link NCTokenParser} is responsible for taking the input text and tokenize it into a list of
- * {@link NCToken}. This process is called tokenization, i.e. the process of demarcating and
- * classifying sections of a string of input characters. There's only one token parser for the pipeline.
+ * <p>
+ * {@link NCTokenParser} is responsible for taking the input text and tokenizing it into a list of
+ * {@link NCToken}. This process is called tokenization, i.e. the process of demarcating and
+ * classifying sections of a string of input characters. There is exactly one token parser per pipeline,
+ * and the token parser is a mandatory part of the pipeline.
+ * </p>
* </li>
* <li>
- * After the initial list of token is
+ * <p>
+ * After the initial list of tokens is created, one or more {@link NCTokenEnricher} are called to enrich
+ * each token. Enrichment consists of adding properties to the {@link NCToken} instance. Examples of enrichers
+ * include stopword detection, geo-location detection, POS tagging, etc. Token enrichers are optional and
+ * by default the list of token enrichers is empty.
+ * </p>
+ * </li>
+ * <li>
+ * <p>
+ * After all tokens are enriched, the {@link NCTokenValidator} instances are called. Token validators provide an opportunity
+ * to reject the input request at an early stage of token processing. Examples of token validation
+ * include curse word filtering, privacy checks, adult content blocking, etc. Token validators are optional
+ * and by default the list of token validators is empty.
+ * </p>
+ * </li>
+ * <li>
+ * <p>
+ * Once tokens are parsed, enriched and validated, they are passed into one or more {@link NCEntityParser}.
+ * An entity parser is responsible for taking a list of tokens and converting them into a list of entities, where
+ * an entity typically has a consistent semantic meaning and usually denotes a real-world object, such as
+ * a person, location, number, date and time, organization, product, etc. Such objects can be
+ * abstract or have a physical existence.
+ * </p>
+ * <p>
+ * At least one entity parser must be defined in the pipeline. If multiple parsers are defined, their collective
+ * output is combined for further processing. Note that it is possible, and in many cases required, that a single
+ * list of tokens can be converted to a list of entities in more than one way; each such way is called an {@link NCVariant}.
+ * Having multiple entity parsers allows this logic to be compartmentalized.
+ * </p>
+ * </li>
+ * <li>
+ * <p>
+ * Just like with tokens, once the entity list (or lists) is obtained, it goes through {@link NCEntityEnricher}.
+ * Entity enrichment consists of adding properties to the {@link NCEntity} instance. Entity enrichers are optional
+ * and by default the list of entity enrichers is empty. Examples of entity enrichment are always application
+ * specific since they deal with application-specific entities: it could be access tokens, special
+ * markers, etc.
+ * </p>
+ * </li>
+ * <li>
+ * <p>
+ * After entity enrichment is done, the list(s) of entities go through {@link NCEntityValidator}. Just like
+ * token validators, entity validators make it possible to reject the input request at the level of entity processing.
+ * Entity validators are optional and by default the list of entity validators is empty. Examples of entity
+ * validators include security checks, authentication and authorization, ACL checks, etc.
+ * </p>
+ * </li>
+ * <li>
+ * <p>
+ * Finally, there is an optional filter for {@link NCVariant} instances before they get into intent matching. This
+ * filter makes it possible to filter out unnecessary (or spurious) parsing variants based on application-specific logic.
+ * Note that the number of parsing variants directly affects the overall performance of intent matching.
+ * </p>
* </li>
* </ul>
*
- *
+ * @see NCEntity
+ * @see NCToken
+ * @see NCTokenParser
+ * @see NCTokenEnricher
+ * @see NCTokenValidator
+ * @see NCEntityParser
+ * @see NCEntityEnricher
+ * @see NCEntityValidator
*/
public interface NCPipeline {
/**
+ * Gets mandatory token parser.
*
- * @return
+ * @return Token parser.
*/
NCTokenParser getTokenParser();
/**
+ * Gets the list of entity parsers. At least one entity parser is required.
*
- * @return
+ * @return List of entity parsers. The list should contain at least one entity parser.
*/
List<NCEntityParser> getEntityParsers();
/**
+ * Gets optional list of token enrichers.
*
- * @return
+ * @return Optional list of token enrichers. Can be empty but never {@code null}.
*/
default List<NCTokenEnricher> getTokenEnrichers() {
return Collections.emptyList();
}
/**
+ * Gets optional list of entity enrichers.
*
- * @return
+ * @return Optional list of entity enrichers. Can be empty but never {@code null}.
*/
default List<NCEntityEnricher> getEntityEnrichers() {
return Collections.emptyList();
}
/**
+ * Gets optional list of token validators.
*
- * @return
+ * @return Optional list of token validators. Can be empty but never {@code null}.
*/
default List<NCTokenValidator> getTokenValidators() {
return Collections.emptyList();
}
/**
+ * Gets optional list of entity validators.
*
- * @return
+ * @return Optional list of entity validators. Can be empty but never {@code null}.
*/
default List<NCEntityValidator> getEntityValidators() {
return Collections.emptyList();
}
/**
+ * Gets optional variant filter.
*
- * @return
+ * @return Optional variant filter.
*/
default Optional<NCVariantFilter> getVariantFilter() {
return Optional.empty();