ilovemesomeramen opened a new pull request, #1639:
URL: https://github.com/apache/systemds/pull/1639
This PR extends the existing tokenisation to work with multi-threading, and additionally adds some minor features and bug fixes.
MINOR bug fix: removed hardcoded thread counts in `MultiColumnEncoder`.
Tokenisation was moved into the `build`/`apply` paradigm that was introduced for `transformencode`, and the multi-threading implementation uses the same structure. In the `build` stage the input is split into tokens and stored in an internal representation; additional metadata needed in the `apply` phase is computed as well. In the `apply` stage the precomputed data is retrieved and written to the output. The current implementation splits the input frame into row partitions; the default is 64, which can be changed with the `sysds.parallel.tokenize.numBlocks` configuration. A minimal sketch of the two-phase scheme follows this paragraph.
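For illustration, here is a minimal Java sketch of the two-phase, row-partitioned scheme; the `TokenizerBuilder`/`TokenizerApplier` interfaces and the driver are hypothetical stand-ins, not the actual SystemDS classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-ins for the build/apply stages, not the SystemDS API.
interface TokenizerBuilder { void build(int rowStart, int rowEnd); }
interface TokenizerApplier { void apply(int rowStart, int rowEnd); }

public class RowBlockTokenize {
    // Tokenize numRows rows in row partitions of numBlocks using k threads.
    public static void tokenize(int numRows, int numBlocks, int k,
            TokenizerBuilder builder, TokenizerApplier applier) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(k);
        try {
            int blkSize = (numRows + numBlocks - 1) / numBlocks;
            // Build phase: split rows into tokens, compute per-block metadata.
            List<Future<?>> tasks = new ArrayList<>();
            for (int lo = 0; lo < numRows; lo += blkSize) {
                final int start = lo, end = Math.min(numRows, lo + blkSize);
                tasks.add(pool.submit(() -> builder.build(start, end)));
            }
            for (Future<?> t : tasks)
                t.get(); // barrier: all metadata must exist before apply
            // Apply phase: write the precomputed tokens to the output.
            tasks.clear();
            for (int lo = 0; lo < numRows; lo += blkSize) {
                final int start = lo, end = Math.min(numRows, lo + blkSize);
                tasks.add(pool.submit(() -> applier.apply(start, end)));
            }
            for (Future<?> t : tasks)
                t.get();
        }
        finally {
            pool.shutdown();
        }
    }
}
```

The barrier between the two phases mirrors `transformencode`, where the build metadata must be complete before any apply task starts.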
This is a first implementation with some known issues: (1) memory consumption: the execution DAG is not yet well optimized for memory; this could be fixed in the future by computing subsets one after another (only possible when padding is enabled). (2) cache performance: similar to (1); computing subsets first in a cache-aware manner (e.g., unrolling loops) could increase performance.
There is still quite some redundant code in the `TokenizerApplier` classes, although this is not easy to clean up without a major refactor of the previous implementation.
Multi-threading is disabled by default at the moment and can be activated via the `sysds.parallel.tokenize` config; see the example configuration below.
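For example, in a `SystemDS-config.xml` this might look as follows (the property values are just examples; the `<root>` element follows the existing config file format):

```xml
<root>
   <!-- enable multi-threaded tokenization (disabled by default) -->
   <sysds.parallel.tokenize>true</sysds.parallel.tokenize>
   <!-- number of row partitions the input frame is split into (default: 64) -->
   <sysds.parallel.tokenize.numBlocks>128</sysds.parallel.tokenize.numBlocks>
</root>
```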
Introduced `ngram_type` as a configuration for the `ngram` tokenizer. It differentiates between `token` and `document`.
`token` creates the n-grams over each token: e.g., if your tokens are ['hello', 'this', 'is', 'a', 'nice', 'pr'], a 3-gram would give you the tokens ['hel', 'ell', 'llo', 'thi', 'his', 'nic', 'ice'] (tokens shorter than the n-gram size yield no n-grams).
If you use `document` on the other hand, the n-grams are computed over the tokens in the document, giving you:
['"('hello', 'this', 'is')"', '"('this', 'is', 'a')"', '"('is', 'a', 'nice')"', '"('a', 'nice', 'pr')"']
Introduced `apply_padding` for the tokenizer spec; it specifies whether the output should be padded to `max_tokens`. An example spec combining the new options is shown below.
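As an illustration, a possible tokenizer spec using both new options; apart from `ngram_type` and `apply_padding`, the field names are assumptions based on the existing spec format and may differ from the actual implementation:

```json
{
  "algo": "ngram",
  "algo_params": { "min_gram": 3, "max_gram": 3, "ngram_type": "document" },
  "out": "count",
  "id_cols": [1],
  "tokenize_col": 2,
  "apply_padding": true
}
```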