This is an automated email from the ASF dual-hosted git repository.
ssiddiqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/systemds.git
The following commit(s) were added to refs/heads/master by this push:
new 8ef0d5a [SYSTEMDS-2881] DML Tokenizer API DIA project WS2020/21
Closes #1169.
8ef0d5a is described below
commit 8ef0d5afef50aa11384f7af3851f930348b7e539
Author: Markus Reiter-Haas <[email protected]>
AuthorDate: Thu Mar 4 10:42:59 2021 +0100
[SYSTEMDS-2881] DML Tokenizer API
DIA project WS2020/21
Closes #1169.
Co-authored-by: Samuel Kogler <[email protected]>
Co-authored-by: David Froehlich <[email protected]>
---
docs/site/dml-language-reference.md | 56 +++++-
.../java/org/apache/sysds/common/Builtins.java | 1 +
src/main/java/org/apache/sysds/common/Types.java | 2 +-
.../apache/sysds/hops/ParameterizedBuiltinOp.java | 1 +
.../apache/sysds/lops/ParameterizedBuiltin.java | 3 +-
.../org/apache/sysds/parser/DMLTranslator.java | 1 +
.../ParameterizedBuiltinFunctionExpression.java | 19 ++
.../functionobjects/ParameterizedBuiltin.java | 8 +-
.../runtime/instructions/CPInstructionParser.java | 1 +
.../runtime/instructions/SPInstructionParser.java | 1 +
.../cp/ParameterizedBuiltinCPInstruction.java | 16 ++
.../spark/ParameterizedBuiltinSPInstruction.java | 55 +++++-
.../runtime/transform/tokenize/Tokenizer.java | 84 ++++++++
.../transform/tokenize/TokenizerFactory.java | 109 ++++++++++
.../runtime/transform/tokenize/TokenizerPost.java | 33 ++++
.../transform/tokenize/TokenizerPostCount.java | 121 ++++++++++++
.../transform/tokenize/TokenizerPostHash.java | 159 +++++++++++++++
.../transform/tokenize/TokenizerPostPosition.java | 137 +++++++++++++
.../runtime/transform/tokenize/TokenizerPre.java | 29 +++
.../transform/tokenize/TokenizerPreNgram.java | 100 ++++++++++
.../tokenize/TokenizerPreWhitespaceSplit.java | 92 +++++++++
.../test/functions/transform/TokenizeTest.java | 220 +++++++++++++++++++++
.../input/20news/20news_subset_untokenized.csv | 3 +
.../transform/tokenize/TokenizeNgramPosLong.dml | 47 +++++
.../transform/tokenize/TokenizeNgramPosLong.json | 12 ++
.../transform/tokenize/TokenizeNgramPosWide.dml | 47 +++++
.../transform/tokenize/TokenizeNgramPosWide.json | 12 ++
.../transform/tokenize/TokenizeSplitCountLong.dml | 49 +++++
.../transform/tokenize/TokenizeSplitCountLong.json | 9 +
.../transform/tokenize/TokenizeUniHashWide.dml | 46 +++++
.../transform/tokenize/TokenizeUniHashWide.json | 14 ++
31 files changed, 1477 insertions(+), 10 deletions(-)
diff --git a/docs/site/dml-language-reference.md
b/docs/site/dml-language-reference.md
index 27bbbc6..27ffd5d 100644
--- a/docs/site/dml-language-reference.md
+++ b/docs/site/dml-language-reference.md
@@ -2026,15 +2026,22 @@ The following example uses
<code>transformapply()</code> with the input matrix a
### Processing Frames
-The built-in function <code>map()</code> provides support for the lambda
expressions.
-
-**Table F5**: Frame map built-in function
+**Table F5**: Frame processing built-in functions
Function | Description | Parameters | Example
-------- | ----------- | ---------- | -------
-map() | It will execute the given lambda expression on a frame.| Input: (X
<frame>, y <String>) <br/>Output: <frame>. <br/> X is a frame
and <br/>y is a String containing the lambda expression to be executed on frame
X. | X = read("file1", data_type="frame", rows=2, cols=3, format="binary")
<br/> y = "lambda expression" <br/> Z = map(X, y) <br/> # Dimensions of Z =
Dimensions of X; <br/> example: <br/> <code> Z = map(X, "x -> x.charAt(2)")
</code>
+map() | Executes a given lambda expression on a frame. | Input: (X <frame>, y <String>) <br/>Output: <frame>. <br/> X is a frame and <br/>y is a String containing the lambda expression to be executed on frame X. | [map](#map)
+tokenize() | Transforms a frame into a tokenized frame using a JSON specification. Tokenization is valid only for string columns. | Input:<br/> target = <frame> <br/> spec = <json specification> <br/> max_tokens = <int> <br/> Output: <frame> | [tokenize](#tokenize)
+
+#### map
+
+The built-in function <code>map()</code> provides support for lambda expressions.
+
+Simple example
-Example let X =
+    X = read("file1", data_type="frame", rows=2, cols=3, format="binary")
+    y = "lambda expression"
+    Z = map(X, y)
+    # Dimensions of Z = Dimensions of X, for example:
+    Z = map(X, "x -> x.charAt(2)")
+
+Example with data, let X =
# FRAME: nrow = 10, ncol = 1
# C1
@@ -2085,6 +2092,45 @@ print(toString(dist)) </code>
0,600 0,286 0,125 1,000 0,286 0,125 0,125 0,600 0,600 0,000
#
+#### tokenize
+
+Simple example
+
+    X = read("file1", data_type="frame", rows=3, cols=2, format="binary");
+    spec = "{\"algo\": \"split\",\"out\": \"count\",\"id_cols\": [1],\"tokenize_col\": 2}";
+    Y = tokenize(target=X, spec=spec, max_tokens=1000);
+ write(Y, "file2");
+
+Example spec
+
+ {
+ "algo": "split",
+ "out": "count",
+ "id_cols": [1],
+ "tokenize_col": 2
+ }
+
+The frame is tokenized along the `tokenize_col` column, and the `id_cols` are replicated for each token.
+
+The output frame can be converted into a matrix with the transform functions, for instance using `transformencode` with `recode`, followed by `table`.
+Alternatively, for certain output representations, specifying `"format_wide": true` expands the tokens into columns instead of creating new rows.
+
+**Table F6**: Tokenizer Algorithms for `algo` field
+
+Algorithm | Algo Description | Parameters | Spec Example
+-------- | ----------- | ---------- | -------
+split | Splits the text into tokens along whitespace characters. | None | `"{\"algo\": \"split\",\"out\": \"count\",\"out_params\": {\"sort_alpha\": true},\"id_cols\": \[2\],\"tokenize_col\": 3}"`
+ngram | Pre-tokenizes using `split`, then splits the tokens into ngrams. | `min_gram` and `max_gram` specify the length of the ngrams. | `"{\"algo\": \"ngram\",\"algo_params\": {\"min_gram\": 2,\"max_gram\": 3},\"out\": \"position\",\"id_cols\": \[1,2\],\"tokenize_col\": 3}"`
+
+**Table F7**: Output Representations of Tokens for `out` field
+
+Out Representation | Format Description | Parameters | Format Example
+-------- | ----------- | ---------- | -------
+count | Outputs the `id_cols`, the `tokens`, and the number of token `occurrences` per document. | `sort_alpha` specifies whether the tokens are sorted alphanumerically per document. | `id1,id2,token1,3`
+position | Outputs the `id_cols`, the `position` within the document, and the `token`. | None | `id1,id2,1,token1`
+hash | Outputs the `id_cols`, the `index` of non-zero hashes, and the hash `counts`. | `num_features` specifies the number of output features. | `id1,id2,2,64`
+
+
* * *
## Modules
diff --git a/src/main/java/org/apache/sysds/common/Builtins.java
b/src/main/java/org/apache/sysds/common/Builtins.java
index a0c0222..fd9f218 100644
--- a/src/main/java/org/apache/sysds/common/Builtins.java
+++ b/src/main/java/org/apache/sysds/common/Builtins.java
@@ -259,6 +259,7 @@ public enum Builtins {
SCALEAPPLY("scaleApply", true, false),
TIME("time", false),
CVLM("cvlm", true, false),
+ TOKENIZE("tokenize", false, true),
TOSTRING("toString", false, true),
TRANSFORMAPPLY("transformapply", false, true),
TRANSFORMCOLMAP("transformcolmap", false, true),
diff --git a/src/main/java/org/apache/sysds/common/Types.java
b/src/main/java/org/apache/sysds/common/Types.java
index 335d60f..284ba9c 100644
--- a/src/main/java/org/apache/sysds/common/Types.java
+++ b/src/main/java/org/apache/sysds/common/Types.java
@@ -460,7 +460,7 @@ public class Types
INVALID, CDF, INVCDF, GROUPEDAGG, RMEMPTY, REPLACE, REXPAND,
LOWER_TRI, UPPER_TRI,
TRANSFORMAPPLY, TRANSFORMDECODE, TRANSFORMCOLMAP, TRANSFORMMETA,
- TOSTRING, LIST, PARAMSERV
+ TOKENIZE, TOSTRING, LIST, PARAMSERV
}
public enum OpOpDnn {
diff --git a/src/main/java/org/apache/sysds/hops/ParameterizedBuiltinOp.java
b/src/main/java/org/apache/sysds/hops/ParameterizedBuiltinOp.java
index 298fd6a..68128e0 100644
--- a/src/main/java/org/apache/sysds/hops/ParameterizedBuiltinOp.java
+++ b/src/main/java/org/apache/sysds/hops/ParameterizedBuiltinOp.java
@@ -182,6 +182,7 @@ public class ParameterizedBuiltinOp extends
MultiThreadedHop {
case REPLACE:
case LOWER_TRI:
case UPPER_TRI:
+ case TOKENIZE:
case TRANSFORMAPPLY:
case TRANSFORMDECODE:
case TRANSFORMCOLMAP:
diff --git a/src/main/java/org/apache/sysds/lops/ParameterizedBuiltin.java
b/src/main/java/org/apache/sysds/lops/ParameterizedBuiltin.java
index 698e739..154bc54 100644
--- a/src/main/java/org/apache/sysds/lops/ParameterizedBuiltin.java
+++ b/src/main/java/org/apache/sysds/lops/ParameterizedBuiltin.java
@@ -173,7 +173,8 @@ public class ParameterizedBuiltin extends Lop
}
break;
-
+
+ case TOKENIZE:
case TRANSFORMAPPLY:
case TRANSFORMDECODE:
case TRANSFORMCOLMAP:
diff --git a/src/main/java/org/apache/sysds/parser/DMLTranslator.java
b/src/main/java/org/apache/sysds/parser/DMLTranslator.java
index 7e5d063..b050a3b 100644
--- a/src/main/java/org/apache/sysds/parser/DMLTranslator.java
+++ b/src/main/java/org/apache/sysds/parser/DMLTranslator.java
@@ -2003,6 +2003,7 @@ public class DMLTranslator
case REPLACE:
case LOWER_TRI:
case UPPER_TRI:
+ case TOKENIZE:
case TRANSFORMAPPLY:
case TRANSFORMDECODE:
case TRANSFORMCOLMAP:
diff --git
a/src/main/java/org/apache/sysds/parser/ParameterizedBuiltinFunctionExpression.java
b/src/main/java/org/apache/sysds/parser/ParameterizedBuiltinFunctionExpression.java
index 4d111b0..ec731e6 100644
---
a/src/main/java/org/apache/sysds/parser/ParameterizedBuiltinFunctionExpression.java
+++
b/src/main/java/org/apache/sysds/parser/ParameterizedBuiltinFunctionExpression.java
@@ -202,6 +202,10 @@ public class ParameterizedBuiltinFunctionExpression
extends DataIdentifier
case ORDER:
validateOrder(output, conditional);
break;
+
+ case TOKENIZE:
+ validateTokenize(output, conditional);
+ break;
case TRANSFORMAPPLY:
validateTransformApply(output, conditional);
@@ -337,6 +341,21 @@ public class ParameterizedBuiltinFunctionExpression
extends DataIdentifier
}
}
+ private void validateTokenize(DataIdentifier output, boolean
conditional)
+ {
+ //validate data / metadata (recode maps)
+ checkDataType("tokenize", TF_FN_PARAM_DATA, DataType.FRAME,
conditional);
+
+ //validate specification
+ checkDataValueType(false, "tokenize", TF_FN_PARAM_SPEC,
DataType.SCALAR, ValueType.STRING, conditional);
+ validateTransformSpec(TF_FN_PARAM_SPEC, conditional);
+
+ //set output dimensions
+ output.setDataType(DataType.FRAME);
+ output.setValueType(ValueType.STRING);
+ output.setDimensions(-1, -1);
+ }
+
// example: A = transformapply(target=X, meta=M, spec=s)
private void validateTransformApply(DataIdentifier output, boolean
conditional)
{
diff --git
a/src/main/java/org/apache/sysds/runtime/functionobjects/ParameterizedBuiltin.java
b/src/main/java/org/apache/sysds/runtime/functionobjects/ParameterizedBuiltin.java
index 6bea475..c15f6da 100644
---
a/src/main/java/org/apache/sysds/runtime/functionobjects/ParameterizedBuiltin.java
+++
b/src/main/java/org/apache/sysds/runtime/functionobjects/ParameterizedBuiltin.java
@@ -40,11 +40,11 @@ import org.apache.sysds.runtime.util.UtilFunctions;
public class ParameterizedBuiltin extends ValueFunction
{
- private static final long serialVersionUID = -5966242955816522697L;
+ private static final long serialVersionUID = -7987603644903675052L;
public enum ParameterizedBuiltinCode {
CDF, INVCDF, RMEMPTY, REPLACE, REXPAND, LOWER_TRI, UPPER_TRI,
- TRANSFORMAPPLY, TRANSFORMDECODE, PARAMSERV }
+ TOKENIZE, TRANSFORMAPPLY, TRANSFORMDECODE, PARAMSERV }
public enum ProbabilityDistributionCode {
INVALID, NORMAL, EXP, CHISQ, F, T }
@@ -61,6 +61,7 @@ public class ParameterizedBuiltin extends ValueFunction
String2ParameterizedBuiltinCode.put( "lowertri",
ParameterizedBuiltinCode.LOWER_TRI);
String2ParameterizedBuiltinCode.put( "uppertri",
ParameterizedBuiltinCode.UPPER_TRI);
String2ParameterizedBuiltinCode.put( "rexpand",
ParameterizedBuiltinCode.REXPAND);
+ String2ParameterizedBuiltinCode.put( "tokenize",
ParameterizedBuiltinCode.TOKENIZE);
String2ParameterizedBuiltinCode.put( "transformapply",
ParameterizedBuiltinCode.TRANSFORMAPPLY);
String2ParameterizedBuiltinCode.put( "transformdecode",
ParameterizedBuiltinCode.TRANSFORMDECODE);
String2ParameterizedBuiltinCode.put( "paramserv",
ParameterizedBuiltinCode.PARAMSERV);
@@ -172,6 +173,9 @@ public class ParameterizedBuiltin extends ValueFunction
case REXPAND:
return new
ParameterizedBuiltin(ParameterizedBuiltinCode.REXPAND);
+
+ case TOKENIZE:
+ return new
ParameterizedBuiltin(ParameterizedBuiltinCode.TOKENIZE);
case TRANSFORMAPPLY:
return new
ParameterizedBuiltin(ParameterizedBuiltinCode.TRANSFORMAPPLY);
diff --git
a/src/main/java/org/apache/sysds/runtime/instructions/CPInstructionParser.java
b/src/main/java/org/apache/sysds/runtime/instructions/CPInstructionParser.java
index 273e59a..2232fa3 100644
---
a/src/main/java/org/apache/sysds/runtime/instructions/CPInstructionParser.java
+++
b/src/main/java/org/apache/sysds/runtime/instructions/CPInstructionParser.java
@@ -214,6 +214,7 @@ public class CPInstructionParser extends InstructionParser
String2CPInstructionType.put( "uppertri",
CPType.ParameterizedBuiltin);
String2CPInstructionType.put( "rexpand",
CPType.ParameterizedBuiltin);
String2CPInstructionType.put( "toString",
CPType.ParameterizedBuiltin);
+ String2CPInstructionType.put( "tokenize",
CPType.ParameterizedBuiltin);
String2CPInstructionType.put( "transformapply",
CPType.ParameterizedBuiltin);
String2CPInstructionType.put(
"transformdecode",CPType.ParameterizedBuiltin);
String2CPInstructionType.put(
"transformcolmap",CPType.ParameterizedBuiltin);
diff --git
a/src/main/java/org/apache/sysds/runtime/instructions/SPInstructionParser.java
b/src/main/java/org/apache/sysds/runtime/instructions/SPInstructionParser.java
index 4b77bff..1adc79f 100644
---
a/src/main/java/org/apache/sysds/runtime/instructions/SPInstructionParser.java
+++
b/src/main/java/org/apache/sysds/runtime/instructions/SPInstructionParser.java
@@ -266,6 +266,7 @@ public class SPInstructionParser extends InstructionParser
String2SPInstructionType.put( "rexpand",
SPType.ParameterizedBuiltin);
String2SPInstructionType.put( "lowertri",
SPType.ParameterizedBuiltin);
String2SPInstructionType.put( "uppertri",
SPType.ParameterizedBuiltin);
+ String2SPInstructionType.put( "tokenize",
SPType.ParameterizedBuiltin);
String2SPInstructionType.put( "transformapply",
SPType.ParameterizedBuiltin);
String2SPInstructionType.put(
"transformdecode",SPType.ParameterizedBuiltin);
String2SPInstructionType.put(
"transformencode",SPType.MultiReturnBuiltin);
diff --git
a/src/main/java/org/apache/sysds/runtime/instructions/cp/ParameterizedBuiltinCPInstruction.java
b/src/main/java/org/apache/sysds/runtime/instructions/cp/ParameterizedBuiltinCPInstruction.java
index 58bc7b1..34b7211 100644
---
a/src/main/java/org/apache/sysds/runtime/instructions/cp/ParameterizedBuiltinCPInstruction.java
+++
b/src/main/java/org/apache/sysds/runtime/instructions/cp/ParameterizedBuiltinCPInstruction.java
@@ -58,6 +58,8 @@ import
org.apache.sysds.runtime.transform.decode.DecoderFactory;
import org.apache.sysds.runtime.transform.encode.Encoder;
import org.apache.sysds.runtime.transform.encode.EncoderFactory;
import org.apache.sysds.runtime.transform.meta.TfMetaUtils;
+import org.apache.sysds.runtime.transform.tokenize.Tokenizer;
+import org.apache.sysds.runtime.transform.tokenize.TokenizerFactory;
import org.apache.sysds.runtime.util.DataConverter;
public class ParameterizedBuiltinCPInstruction extends
ComputationCPInstruction {
@@ -148,6 +150,7 @@ public class ParameterizedBuiltinCPInstruction extends
ComputationCPInstruction
|| opcode.equals("transformdecode")
|| opcode.equals("transformcolmap")
|| opcode.equals("transformmeta")
+ || opcode.equals("tokenize")
|| opcode.equals("toString")
|| opcode.equals("nvlist")) {
return new ParameterizedBuiltinCPInstruction(null,
paramsMap, out, opcode, str);
@@ -256,6 +259,19 @@ public class ParameterizedBuiltinCPInstruction extends
ComputationCPInstruction
ec.setMatrixOutput(output.getName(), ret);
ec.releaseMatrixInput(params.get("target"));
}
+ else if ( opcode.equalsIgnoreCase("tokenize") ) {
+ //acquire locks
+ FrameBlock data =
ec.getFrameInput(params.get("target"));
+
+ // compute tokenizer
+ Tokenizer tokenizer = TokenizerFactory.createTokenizer(
+ getParameterMap().get("spec"),
Integer.parseInt(getParameterMap().get("max_tokens")));
+ FrameBlock fbout = tokenizer.tokenize(data, new
FrameBlock(tokenizer.getSchema()));
+
+ //release locks
+ ec.setFrameOutput(output.getName(), fbout);
+ ec.releaseFrameInput(params.get("target"));
+ }
else if ( opcode.equalsIgnoreCase("transformapply")) {
//acquire locks
FrameBlock data =
ec.getFrameInput(params.get("target"));
diff --git
a/src/main/java/org/apache/sysds/runtime/instructions/spark/ParameterizedBuiltinSPInstruction.java
b/src/main/java/org/apache/sysds/runtime/instructions/spark/ParameterizedBuiltinSPInstruction.java
index c6f065c..89abc9d 100644
---
a/src/main/java/org/apache/sysds/runtime/instructions/spark/ParameterizedBuiltinSPInstruction.java
+++
b/src/main/java/org/apache/sysds/runtime/instructions/spark/ParameterizedBuiltinSPInstruction.java
@@ -73,6 +73,8 @@ import org.apache.sysds.runtime.transform.encode.Encoder;
import org.apache.sysds.runtime.transform.encode.EncoderFactory;
import org.apache.sysds.runtime.transform.meta.TfMetaUtils;
import org.apache.sysds.runtime.transform.meta.TfOffsetMap;
+import org.apache.sysds.runtime.transform.tokenize.Tokenizer;
+import org.apache.sysds.runtime.transform.tokenize.TokenizerFactory;
import org.apache.sysds.runtime.util.DataConverter;
import org.apache.sysds.runtime.util.UtilFunctions;
import scala.Tuple2;
@@ -156,6 +158,7 @@ public class ParameterizedBuiltinSPInstruction extends
ComputationSPInstruction
|| opcode.equalsIgnoreCase("replace")
|| opcode.equalsIgnoreCase("lowertri")
|| opcode.equalsIgnoreCase("uppertri")
+ || opcode.equalsIgnoreCase("tokenize")
|| opcode.equalsIgnoreCase("transformapply")
|| opcode.equalsIgnoreCase("transformdecode")) {
func =
ParameterizedBuiltin.getParameterizedBuiltinFnObject(opcode);
@@ -432,7 +435,33 @@ public class ParameterizedBuiltinSPInstruction extends
ComputationSPInstruction
//post-processing to obtain sparsity of ultra-sparse
outputs
SparkUtils.postprocessUltraSparseOutput(sec.getMatrixObject(output), mcOut);
}
- else if ( opcode.equalsIgnoreCase("transformapply") )
+ else if ( opcode.equalsIgnoreCase("tokenize") )
+ {
+ //get input RDD data
+ FrameObject fo =
sec.getFrameObject(params.get("target"));
+ JavaPairRDD<Long,FrameBlock> in =
(JavaPairRDD<Long,FrameBlock>)
+ sec.getRDDHandleForFrameObject(fo,
FileFormat.BINARY);
+ DataCharacteristics mc =
sec.getDataCharacteristics(params.get("target"));
+
+ //construct tokenizer and tokenize text
+ Tokenizer tokenizer =
TokenizerFactory.createTokenizer(params.get("spec"),
+
Integer.parseInt(params.get("max_tokens")));
+ JavaPairRDD<Long,FrameBlock> out = in.mapToPair(
+ new RDDTokenizeFunction(tokenizer,
mc.getBlocksize()));
+
+ //set output and maintain lineage/output characteristics
+ sec.setRDDHandleForVariable(output.getName(), out);
+ sec.addLineageRDD(output.getName(),
params.get("target"));
+
+ // get max tokens for row upper bound
+ long numRows = tokenizer.getNumRows(mc.getRows());
+ long numCols = tokenizer.getNumCols();
+
+ sec.getDataCharacteristics(output.getName()).set(
+ numRows, numCols, mc.getBlocksize());
+
sec.getFrameObject(output.getName()).setSchema(tokenizer.getSchema());
+ }
+ else if ( opcode.equalsIgnoreCase("transformapply") )
{
//get input RDD and meta data
FrameObject fo =
sec.getFrameObject(params.get("target"));
@@ -787,6 +816,30 @@ public class ParameterizedBuiltinSPInstruction extends
ComputationSPInstruction
}
}
+ public static class RDDTokenizeFunction implements
PairFunction<Tuple2<Long, FrameBlock>, Long, FrameBlock>
+ {
+ private static final long serialVersionUID =
-8788298032616522019L;
+
+ private Tokenizer _tokenizer = null;
+ private int _blen = -1;
+
+ public RDDTokenizeFunction(Tokenizer tokenizer, int blen) {
+ _tokenizer = tokenizer;
+ _blen = blen;
+ }
+
+ @Override
+ public Tuple2<Long,FrameBlock> call(Tuple2<Long, FrameBlock> in)
+ throws Exception
+ {
+ long key = in._1();
+ FrameBlock blk = in._2();
+
+ FrameBlock fbout = _tokenizer.tokenize(blk, new
FrameBlock(_tokenizer.getSchema()));
+ return new Tuple2<>(key, fbout);
+ }
+ }
+
public static class RDDTransformApplyFunction implements
PairFunction<Tuple2<Long,FrameBlock>,Long,FrameBlock>
{
private static final long serialVersionUID =
5759813006068230916L;
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/Tokenizer.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/Tokenizer.java
new file mode 100644
index 0000000..dd4982a
--- /dev/null
+++ b/src/main/java/org/apache/sysds/runtime/transform/tokenize/Tokenizer.java
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+
+import java.io.Serializable;
+import java.util.List;
+
+public class Tokenizer implements Serializable {
+
+ private static final long serialVersionUID = 7155673772374114577L;
+ protected static final Log LOG =
LogFactory.getLog(Tokenizer.class.getName());
+
+ private final TokenizerPre tokenizerPre;
+ private final TokenizerPost tokenizerPost;
+
+ protected Tokenizer(TokenizerPre tokenizerPre, TokenizerPost
tokenizerPost) {
+
+ this.tokenizerPre = tokenizerPre;
+ this.tokenizerPost = tokenizerPost;
+ }
+
+ public Types.ValueType[] getSchema() {
+ return tokenizerPost.getOutSchema();
+ }
+
+ public long getNumRows(long inRows) {
+ return tokenizerPost.getNumRows(inRows);
+ }
+
+ public long getNumCols() {
+ return tokenizerPost.getNumCols();
+ }
+
+ public FrameBlock tokenize(FrameBlock in, FrameBlock out) {
+ // First convert to internal representation
+ List<DocumentToTokens> documentsToTokenList =
tokenizerPre.tokenizePre(in);
+ // Then convert to output representation
+ return tokenizerPost.tokenizePost(documentsToTokenList, out);
+ }
+
+ static class Token {
+ String textToken;
+ long startIndex;
+ long endIndex;
+
+ public Token(String token, long startIndex) {
+ this.textToken = token;
+ this.startIndex = startIndex;
+ this.endIndex = startIndex + token.length();
+ }
+ }
+
+ static class DocumentToTokens {
+ List<Object> keys;
+ List<Tokenizer.Token> tokens;
+
+ public DocumentToTokens(List<Object> keys, List<Tokenizer.Token>
tokens) {
+ this.keys = keys;
+ this.tokens = tokens;
+ }
+ }
+}
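
For orientation, a minimal sketch of how these classes compose, mirroring the call sequence in ParameterizedBuiltinCPInstruction above; this example is not part of the patch, and the input data and spec values are purely illustrative:

    import org.apache.sysds.common.Types;
    import org.apache.sysds.runtime.matrix.data.FrameBlock;
    import org.apache.sysds.runtime.transform.tokenize.Tokenizer;
    import org.apache.sysds.runtime.transform.tokenize.TokenizerFactory;

    public class TokenizeSketch {
        public static void main(String[] args) {
            // input frame: one id column and one text column (illustrative data)
            FrameBlock in = new FrameBlock(new Types.ValueType[] {
                Types.ValueType.STRING, Types.ValueType.STRING});
            in.appendRow(new Object[] {"doc1", "systemds tokenizes text frames"});

            // whitespace splitting with count output in long format
            // (the factory below accepts "split" and "ngram" as algo values)
            String spec = "{\"algo\": \"split\", \"out\": \"count\","
                + " \"id_cols\": [1], \"tokenize_col\": 2}";

            Tokenizer tokenizer = TokenizerFactory.createTokenizer(spec, 1000);
            FrameBlock out = tokenizer.tokenize(in, new FrameBlock(tokenizer.getSchema()));
            // expected rows of (id, token, count), e.g. (doc1, systemds, 1)
            System.out.println(out.getNumRows() + " token rows");
        }
    }
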
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerFactory.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerFactory.java
new file mode 100644
index 0000000..18c4bff
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerFactory.java
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.runtime.DMLRuntimeException;
+import org.apache.wink.json4j.JSONObject;
+import org.apache.wink.json4j.JSONArray;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class TokenizerFactory {
+
+ public static Tokenizer createTokenizer(String spec, int maxTokens) {
+ Tokenizer tokenizer = null;
+
+ try {
+ //parse transform specification
+ JSONObject jSpec = new JSONObject(spec);
+
+ // tokenization needs an algorithm (with algorithm specific params)
+ String algo = jSpec.getString("algo");
+ JSONObject algoParams = null;
+ if (jSpec.has("algo_params")) {
+ algoParams = jSpec.getJSONObject("algo_params");
+ }
+
+ // tokenization needs an output representation (with
representation specific params)
+ String out = jSpec.getString("out");
+ JSONObject outParams = null;
+ if (jSpec.has("out_params")) {
+ outParams = jSpec.getJSONObject("out_params");
+ }
+
+ // tokenization needs a text column to tokenize
+ int tokenizeCol = jSpec.getInt("tokenize_col");
+
+ // tokenization needs one or more idCols that define the document
and are replicated per token
+ List<Integer> idCols = new ArrayList<>();
+ JSONArray idColsJsonArray = jSpec.getJSONArray("id_cols");
+ for (int i=0; i < idColsJsonArray.length(); i++) {
+ idCols.add(idColsJsonArray.getInt(i));
+ }
+ // Output schema is derived from specified id cols
+ int numIdCols = idCols.size();
+
+ // get difference between long and wide format
+ boolean wideFormat = false; // long format is default
+ if (jSpec.has("format_wide")) {
+ wideFormat = jSpec.getBoolean("format_wide");
+ }
+
+ TokenizerPre tokenizerPre;
+ TokenizerPost tokenizerPost;
+
+ // Note that internal representation should be independent from
output representation
+
+ // Algorithm to transform tokens into internal token representation
+ switch (algo) {
+ case "split":
+ tokenizerPre = new TokenizerPreWhitespaceSplit(idCols,
tokenizeCol, algoParams);
+ break;
+ case "ngram":
+ tokenizerPre = new TokenizerPreNgram(idCols, tokenizeCol,
algoParams);
+ break;
+ default:
+ throw new IllegalArgumentException("Algorithm {algo=" +
algo + "} is not supported.");
+ }
+
+ // Transform tokens to output representation
+ switch (out) {
+ case "count":
+ tokenizerPost = new TokenizerPostCount(outParams,
numIdCols, maxTokens, wideFormat);
+ break;
+ case "position":
+ tokenizerPost = new TokenizerPostPosition(outParams,
numIdCols, maxTokens, wideFormat);
+ break;
+ case "hash":
+ tokenizerPost = new TokenizerPostHash(outParams,
numIdCols, maxTokens, wideFormat);
+ break;
+ default:
+ throw new IllegalArgumentException("Output representation
{out=" + out + "} is not supported.");
+ }
+
+ tokenizer = new Tokenizer(tokenizerPre, tokenizerPost);
+ }
+ catch(Exception ex) {
+ throw new DMLRuntimeException(ex);
+ }
+ return tokenizer;
+ }
+}
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPost.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPost.java
new file mode 100644
index 0000000..5f35c89
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPost.java
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+
+import java.io.Serializable;
+import java.util.List;
+
+public interface TokenizerPost extends Serializable {
+ FrameBlock tokenizePost(List<Tokenizer.DocumentToTokens> tl, FrameBlock
out);
+ Types.ValueType[] getOutSchema();
+ long getNumRows(long inRows);
+ long getNumCols();
+}
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostCount.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostCount.java
new file mode 100644
index 0000000..f1f9e81
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostCount.java
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.util.UtilFunctions;
+import org.apache.wink.json4j.JSONException;
+import org.apache.wink.json4j.JSONObject;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+public class TokenizerPostCount implements TokenizerPost{
+
+ private static final long serialVersionUID = 6382000606237705019L;
+ private final Params params;
+ private final int numIdCols;
+ private final int maxTokens;
+ private final boolean wideFormat;
+
+ static class Params implements Serializable {
+
+ private static final long serialVersionUID = 5121697674346781880L;
+
+ public boolean sort_alpha = false;
+
+ public Params(JSONObject json) throws JSONException {
+ if (json != null && json.has("sort_alpha")) {
+ this.sort_alpha = json.getBoolean("sort_alpha");
+ }
+ }
+ }
+
+ public TokenizerPostCount(JSONObject params, int numIdCols, int maxTokens,
boolean wideFormat) throws JSONException {
+ this.params = new Params(params);
+ this.numIdCols = numIdCols;
+ this.maxTokens = maxTokens;
+ this.wideFormat = wideFormat;
+ }
+
+ @Override
+ public FrameBlock tokenizePost(List<Tokenizer.DocumentToTokens> tl,
FrameBlock out) {
+ for (Tokenizer.DocumentToTokens docToToken: tl) {
+ List<Object> keys = docToToken.keys;
+ List<Tokenizer.Token> tokenList = docToToken.tokens;
+ // Creating the counts for BoW
+ Map<String, Long> tokenCounts =
tokenList.stream().collect(Collectors.groupingBy(token ->
+ token.textToken, Collectors.counting()));
+ // Remove duplicate strings
+ Stream<String> distinctTokenStream = tokenList.stream().map(token
-> token.textToken).distinct();
+ if (params.sort_alpha) {
+ // Sort alphabetically
+ distinctTokenStream = distinctTokenStream.sorted();
+ }
+ List<String> outputTokens =
distinctTokenStream.collect(Collectors.toList());
+
+ int numTokens = 0;
+ for (String token: outputTokens) {
+ if (numTokens >= maxTokens) {
+ break;
+ }
+ // Create a row per token
+ long count = tokenCounts.get(token);
+ List<Object> rowList = new ArrayList<>(keys);
+ rowList.add(token);
+ rowList.add(count);
+ Object[] row = new Object[rowList.size()];
+ rowList.toArray(row);
+ out.appendRow(row);
+ numTokens++;
+ }
+ }
+
+ return out;
+ }
+
+ @Override
+ public Types.ValueType[] getOutSchema() {
+ if (wideFormat) {
+ throw new IllegalArgumentException("Wide Format is not supported
for Count Representation.");
+ }
+ // Long format only depends on numIdCols
+ Types.ValueType[] schema = UtilFunctions.nCopies(numIdCols +
2,Types.ValueType.STRING );
+ schema[numIdCols + 1] = Types.ValueType.INT64;
+ return schema;
+ }
+
+ public long getNumRows(long inRows) {
+ if (wideFormat) {
+ return inRows;
+ } else {
+ return inRows * maxTokens;
+ }
+ }
+
+ public long getNumCols() {
+ return this.getOutSchema().length;
+ }
+}
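
The count representation above only supports the long format (the wide path throws). A hedged variant of the earlier sketch that enables the optional sort_alpha flag; it reuses the input frame `in` and the imports from the sketch after Tokenizer.java, and all values are illustrative:

    // illustrative only: alphabetically sorted bag-of-words counts per document
    String spec = "{\"algo\": \"split\", \"out\": \"count\","
        + " \"out_params\": {\"sort_alpha\": true},"
        + " \"id_cols\": [1], \"tokenize_col\": 2}";
    Tokenizer tokenizer = TokenizerFactory.createTokenizer(spec, 1000);
    // output schema: id columns and token as STRING, the count as INT64
    FrameBlock counts = tokenizer.tokenize(in, new FrameBlock(tokenizer.getSchema()));
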
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostHash.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostHash.java
new file mode 100644
index 0000000..f19bdb8
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostHash.java
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.util.UtilFunctions;
+import org.apache.wink.json4j.JSONException;
+import org.apache.wink.json4j.JSONObject;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.TreeMap;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+public class TokenizerPostHash implements TokenizerPost{
+
+ private static final long serialVersionUID = 4763889041868044668L;
+ private final Params params;
+ private final int numIdCols;
+ private final int maxTokens;
+ private final boolean wideFormat;
+
+ static class Params implements Serializable {
+
+ private static final long serialVersionUID = -256069061414241795L;
+
+ public int num_features = 1048576; // 2^20
+
+ public Params(JSONObject json) throws JSONException {
+ if (json != null && json.has("num_features")) {
+ this.num_features = json.getInt("num_features");
+ }
+ }
+ }
+
+ public TokenizerPostHash(JSONObject params, int numIdCols, int maxTokens,
boolean wideFormat) throws JSONException {
+ this.params = new Params(params);
+ this.numIdCols = numIdCols;
+ this.maxTokens = maxTokens;
+ this.wideFormat = wideFormat;
+ }
+
+ @Override
+ public FrameBlock tokenizePost(List<Tokenizer.DocumentToTokens> tl,
FrameBlock out) {
+ for (Tokenizer.DocumentToTokens docToToken: tl) {
+ List<Object> keys = docToToken.keys;
+ List<Tokenizer.Token> tokenList = docToToken.tokens;
+ // Transform to hashes
+ List<Integer> hashList = tokenList.stream().map(token ->
token.textToken.hashCode() %
+ params.num_features).collect(Collectors.toList());
+ // Counting the hashes
+ Map<Integer, Long> hashCounts =
hashList.stream().collect(Collectors.groupingBy(Function.identity(),
+ Collectors.counting()));
+ // Sorted by hash
+ Map<Integer, Long> sortedHashes = new TreeMap<>(hashCounts);
+
+ if (wideFormat) {
+ this.appendTokensWide(keys, sortedHashes, out);
+ } else {
+ this.appendTokensLong(keys, sortedHashes, out);
+ }
+ }
+
+ return out;
+ }
+
+ private void appendTokensLong(List<Object> keys, Map<Integer, Long>
sortedHashes, FrameBlock out) {
+ int numTokens = 0;
+ for (Map.Entry<Integer, Long> hashCount: sortedHashes.entrySet()) {
+ if (numTokens >= maxTokens) {
+ break;
+ }
+ // Create a row per token
+ int hash = hashCount.getKey() + 1;
+ long count = hashCount.getValue();
+ List<Object> rowList = new ArrayList<>(keys);
+ rowList.add((long) hash);
+ rowList.add(count);
+ Object[] row = new Object[rowList.size()];
+ rowList.toArray(row);
+ out.appendRow(row);
+ numTokens++;
+ }
+ }
+
+ private void appendTokensWide(List<Object> keys, Map<Integer, Long>
sortedHashes, FrameBlock out) {
+ // Create one row with keys as prefix
+ List<Object> rowList = new ArrayList<>(keys);
+
+ for (int tokenPos = 0; tokenPos < maxTokens; tokenPos++) {
+ long positionHash = sortedHashes.getOrDefault(tokenPos, 0L);
+ rowList.add(positionHash);
+ }
+ Object[] row = new Object[rowList.size()];
+ rowList.toArray(row);
+ out.appendRow(row);
+ }
+
+ @Override
+ public Types.ValueType[] getOutSchema() {
+ if (wideFormat) {
+ return getOutSchemaWide(numIdCols, maxTokens);
+ } else {
+ return getOutSchemaLong(numIdCols);
+ }
+ }
+
+ private Types.ValueType[] getOutSchemaWide(int numIdCols, int maxTokens) {
+ Types.ValueType[] schema = new Types.ValueType[numIdCols + maxTokens];
+ int i = 0;
+ for (; i < numIdCols; i++) {
+ schema[i] = Types.ValueType.STRING;
+ }
+ for (int j = 0; j < maxTokens; j++, i++) {
+ schema[i] = Types.ValueType.INT64;
+ }
+ return schema;
+ }
+
+ private Types.ValueType[] getOutSchemaLong(int numIdCols) {
+ Types.ValueType[] schema = UtilFunctions.nCopies(numIdCols +
2,Types.ValueType.STRING );
+ schema[numIdCols] = Types.ValueType.INT64;
+ schema[numIdCols+1] = Types.ValueType.INT64;
+ return schema;
+ }
+
+ public long getNumRows(long inRows) {
+ if (wideFormat) {
+ return inRows;
+ } else {
+ return inRows * maxTokens;
+ }
+ }
+
+ public long getNumCols() {
+ return this.getOutSchema().length;
+ }
+}
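
For the hash representation, a hedged sketch of the wide format (again reusing `in` from the first sketch; values are illustrative): with "format_wide": true each document becomes a single row of id columns followed by max_tokens INT64 columns, whereas the long format emits one (id_cols, hash, count) row per distinct hash value:

    // illustrative only: hashing with 5 features, one wide row per document
    String spec = "{\"algo\": \"split\", \"out\": \"hash\","
        + " \"out_params\": {\"num_features\": 5},"
        + " \"format_wide\": true, \"id_cols\": [1], \"tokenize_col\": 2}";
    Tokenizer tokenizer = TokenizerFactory.createTokenizer(spec, 5);
    // resulting schema: 1 STRING id column followed by 5 INT64 columns
    FrameBlock hashed = tokenizer.tokenize(in, new FrameBlock(tokenizer.getSchema()));
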
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostPosition.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostPosition.java
new file mode 100644
index 0000000..4451f08
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostPosition.java
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+
+import org.apache.sysds.runtime.util.UtilFunctions;
+import org.apache.wink.json4j.JSONObject;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class TokenizerPostPosition implements TokenizerPost{
+
+ private static final long serialVersionUID = 3563407270742660830L;
+ private final int numIdCols;
+ private final int maxTokens;
+ private final boolean wideFormat;
+
+ public TokenizerPostPosition(JSONObject params, int numIdCols, int
maxTokens, boolean wideFormat) {
+ // No configurable params yet
+ this.numIdCols = numIdCols;
+ this.maxTokens = maxTokens;
+ this.wideFormat = wideFormat;
+ }
+
+ @Override
+ public FrameBlock tokenizePost(List<Tokenizer.DocumentToTokens> tl,
FrameBlock out) {
+ for (Tokenizer.DocumentToTokens docToToken: tl) {
+ List<Object> keys = docToToken.keys;
+ List<Tokenizer.Token> tokenList = docToToken.tokens;
+
+ if (wideFormat) {
+ this.appendTokensWide(keys, tokenList, out);
+ } else {
+ this.appendTokensLong(keys, tokenList, out);
+ }
+ }
+
+ return out;
+ }
+
+ public void appendTokensLong(List<Object> keys, List<Tokenizer.Token>
tokenList, FrameBlock out) {
+ int numTokens = 0;
+ for (Tokenizer.Token token: tokenList) {
+ if (numTokens >= maxTokens) {
+ break;
+ }
+ // Create a row per token
+ List<Object> rowList = new ArrayList<>(keys);
+ // Convert to 1-based index for DML
+ rowList.add(token.startIndex + 1);
+ rowList.add(token.textToken);
+ Object[] row = new Object[rowList.size()];
+ rowList.toArray(row);
+ out.appendRow(row);
+ numTokens++;
+ }
+ }
+
+ public void appendTokensWide(List<Object> keys, List<Tokenizer.Token>
tokenList, FrameBlock out) {
+ // Create one row with keys as prefix
+ List<Object> rowList = new ArrayList<>(keys);
+
+ int numTokens = 0;
+ for (Tokenizer.Token token: tokenList) {
+ if (numTokens >= maxTokens) {
+ break;
+ }
+ rowList.add(token.textToken);
+ numTokens++;
+ }
+ // Remaining positions need to be filled with empty tokens
+ for (; numTokens < maxTokens; numTokens++) {
+ rowList.add("");
+ }
+ Object[] row = new Object[rowList.size()];
+ rowList.toArray(row);
+ out.appendRow(row);
+ }
+
+ @Override
+ public Types.ValueType[] getOutSchema() {
+ if (wideFormat) {
+ return getOutSchemaWide(numIdCols, maxTokens);
+ } else {
+ return getOutSchemaLong(numIdCols);
+ }
+
+ }
+
+ private Types.ValueType[] getOutSchemaWide(int numIdCols, int maxTokens) {
+ Types.ValueType[] schema = UtilFunctions.nCopies(numIdCols +
maxTokens,Types.ValueType.STRING );
+ return schema;
+ }
+
+ private Types.ValueType[] getOutSchemaLong(int numIdCols) {
+ Types.ValueType[] schema = new Types.ValueType[numIdCols + 2];
+ int i = 0;
+ for (; i < numIdCols; i++) {
+ schema[i] = Types.ValueType.STRING;
+ }
+ schema[i] = Types.ValueType.INT64;
+ schema[i+1] = Types.ValueType.STRING;
+ return schema;
+ }
+
+ public long getNumRows(long inRows) {
+ if (wideFormat) {
+ return inRows;
+ } else {
+ return inRows * maxTokens;
+ }
+ }
+
+ public long getNumCols() {
+ return this.getOutSchema().length;
+ }
+}
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPre.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPre.java
new file mode 100644
index 0000000..640bb5b
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPre.java
@@ -0,0 +1,29 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+
+import java.io.Serializable;
+import java.util.List;
+
+public interface TokenizerPre extends Serializable {
+ List<Tokenizer.DocumentToTokens> tokenizePre(FrameBlock in);
+}
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPreNgram.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPreNgram.java
new file mode 100644
index 0000000..a602c2b
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPreNgram.java
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.wink.json4j.JSONException;
+import org.apache.wink.json4j.JSONObject;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+
+public class TokenizerPreNgram implements TokenizerPre {
+
+ private static final long serialVersionUID = -6297904316677723802L;
+
+ private final TokenizerPreWhitespaceSplit tokenizerPreWhitespaceSplit;
+ private final Params params;
+
+ static class Params implements Serializable {
+
+ private static final long serialVersionUID = -6516419749810062677L;
+
+ public int minGram = 1;
+ public int maxGram = 2;
+
+ public Params(JSONObject json) throws JSONException {
+ if (json != null && json.has("min_gram")) {
+ this.minGram = json.getInt("min_gram");
+ }
+ if (json != null && json.has("max_gram")) {
+ this.maxGram = json.getInt("max_gram");
+ }
+ }
+ }
+
+ public TokenizerPreNgram(List<Integer> idCols, int tokenizeCol, JSONObject
params) throws JSONException {
+ this.tokenizerPreWhitespaceSplit = new
TokenizerPreWhitespaceSplit(idCols, tokenizeCol, params);
+ this.params = new Params(params);
+ }
+
+ public List<Tokenizer.Token> wordTokenToNgrams(Tokenizer.Token wordTokens)
{
+ List<Tokenizer.Token> ngramTokens = new ArrayList<>();
+
+ int tokenLen = wordTokens.textToken.length();
+ int startPos = params.minGram - params.maxGram;
+ int endPos = Math.max(tokenLen - params.minGram, startPos);
+
+ for (int i = startPos; i <= endPos; i++) {
+ int startSlice = Math.max(i, 0);
+ int endSlice = Math.min(i + params.maxGram, tokenLen);
+ String substring = wordTokens.textToken.substring(startSlice,
endSlice);
+ long tokenStart = wordTokens.startIndex + startSlice;
+ ngramTokens.add(new Tokenizer.Token(substring, tokenStart));
+ }
+
+ return ngramTokens;
+ }
+
+ public List<Tokenizer.Token> wordTokenListToNgrams(List<Tokenizer.Token>
wordTokens) {
+ List<Tokenizer.Token> ngramTokens = new ArrayList<>();
+
+ for (Tokenizer.Token wordToken: wordTokens) {
+ List<Tokenizer.Token> ngramTokensForWord =
wordTokenToNgrams(wordToken);
+ ngramTokens.addAll(ngramTokensForWord);
+ }
+ return ngramTokens;
+ }
+
+ @Override
+ public List<Tokenizer.DocumentToTokens> tokenizePre(FrameBlock in) {
+ List<Tokenizer.DocumentToTokens> docToWordTokens =
tokenizerPreWhitespaceSplit.tokenizePre(in);
+
+ List<Tokenizer.DocumentToTokens> docToNgramTokens = new ArrayList<>();
+ for (Tokenizer.DocumentToTokens docToTokens: docToWordTokens) {
+ List<Object> keys = docToTokens.keys;
+ List<Tokenizer.Token> wordTokens = docToTokens.tokens;
+ List<Tokenizer.Token> ngramTokens =
wordTokenListToNgrams(wordTokens);
+ docToNgramTokens.add(new Tokenizer.DocumentToTokens(keys,
ngramTokens));
+ }
+ return docToNgramTokens;
+ }
+}
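
To make the index arithmetic in wordTokenToNgrams concrete, a short trace (editor's sketch, not part of the patch) of the window bounds for the word "text" with min_gram=2 and max_gram=3; it can be run inside any main method:

    // trace of the ngram window used in TokenizerPreNgram.wordTokenToNgrams
    String token = "text";
    int minGram = 2, maxGram = 3;
    int startPos = minGram - maxGram;                          // -1
    int endPos = Math.max(token.length() - minGram, startPos); //  2
    for (int i = startPos; i <= endPos; i++) {
        int startSlice = Math.max(i, 0);
        int endSlice = Math.min(i + maxGram, token.length());
        // prints: te, tex, ext, xt (grams of length min_gram..max_gram)
        System.out.println(token.substring(startSlice, endSlice));
    }
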
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPreWhitespaceSplit.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPreWhitespaceSplit.java
new file mode 100644
index 0000000..2653fc0
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPreWhitespaceSplit.java
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.wink.json4j.JSONException;
+import org.apache.wink.json4j.JSONObject;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+public class TokenizerPreWhitespaceSplit implements TokenizerPre {
+
+ private static final long serialVersionUID = 539127244034913364L;
+
+ private final Params params;
+
+ private final List<Integer> idCols;
+ private final int tokenizeCol;
+
+ static class Params implements Serializable {
+
+ private static final long serialVersionUID = -4368552847660442628L;
+
+ public String regex = "\\s+"; // whitespace
+
+ public Params(JSONObject json) throws JSONException {
+ if (json != null && json.has("regex")) {
+ this.regex = json.getString("regex");
+ }
+ }
+ }
+
+ public TokenizerPreWhitespaceSplit(List<Integer> idCols, int tokenizeCol,
JSONObject params) throws JSONException {
+ this.idCols = idCols;
+ this.tokenizeCol = tokenizeCol;
+ this.params = new Params(params);
+ }
+
+ public List<Tokenizer.Token> splitToTokens(String text) {
+ List<Tokenizer.Token> tokenList = new ArrayList<>();
+ String[] textTokens = text.split(params.regex);
+ int curIndex = 0;
+ for(String textToken: textTokens) {
+ int tokenIndex = text.indexOf(textToken, curIndex);
+ curIndex = tokenIndex;
+ tokenList.add(new Tokenizer.Token(textToken, tokenIndex));
+ }
+ return tokenList;
+ }
+
+ @Override
+ public List<Tokenizer.DocumentToTokens> tokenizePre(FrameBlock in) {
+ List<Tokenizer.DocumentToTokens> documentsToTokenList = new
ArrayList<>();
+
+ Iterator<String[]> iterator = in.getStringRowIterator();
+ iterator.forEachRemaining(s -> {
+ // Convert index value to Java (0-based) from DML (1-based)
+ String text = s[tokenizeCol - 1];
+ List<Object> keys = new ArrayList<>();
+ for (Integer idCol: idCols) {
+ Object key = s[idCol - 1];
+ keys.add(key);
+ }
+
+ // Transform to Bag format internally
+ List<Tokenizer.Token> tokenList = splitToTokens(text);
+ documentsToTokenList.add(new Tokenizer.DocumentToTokens(keys,
tokenList));
+ });
+
+ return documentsToTokenList;
+ }
+}
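
Note that this pre-tokenizer also honors an optional regex in algo_params (defaulting to "\s+"), which the documentation table above does not list. A hedged example spec using it, with illustrative values:

    // illustrative only: split on commas instead of whitespace
    String spec = "{\"algo\": \"split\","
        + " \"algo_params\": {\"regex\": \",\"},"
        + " \"out\": \"position\", \"id_cols\": [1], \"tokenize_col\": 2}";
    Tokenizer tokenizer = TokenizerFactory.createTokenizer(spec, 1000);
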
diff --git
a/src/test/java/org/apache/sysds/test/functions/transform/TokenizeTest.java
b/src/test/java/org/apache/sysds/test/functions/transform/TokenizeTest.java
new file mode 100644
index 0000000..89d339b
--- /dev/null
+++ b/src/test/java/org/apache/sysds/test/functions/transform/TokenizeTest.java
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.transform;
+
+import org.apache.sysds.api.DMLScript;
+import org.apache.sysds.runtime.io.*;
+import org.junit.Test;
+import org.apache.sysds.common.Types.ExecMode;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.util.DataConverter;
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.apache.sysds.test.TestUtils;
+
+
+public class TokenizeTest extends AutomatedTestBase
+{
+ private static final String TEST_DIR = "functions/transform/";
+ private static final String TEST_CLASS_DIR = TEST_DIR +
TokenizeTest.class.getSimpleName() + "/";
+
+ private static final String TEST_SPLIT_COUNT_LONG =
"tokenize/TokenizeSplitCountLong";
+ private static final String TEST_NGRAM_POS_LONG =
"tokenize/TokenizeNgramPosLong";
+ private static final String TEST_NGRAM_POS_WIDE =
"tokenize/TokenizeNgramPosWide";
+ private static final String TEST_UNI_HASH_WIDE =
"tokenize/TokenizeUniHashWide";
+
+ //dataset and transform tasks without missing values
+ private final static String DATASET =
"20news/20news_subset_untokenized.csv";
+
+ @Override
+ public void setUp() {
+ TestUtils.clearAssertionInformation();
+ addTestConfiguration(TEST_SPLIT_COUNT_LONG,
+ new TestConfiguration(TEST_CLASS_DIR, TEST_SPLIT_COUNT_LONG,
new String[] { "R" }) );
+ addTestConfiguration(TEST_NGRAM_POS_LONG,
+ new TestConfiguration(TEST_CLASS_DIR, TEST_NGRAM_POS_LONG, new
String[] { "R" }) );
+ addTestConfiguration(TEST_NGRAM_POS_WIDE,
+ new TestConfiguration(TEST_CLASS_DIR, TEST_NGRAM_POS_WIDE, new
String[] { "R" }) );
+ addTestConfiguration(TEST_UNI_HASH_WIDE,
+ new TestConfiguration(TEST_CLASS_DIR, TEST_UNI_HASH_WIDE, new
String[] { "R" }) );
+ }
+
+ @Test
+ public void testTokenizeSingleNodeSplitCountLong() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_SPLIT_COUNT_LONG,false);
+ }
+
+ @Test
+ public void testTokenizeSparkSplitCountLong() {
+
+ runTokenizeTest(ExecMode.SPARK, TEST_SPLIT_COUNT_LONG, false);
+ }
+
+ @Test
+ public void testTokenizeHybridSplitCountLong() {
+
+
+ runTokenizeTest(ExecMode.HYBRID, TEST_SPLIT_COUNT_LONG, false);
+ }
+
+ @Test
+ public void testTokenizeParReadSingleNodeSplitCountLong() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_SPLIT_COUNT_LONG, true);
+ }
+
+ @Test
+ public void testTokenizeParReadSparkSplitCountLong() {
+ runTokenizeTest(ExecMode.SPARK, TEST_SPLIT_COUNT_LONG, true);
+ }
+
+ @Test
+ public void testTokenizeParReadHybridSplitCountLong() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_SPLIT_COUNT_LONG, true);
+ }
+
+ @Test
+ public void testTokenizeSingleNodeNgramPosLong() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_NGRAM_POS_LONG,false);
+ }
+
+ // Long format on spark execution fails with: Number of non-zeros mismatch
on merge disjoint
+// @Test
+// public void testTokenizeSparkNgramPosLong() {
+// runTokenizeTest(ExecMode.SPARK, TEST_NGRAM_POS_LONG, false);
+// }
+
+ @Test
+ public void testTokenizeHybridNgramPosLong() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_NGRAM_POS_LONG, false);
+ }
+
+ @Test
+ public void testTokenizeParReadSingleNodeNgramPosLong() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_NGRAM_POS_LONG, true);
+ }
+
+ // Long format on spark execution fails with: Number of non-zeros mismatch
on merge disjoint
+// @Test
+// public void testTokenizeParReadSparkNgramPosLong() {
+// runTokenizeTest(ExecMode.SPARK, TEST_NGRAM_POS_LONG, true);
+// }
+
+ @Test
+ public void testTokenizeParReadHybridNgramPosLong() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_NGRAM_POS_LONG, true);
+ }
+
+ @Test
+ public void testTokenizeSingleNodeNgramPosWide() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_NGRAM_POS_WIDE,false);
+ }
+
+ @Test
+ public void testTokenizeSparkNgramPosWide() {
+ runTokenizeTest(ExecMode.SPARK, TEST_NGRAM_POS_WIDE, false);
+ }
+
+ @Test
+ public void testTokenizeHybridNgramPosWide() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_NGRAM_POS_WIDE, false);
+ }
+
+ @Test
+ public void testTokenizeParReadSingleNodeNgramPosWide() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_NGRAM_POS_WIDE, true);
+ }
+
+ @Test
+ public void testTokenizeParReadSparkNgramPosWide() {
+ runTokenizeTest(ExecMode.SPARK, TEST_NGRAM_POS_WIDE, true);
+ }
+
+ @Test
+ public void testTokenizeParReadHybridNgramPosWide() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_NGRAM_POS_WIDE, true);
+ }
+
+ @Test
+ public void testTokenizeSingleNodeUniHashWide() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_UNI_HASH_WIDE, false);
+ }
+
+ @Test
+ public void testTokenizeSparkUniHashWide() {
+ runTokenizeTest(ExecMode.SPARK, TEST_UNI_HASH_WIDE, false);
+ }
+
+ @Test
+ public void testTokenizeHybridUniHashWide() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_UNI_HASH_WIDE, false);
+ }
+
+ @Test
+ public void testTokenizeParReadSingleNodeUniHashWide() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_UNI_HASH_WIDE, true);
+ }
+
+ @Test
+ public void testTokenizeParReadSparkUniHashWide() {
+ runTokenizeTest(ExecMode.SPARK, TEST_UNI_HASH_WIDE, true);
+ }
+
+ @Test
+ public void testTokenizeParReadHybridUniHashWide() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_UNI_HASH_WIDE, true);
+ }
+
+ private void runTokenizeTest(ExecMode rt, String test_name, boolean parRead)
+ {
+ //set runtime platform
+ ExecMode rtold = rtplatform;
+ rtplatform = rt;
+
+ boolean sparkConfigOld = DMLScript.USE_LOCAL_SPARK_CONFIG;
+ if( rtplatform == ExecMode.SPARK || rtplatform == ExecMode.HYBRID)
+ DMLScript.USE_LOCAL_SPARK_CONFIG = true;
+
+ try
+ {
+ getAndLoadTestConfiguration(test_name);
+
+ String HOME = SCRIPT_DIR + TEST_DIR;
+ fullDMLScriptName = HOME + test_name + ".dml";
+ programArgs = new String[]{"-stats","-args",
+ HOME + "input/" + DATASET, HOME + test_name + ".json",
output("R") };
+
+ runTest(true, false, null, -1);
+
+ //read the tokenized output frame and print it for inspection
+ FrameReader reader2 = parRead ?
+ new FrameReaderTextCSVParallel( new FileFormatPropertiesCSV() ) :
+ new FrameReaderTextCSV( new FileFormatPropertiesCSV() );
+ FrameBlock fb2 = reader2.readFrameFromHDFS(output("R"), -1L, -1L);
+ System.out.println(DataConverter.toString(fb2));
+ }
+ catch(Exception ex) {
+ throw new RuntimeException(ex);
+ }
+ finally {
+ rtplatform = rtold;
+ DMLScript.USE_LOCAL_SPARK_CONFIG = sparkConfigOld;
+ }
+ }
+}
diff --git a/src/test/scripts/functions/transform/input/20news/20news_subset_untokenized.csv b/src/test/scripts/functions/transform/input/20news/20news_subset_untokenized.csv
new file mode 100644
index 0000000..f622607
--- /dev/null
+++ b/src/test/scripts/functions/transform/input/20news/20news_subset_untokenized.csv
@@ -0,0 +1,3 @@
+20news-bydate-test,alt.atheism,53068,From: [email protected]
(dean.kaflowitz) Subject: Re: about the bible quiz answers Organization: AT&T
Distribution: na Lines: 18 In article <[email protected]>
[email protected] (Tammy R Healy) writes: > > > #12) The 2 cheribums are
on the Ark of the Covenant. When God said make no > graven image he was
refering to idols which were created to be worshipped. > The Ark of the
Covenant wasn't wrodhipped and only the [...]
+20news-bydate-test,alt.atheism,53257,From: [email protected] (Chris Faehl)
Subject: Re: Amusing atheists and agnostics Organization: University of New
Mexico Albuquerque Lines: 88 Distribution: world NNTP-Posting-Host:
vesta.unm.edu In article <timmbake.735265296@mcl> [email protected] (
Clam Bake Timmons) writes: > > >Fallacy #1: Atheism is a faith. Lo! I hear
the FAQ beckoning once again... > >[wonderful Rule #3 deleted - you're correct
you didn't say anything >about > >a [...]
+20news-bydate-test,alt.atheism,53260,From: mathew <[email protected]>
Subject: Re: Yet more Rushdie [Re: ISLAMIC LAW] Organization: Mantis
Consultants Cambridge. UK. X-Newsreader: rusnews v1.02 Lines: 50
[email protected] (Gregg Jaeger) writes: >In article <[email protected]>
[email protected] (Robert >Beauchaine) writes: >>Bennett Neil. How BCCI
adapted the Koran rules of banking . The >>Times. August 13 1991. > > So
let's see. If some guy writes a piece with a [...]
\ No newline at end of file
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosLong.dml b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosLong.dml
new file mode 100644
index 0000000..400c4f0
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosLong.dml
@@ -0,0 +1,47 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+F1 = read($1, data_type="frame", format="csv", sep=",");
+
+# Example Preprocessing:
+F1[,4] = map(F1[,4], "x -> x.toLowerCase()");
+
+# Example to estimate max tokens based on string length
+# max_token = as.integer(max(as.matrix(map(F1[,4], "x -> x.length()"))));
+# print(max_token);
+
+max_token = 2000;
+
+# Example spec:
+# jspec = "{\"algo\": \"whitespace\",\"out\": \"count\",\"id_col\":
[2],\"tokenize_col\": 3}";
+
+jspec = read($2, data_type="scalar", value_type="string");
+
+# If slicing is performed, only the remaining columns are considered for the tokenizer spec:
+F2 = tokenize(target=F1[,2:4], spec=jspec, max_tokens=max_token);
+write(F2, $3, format="csv");
+
+# Followup spec to transform tokens
+jspec2 = "{\"ids\": true, \"recode\": [1,2,3,4]}";
+F2 = F2[,1:4];
+
+# Afterward, you can transform it into a matrix:
+[X, M] = transformencode(target=F2, spec=jspec2);
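
A quick way to eyeball the tokenizer output of the script above is to print the dimensions of F2 before encoding it. A minimal DML sketch (hypothetical addition, not part of the commit):

  # hypothetical sanity check: size of the tokenized frame F2 from above
  print("tokenized frame: " + nrow(F2) + " x " + ncol(F2));
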
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosLong.json b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosLong.json
new file mode 100644
index 0000000..b2f522f
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosLong.json
@@ -0,0 +1,12 @@
+{
+ "algo": "ngram",
+ "algo_params": {
+ "min_gram": 2,
+ "max_gram": 3,
+ "regex": "\\\\W+"
+ },
+ "out": "position",
+ "format_wide": false,
+ "id_cols": [1,2],
+ "tokenize_col": 3
+}
\ No newline at end of file
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosWide.dml b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosWide.dml
new file mode 100644
index 0000000..400c4f0
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosWide.dml
@@ -0,0 +1,47 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+F1 = read($1, data_type="frame", format="csv", sep=",");
+
+# Example Preprocessing:
+F1[,4] = map(F1[,4], "x -> x.toLowerCase()");
+
+# Example to estimate max tokens based on string length
+# max_token = as.integer(max(as.matrix(map(F1[,4], "x -> x.length()"))));
+# print(max_token);
+
+max_token = 2000;
+
+# Example spec:
+# jspec = "{\"algo\": \"whitespace\",\"out\": \"count\",\"id_col\":
[2],\"tokenize_col\": 3}";
+
+jspec = read($2, data_type="scalar", value_type="string");
+
+# If slicing is performed, only the remaining columns are considered for the tokenizer spec:
+F2 = tokenize(target=F1[,2:4], spec=jspec, max_tokens=max_token);
+write(F2, $3, format="csv");
+
+# Followup spec to transform tokens
+jspec2 = "{\"ids\": true, \"recode\": [1,2,3,4]}";
+F2 = F2[,1:4];
+
+# Afterward, you can transform it into a matrix:
+[X, M] = transformencode(target=F2, spec=jspec2);
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosWide.json b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosWide.json
new file mode 100644
index 0000000..8087389
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosWide.json
@@ -0,0 +1,12 @@
+{
+ "algo": "ngram",
+ "algo_params": {
+ "min_gram": 2,
+ "max_gram": 3,
+ "regex": "\\\\W+"
+ },
+ "out": "position",
+ "format_wide": true,
+ "id_cols": [1,2],
+ "tokenize_col": 3
+}
\ No newline at end of file
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeSplitCountLong.dml b/src/test/scripts/functions/transform/tokenize/TokenizeSplitCountLong.dml
new file mode 100644
index 0000000..6e7226c
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeSplitCountLong.dml
@@ -0,0 +1,49 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+F1 = read($1, data_type="frame", format="csv", sep=",");
+
+# Example Preprocessing:
+F1[,4] = map(F1[,4], "x -> x.toLowerCase()");
+
+# Example to estimate max tokens based on string length
+# max_token = as.integer(max(as.matrix(map(F1[,4], "x -> x.length()"))));
+# print(max_token);
+
+max_token = 2000;
+
+# Example spec:
+# jspec = "{\"algo\": \"whitespace\",\"out\": \"count\",\"id_col\":
[2],\"tokenize_col\": 3}";
+
+jspec = read($2, data_type="scalar", value_type="string");
+
+# If slicing is performed, only the remaining columns are considered for the tokenizer spec:
+F2 = tokenize(target=F1[,2:4], spec=jspec, max_tokens=max_token);
+write(F2, $3, format="csv");
+
+# Followup spec to transform tokens
+jspec2 = "{\"ids\": true, \"recode\": [1,2]}";
+
+# Afterward, you can transform it into a matrix:
+[X, M] = transformencode(target=F2, spec=jspec2);
+# If the output format is long, table() can be applied afterwards to build a count matrix:
+X = table(X[,1], X[,2], X[,3]);
+
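
As a possible continuation of the script above (hypothetical, not part of the commit), the document-by-term count matrix X built by table() could be row-normalized into term frequencies; the rowSums() vector is broadcast element-wise across the columns of X:

  # hypothetical follow-up: term-frequency normalization of the count matrix X
  TF = X / (rowSums(X) + 1e-16);   # small epsilon guards against all-zero rows
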
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeSplitCountLong.json b/src/test/scripts/functions/transform/tokenize/TokenizeSplitCountLong.json
new file mode 100644
index 0000000..201ff17
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeSplitCountLong.json
@@ -0,0 +1,9 @@
+{
+ "algo": "split",
+ "out": "count",
+ "out_params": {
+ "sort_alpha": true
+ },
+ "id_cols": [2],
+ "tokenize_col": 3
+}
\ No newline at end of file
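
Since the spec files are read into the scripts as plain strings, the same configuration can also be inlined in DML, following the pattern of the commented "Example spec" lines in the scripts above, with the key/value pairs taken from TokenizeSplitCountLong.json. A minimal sketch (hypothetical; it assumes F1 has already been read as in the scripts):

  jspec = "{\"algo\": \"split\", \"out\": \"count\", \"out_params\": {\"sort_alpha\": true}, \"id_cols\": [2], \"tokenize_col\": 3}";
  F2 = tokenize(target=F1[,2:4], spec=jspec, max_tokens=2000);
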
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeUniHashWide.dml b/src/test/scripts/functions/transform/tokenize/TokenizeUniHashWide.dml
new file mode 100644
index 0000000..0061e22
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeUniHashWide.dml
@@ -0,0 +1,46 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+F1 = read($1, data_type="frame", format="csv", sep=",");
+
+# Example Preprocessing:
+F1[,4] = map(F1[,4], "x -> x.toLowerCase()");
+
+# Example to estimate max tokens based on string length
+# max_token = as.integer(max(as.matrix(map(F1[,4], "x -> x.length()"))));
+# print(max_token);
+
+max_token = 2000;
+
+# Example spec:
+# jspec = "{\"algo\": \"whitespace\",\"out\": \"count\",\"id_col\":
[2],\"tokenize_col\": 3}";
+
+jspec = read($2, data_type="scalar", value_type="string");
+
+# If slicing is performed, only the remaining columns are considered for the tokenizer spec:
+F2 = tokenize(target=F1[,2:4], spec=jspec, max_tokens=max_token);
+write(F2, $3, format="csv");
+
+# Follow-up spec to recode the id tokens; the hashed feature columns are left as-is
+jspec2 = "{\"ids\": true, \"recode\": [1,2]}";
+
+# Afterward, you can transform it into a matrix:
+[X, M] = transformencode(target=F2, spec=jspec2);
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeUniHashWide.json b/src/test/scripts/functions/transform/tokenize/TokenizeUniHashWide.json
new file mode 100644
index 0000000..f87e873
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeUniHashWide.json
@@ -0,0 +1,14 @@
+{
+ "algo": "ngram",
+ "algo_params": {
+ "min_gram": 1,
+ "max_gram": 3
+ },
+ "out": "hash",
+ "out_params": {
+ "num_features": 128
+ },
+ "format_wide": true,
+ "id_cols": [2,1],
+ "tokenize_col": 3
+}
\ No newline at end of file
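
For the wide hash output configured above (out "hash" with num_features 128, format_wide true), the encoded matrix X from TokenizeUniHashWide.dml presumably holds the two recoded id columns followed by the hashed token counts. A minimal DML sketch of a per-document total (hypothetical; the id-columns-first layout is an assumption, not verified here):

  # hypothetical follow-up: total hashed-token count per row of X,
  # assuming columns 1-2 hold the recoded ids
  totals = rowSums(X[, 3:ncol(X)]);
  print(toString(totals));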