This is an automated email from the ASF dual-hosted git repository.
ssiddiqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/systemds.git
The following commit(s) were added to refs/heads/master by this push:
new 8ef0d5a [SYSTEMDS-2881] DML Tokenizer API DIA project WS2020/21
Closes #1169.
8ef0d5a is described below
commit 8ef0d5afef50aa11384f7af3851f930348b7e539
Author: Markus Reiter-Haas <[email protected]>
AuthorDate: Thu Mar 4 10:42:59 2021 +0100
[SYSTEMDS-2881] DML Tokenizer API
DIA project WS2020/21
Closes #1169.
Co-authored-by: Samuel Kogler <[email protected]>
Co-authored-by: David Froehlich <[email protected]>
---
docs/site/dml-language-reference.md | 56 +++++-
.../java/org/apache/sysds/common/Builtins.java | 1 +
src/main/java/org/apache/sysds/common/Types.java | 2 +-
.../apache/sysds/hops/ParameterizedBuiltinOp.java | 1 +
.../apache/sysds/lops/ParameterizedBuiltin.java | 3 +-
.../org/apache/sysds/parser/DMLTranslator.java | 1 +
.../ParameterizedBuiltinFunctionExpression.java | 19 ++
.../functionobjects/ParameterizedBuiltin.java | 8 +-
.../runtime/instructions/CPInstructionParser.java | 1 +
.../runtime/instructions/SPInstructionParser.java | 1 +
.../cp/ParameterizedBuiltinCPInstruction.java | 16 ++
.../spark/ParameterizedBuiltinSPInstruction.java | 55 +++++-
.../runtime/transform/tokenize/Tokenizer.java | 84 ++++++++
.../transform/tokenize/TokenizerFactory.java | 109 ++++++++++
.../runtime/transform/tokenize/TokenizerPost.java | 33 ++++
.../transform/tokenize/TokenizerPostCount.java | 121 ++++++++++++
.../transform/tokenize/TokenizerPostHash.java | 159 +++++++++++++++
.../transform/tokenize/TokenizerPostPosition.java | 137 +++++++++++++
.../runtime/transform/tokenize/TokenizerPre.java | 29 +++
.../transform/tokenize/TokenizerPreNgram.java | 100 ++++++++++
.../tokenize/TokenizerPreWhitespaceSplit.java | 92 +++++++++
.../test/functions/transform/TokenizeTest.java | 220 +++++++++++++++++++++
.../input/20news/20news_subset_untokenized.csv | 3 +
.../transform/tokenize/TokenizeNgramPosLong.dml | 47 +++++
.../transform/tokenize/TokenizeNgramPosLong.json | 12 ++
.../transform/tokenize/TokenizeNgramPosWide.dml | 47 +++++
.../transform/tokenize/TokenizeNgramPosWide.json | 12 ++
.../transform/tokenize/TokenizeSplitCountLong.dml | 49 +++++
.../transform/tokenize/TokenizeSplitCountLong.json | 9 +
.../transform/tokenize/TokenizeUniHashWide.dml | 46 +++++
.../transform/tokenize/TokenizeUniHashWide.json | 14 ++
31 files changed, 1477 insertions(+), 10 deletions(-)
diff --git a/docs/site/dml-language-reference.md
b/docs/site/dml-language-reference.md
index 27bbbc6..27ffd5d 100644
--- a/docs/site/dml-language-reference.md
+++ b/docs/site/dml-language-reference.md
@@ -2026,15 +2026,22 @@ The following example uses
<code>transformapply()</code> with the input matrix a
### Processing Frames
-The built-in function <code>map()</code> provides support for the lambda
expressions.
-
-**Table F5**: Frame map built-in function
+**Table F5**: Frame processing built-in functions
Function | Description | Parameters | Example
-------- | ----------- | ---------- | -------
-map() | It will execute the given lambda expression on a frame.| Input: (X
<frame>, y <String>) <br/>Output: <frame>. <br/> X is a frame
and <br/>y is a String containing the lambda expression to be executed on frame
X. | X = read("file1", data_type="frame", rows=2, cols=3, format="binary")
<br/> y = "lambda expression" <br/> Z = map(X, y) <br/> # Dimensions of Z =
Dimensions of X; <br/> example: <br/> <code> Z = map(X, "x -> x.charAt(2)")
</code>
+map() | Executes a given lambda expression on a frame. | Input: (X <frame>, y <String>) <br/>Output: <frame>. <br/> X is a frame and <br/>y is a String containing the lambda expression to be executed on frame X. | [map](#map)
+tokenize() | Transforms a frame into a tokenized frame using a JSON specification. Tokenization is valid only for string columns. | Input:<br/> target = <frame> <br/> spec = <json specification> <br/> max_tokens = <int> <br/> Output: <frame> | [tokenize](#tokenize)
+
+#### map
+
+The built-in function <code>map()</code> provides support for lambda expressions.
+
+Simple example
-Example let X =
+    X = read("file1", data_type="frame", rows=2, cols=3, format="binary")
+    y = "lambda expression"
+    Z = map(X, y)
+    # Dimensions of Z = Dimensions of X, for example:
+    Z = map(X, "x -> x.charAt(2)")
+
+Example with data, let X =
# FRAME: nrow = 10, ncol = 1
# C1
@@ -2085,6 +2092,45 @@ print(toString(dist)) </code>
0,600 0,286 0,125 1,000 0,286 0,125 0,125 0,600 0,600 0,000
#
+#### tokenize
+
+Simple example
+
+    X = read("file1", data_type="frame", rows=3, cols=2, format="binary");
+    spec = "{\"algo\": \"split\",\"out\": \"count\",\"id_cols\": [1],\"tokenize_col\": 2}";
+    Y = tokenize(target=X, spec=spec, max_tokens=1000);
+ write(Y, "file2");
+
+Example spec
+
+ {
+ "algo": "split",
+ "out": "count",
+ "id_cols": [1],
+ "tokenize_col": 2
+ }
+
+The frame is tokenized along the `tokenize_col` column, and the `id_cols` are replicated for each token.
+
+The output frame can be converted into a matrix with the transform functions, for instance using `transformencode` with `recode`, followed by `table`.
+Alternatively, for certain output representations, specifying `"format_wide": true` expands the tokens into columns instead of creating new rows.
+
+**Table F6**: Tokenizer Algorithms for `algo` field
+
+Algorithm | Algo Description | Parameters | Spec Example
+-------- | ----------- | ---------- | -------
+split | Splits the text into tokens along whitespace characters. | None | `"{\"algo\": \"split\",\"out\": \"count\",\"out_params\": {\"sort_alpha\": true},\"id_cols\": \[2\],\"tokenize_col\": 3}"`
+ngram | Pre-tokenizes using `split`, then splits the tokens into ngrams. | `min_gram` and `max_gram` specify the length of the ngrams. | `"{\"algo\": \"ngram\",\"algo_params\": {\"min_gram\": 2,\"max_gram\": 3},\"out\": \"position\",\"id_cols\": \[1,2\],\"tokenize_col\": 3}"`
+
+**Table F7**: Output Representations of Tokens for `out` field
+
+Out Representation | Format Description | Parameters | Format Example
+-------- | ----------- | ---------- | -------
+count | Outputs the `id_cols`, the `tokens`, and the number of token `occurrences` per document. | `sort_alpha` specifies whether the tokens are sorted alphanumerically per document. | `id1,id2,token1,3`
+position | Outputs the `id_cols`, the `position` within the document, and the `token`. | None | `id1,id2,1,token1`
+hash | Outputs the `id_cols`, the `index` of non-zero hashes, and the hash `counts`. | `num_features` specifies the number of output features. | `id1,id2,2,64`
+
+
* * *
## Modules
diff --git a/src/main/java/org/apache/sysds/common/Builtins.java
b/src/main/java/org/apache/sysds/common/Builtins.java
index a0c0222..fd9f218 100644
--- a/src/main/java/org/apache/sysds/common/Builtins.java
+++ b/src/main/java/org/apache/sysds/common/Builtins.java
@@ -259,6 +259,7 @@ public enum Builtins {
SCALEAPPLY("scaleApply", true, false),
TIME("time", false),
CVLM("cvlm", true, false),
+ TOKENIZE("tokenize", false, true),
TOSTRING("toString", false, true),
TRANSFORMAPPLY("transformapply", false, true),
TRANSFORMCOLMAP("transformcolmap", false, true),
diff --git a/src/main/java/org/apache/sysds/common/Types.java
b/src/main/java/org/apache/sysds/common/Types.java
index 335d60f..284ba9c 100644
--- a/src/main/java/org/apache/sysds/common/Types.java
+++ b/src/main/java/org/apache/sysds/common/Types.java
@@ -460,7 +460,7 @@ public class Types
INVALID, CDF, INVCDF, GROUPEDAGG, RMEMPTY, REPLACE, REXPAND,
LOWER_TRI, UPPER_TRI,
TRANSFORMAPPLY, TRANSFORMDECODE, TRANSFORMCOLMAP, TRANSFORMMETA,
- TOSTRING, LIST, PARAMSERV
+ TOKENIZE, TOSTRING, LIST, PARAMSERV
}
public enum OpOpDnn {
diff --git a/src/main/java/org/apache/sysds/hops/ParameterizedBuiltinOp.java
b/src/main/java/org/apache/sysds/hops/ParameterizedBuiltinOp.java
index 298fd6a..68128e0 100644
--- a/src/main/java/org/apache/sysds/hops/ParameterizedBuiltinOp.java
+++ b/src/main/java/org/apache/sysds/hops/ParameterizedBuiltinOp.java
@@ -182,6 +182,7 @@ public class ParameterizedBuiltinOp extends
MultiThreadedHop {
case REPLACE:
case LOWER_TRI:
case UPPER_TRI:
+ case TOKENIZE:
case TRANSFORMAPPLY:
case TRANSFORMDECODE:
case TRANSFORMCOLMAP:
diff --git a/src/main/java/org/apache/sysds/lops/ParameterizedBuiltin.java
b/src/main/java/org/apache/sysds/lops/ParameterizedBuiltin.java
index 698e739..154bc54 100644
--- a/src/main/java/org/apache/sysds/lops/ParameterizedBuiltin.java
+++ b/src/main/java/org/apache/sysds/lops/ParameterizedBuiltin.java
@@ -173,7 +173,8 @@ public class ParameterizedBuiltin extends Lop
}
break;
-
+
+ case TOKENIZE:
case TRANSFORMAPPLY:
case TRANSFORMDECODE:
case TRANSFORMCOLMAP:
diff --git a/src/main/java/org/apache/sysds/parser/DMLTranslator.java
b/src/main/java/org/apache/sysds/parser/DMLTranslator.java
index 7e5d063..b050a3b 100644
--- a/src/main/java/org/apache/sysds/parser/DMLTranslator.java
+++ b/src/main/java/org/apache/sysds/parser/DMLTranslator.java
@@ -2003,6 +2003,7 @@ public class DMLTranslator
case REPLACE:
case LOWER_TRI:
case UPPER_TRI:
+ case TOKENIZE:
case TRANSFORMAPPLY:
case TRANSFORMDECODE:
case TRANSFORMCOLMAP:
diff --git
a/src/main/java/org/apache/sysds/parser/ParameterizedBuiltinFunctionExpression.java
b/src/main/java/org/apache/sysds/parser/ParameterizedBuiltinFunctionExpression.java
index 4d111b0..ec731e6 100644
---
a/src/main/java/org/apache/sysds/parser/ParameterizedBuiltinFunctionExpression.java
+++
b/src/main/java/org/apache/sysds/parser/ParameterizedBuiltinFunctionExpression.java
@@ -202,6 +202,10 @@ public class ParameterizedBuiltinFunctionExpression
extends DataIdentifier
case ORDER:
validateOrder(output, conditional);
break;
+
+ case TOKENIZE:
+ validateTokenize(output, conditional);
+ break;
case TRANSFORMAPPLY:
validateTransformApply(output, conditional);
@@ -337,6 +341,21 @@ public class ParameterizedBuiltinFunctionExpression
extends DataIdentifier
}
}
+ private void validateTokenize(DataIdentifier output, boolean
conditional)
+ {
+ //validate data / metadata (recode maps)
+ checkDataType("tokenize", TF_FN_PARAM_DATA, DataType.FRAME,
conditional);
+
+ //validate specification
+ checkDataValueType(false, "tokenize", TF_FN_PARAM_SPEC,
DataType.SCALAR, ValueType.STRING, conditional);
+ validateTransformSpec(TF_FN_PARAM_SPEC, conditional);
+
+ //set output dimensions
+ output.setDataType(DataType.FRAME);
+ output.setValueType(ValueType.STRING);
+ output.setDimensions(-1, -1);
+ }
+
// example: A = transformapply(target=X, meta=M, spec=s)
private void validateTransformApply(DataIdentifier output, boolean
conditional)
{
diff --git
a/src/main/java/org/apache/sysds/runtime/functionobjects/ParameterizedBuiltin.java
b/src/main/java/org/apache/sysds/runtime/functionobjects/ParameterizedBuiltin.java
index 6bea475..c15f6da 100644
---
a/src/main/java/org/apache/sysds/runtime/functionobjects/ParameterizedBuiltin.java
+++
b/src/main/java/org/apache/sysds/runtime/functionobjects/ParameterizedBuiltin.java
@@ -40,11 +40,11 @@ import org.apache.sysds.runtime.util.UtilFunctions;
public class ParameterizedBuiltin extends ValueFunction
{
- private static final long serialVersionUID = -5966242955816522697L;
+ private static final long serialVersionUID = -7987603644903675052L;
public enum ParameterizedBuiltinCode {
CDF, INVCDF, RMEMPTY, REPLACE, REXPAND, LOWER_TRI, UPPER_TRI,
- TRANSFORMAPPLY, TRANSFORMDECODE, PARAMSERV }
+ TOKENIZE, TRANSFORMAPPLY, TRANSFORMDECODE, PARAMSERV }
public enum ProbabilityDistributionCode {
INVALID, NORMAL, EXP, CHISQ, F, T }
@@ -61,6 +61,7 @@ public class ParameterizedBuiltin extends ValueFunction
String2ParameterizedBuiltinCode.put( "lowertri",
ParameterizedBuiltinCode.LOWER_TRI);
String2ParameterizedBuiltinCode.put( "uppertri",
ParameterizedBuiltinCode.UPPER_TRI);
String2ParameterizedBuiltinCode.put( "rexpand",
ParameterizedBuiltinCode.REXPAND);
+ String2ParameterizedBuiltinCode.put( "tokenize",
ParameterizedBuiltinCode.TOKENIZE);
String2ParameterizedBuiltinCode.put( "transformapply",
ParameterizedBuiltinCode.TRANSFORMAPPLY);
String2ParameterizedBuiltinCode.put( "transformdecode",
ParameterizedBuiltinCode.TRANSFORMDECODE);
String2ParameterizedBuiltinCode.put( "paramserv",
ParameterizedBuiltinCode.PARAMSERV);
@@ -172,6 +173,9 @@ public class ParameterizedBuiltin extends ValueFunction
case REXPAND:
return new
ParameterizedBuiltin(ParameterizedBuiltinCode.REXPAND);
+
+ case TOKENIZE:
+ return new
ParameterizedBuiltin(ParameterizedBuiltinCode.TOKENIZE);
case TRANSFORMAPPLY:
return new
ParameterizedBuiltin(ParameterizedBuiltinCode.TRANSFORMAPPLY);
diff --git
a/src/main/java/org/apache/sysds/runtime/instructions/CPInstructionParser.java
b/src/main/java/org/apache/sysds/runtime/instructions/CPInstructionParser.java
index 273e59a..2232fa3 100644
---
a/src/main/java/org/apache/sysds/runtime/instructions/CPInstructionParser.java
+++
b/src/main/java/org/apache/sysds/runtime/instructions/CPInstructionParser.java
@@ -214,6 +214,7 @@ public class CPInstructionParser extends InstructionParser
String2CPInstructionType.put( "uppertri",
CPType.ParameterizedBuiltin);
String2CPInstructionType.put( "rexpand",
CPType.ParameterizedBuiltin);
String2CPInstructionType.put( "toString",
CPType.ParameterizedBuiltin);
+ String2CPInstructionType.put( "tokenize",
CPType.ParameterizedBuiltin);
String2CPInstructionType.put( "transformapply",
CPType.ParameterizedBuiltin);
String2CPInstructionType.put(
"transformdecode",CPType.ParameterizedBuiltin);
String2CPInstructionType.put(
"transformcolmap",CPType.ParameterizedBuiltin);
diff --git
a/src/main/java/org/apache/sysds/runtime/instructions/SPInstructionParser.java
b/src/main/java/org/apache/sysds/runtime/instructions/SPInstructionParser.java
index 4b77bff..1adc79f 100644
---
a/src/main/java/org/apache/sysds/runtime/instructions/SPInstructionParser.java
+++
b/src/main/java/org/apache/sysds/runtime/instructions/SPInstructionParser.java
@@ -266,6 +266,7 @@ public class SPInstructionParser extends InstructionParser
String2SPInstructionType.put( "rexpand",
SPType.ParameterizedBuiltin);
String2SPInstructionType.put( "lowertri",
SPType.ParameterizedBuiltin);
String2SPInstructionType.put( "uppertri",
SPType.ParameterizedBuiltin);
+ String2SPInstructionType.put( "tokenize",
SPType.ParameterizedBuiltin);
String2SPInstructionType.put( "transformapply",
SPType.ParameterizedBuiltin);
String2SPInstructionType.put(
"transformdecode",SPType.ParameterizedBuiltin);
String2SPInstructionType.put(
"transformencode",SPType.MultiReturnBuiltin);
diff --git
a/src/main/java/org/apache/sysds/runtime/instructions/cp/ParameterizedBuiltinCPInstruction.java
b/src/main/java/org/apache/sysds/runtime/instructions/cp/ParameterizedBuiltinCPInstruction.java
index 58bc7b1..34b7211 100644
---
a/src/main/java/org/apache/sysds/runtime/instructions/cp/ParameterizedBuiltinCPInstruction.java
+++
b/src/main/java/org/apache/sysds/runtime/instructions/cp/ParameterizedBuiltinCPInstruction.java
@@ -58,6 +58,8 @@ import
org.apache.sysds.runtime.transform.decode.DecoderFactory;
import org.apache.sysds.runtime.transform.encode.Encoder;
import org.apache.sysds.runtime.transform.encode.EncoderFactory;
import org.apache.sysds.runtime.transform.meta.TfMetaUtils;
+import org.apache.sysds.runtime.transform.tokenize.Tokenizer;
+import org.apache.sysds.runtime.transform.tokenize.TokenizerFactory;
import org.apache.sysds.runtime.util.DataConverter;
public class ParameterizedBuiltinCPInstruction extends
ComputationCPInstruction {
@@ -148,6 +150,7 @@ public class ParameterizedBuiltinCPInstruction extends
ComputationCPInstruction
|| opcode.equals("transformdecode")
|| opcode.equals("transformcolmap")
|| opcode.equals("transformmeta")
+ || opcode.equals("tokenize")
|| opcode.equals("toString")
|| opcode.equals("nvlist")) {
return new ParameterizedBuiltinCPInstruction(null,
paramsMap, out, opcode, str);
@@ -256,6 +259,19 @@ public class ParameterizedBuiltinCPInstruction extends
ComputationCPInstruction
ec.setMatrixOutput(output.getName(), ret);
ec.releaseMatrixInput(params.get("target"));
}
+ else if ( opcode.equalsIgnoreCase("tokenize") ) {
+ //acquire locks
+ FrameBlock data =
ec.getFrameInput(params.get("target"));
+
+ // compute tokenizer
+ Tokenizer tokenizer = TokenizerFactory.createTokenizer(
+ getParameterMap().get("spec"),
Integer.parseInt(getParameterMap().get("max_tokens")));
+ FrameBlock fbout = tokenizer.tokenize(data, new
FrameBlock(tokenizer.getSchema()));
+
+ //release locks
+ ec.setFrameOutput(output.getName(), fbout);
+ ec.releaseFrameInput(params.get("target"));
+ }
else if ( opcode.equalsIgnoreCase("transformapply")) {
//acquire locks
FrameBlock data =
ec.getFrameInput(params.get("target"));
diff --git
a/src/main/java/org/apache/sysds/runtime/instructions/spark/ParameterizedBuiltinSPInstruction.java
b/src/main/java/org/apache/sysds/runtime/instructions/spark/ParameterizedBuiltinSPInstruction.java
index c6f065c..89abc9d 100644
---
a/src/main/java/org/apache/sysds/runtime/instructions/spark/ParameterizedBuiltinSPInstruction.java
+++
b/src/main/java/org/apache/sysds/runtime/instructions/spark/ParameterizedBuiltinSPInstruction.java
@@ -73,6 +73,8 @@ import org.apache.sysds.runtime.transform.encode.Encoder;
import org.apache.sysds.runtime.transform.encode.EncoderFactory;
import org.apache.sysds.runtime.transform.meta.TfMetaUtils;
import org.apache.sysds.runtime.transform.meta.TfOffsetMap;
+import org.apache.sysds.runtime.transform.tokenize.Tokenizer;
+import org.apache.sysds.runtime.transform.tokenize.TokenizerFactory;
import org.apache.sysds.runtime.util.DataConverter;
import org.apache.sysds.runtime.util.UtilFunctions;
import scala.Tuple2;
@@ -156,6 +158,7 @@ public class ParameterizedBuiltinSPInstruction extends
ComputationSPInstruction
|| opcode.equalsIgnoreCase("replace")
|| opcode.equalsIgnoreCase("lowertri")
|| opcode.equalsIgnoreCase("uppertri")
+ || opcode.equalsIgnoreCase("tokenize")
|| opcode.equalsIgnoreCase("transformapply")
|| opcode.equalsIgnoreCase("transformdecode")) {
func =
ParameterizedBuiltin.getParameterizedBuiltinFnObject(opcode);
@@ -432,7 +435,33 @@ public class ParameterizedBuiltinSPInstruction extends
ComputationSPInstruction
//post-processing to obtain sparsity of ultra-sparse
outputs
SparkUtils.postprocessUltraSparseOutput(sec.getMatrixObject(output), mcOut);
}
- else if ( opcode.equalsIgnoreCase("transformapply") )
+ else if ( opcode.equalsIgnoreCase("tokenize") )
+ {
+ //get input RDD data
+ FrameObject fo =
sec.getFrameObject(params.get("target"));
+ JavaPairRDD<Long,FrameBlock> in =
(JavaPairRDD<Long,FrameBlock>)
+ sec.getRDDHandleForFrameObject(fo,
FileFormat.BINARY);
+ DataCharacteristics mc =
sec.getDataCharacteristics(params.get("target"));
+
+ //construct tokenizer and tokenize text
+ Tokenizer tokenizer =
TokenizerFactory.createTokenizer(params.get("spec"),
+
Integer.parseInt(params.get("max_tokens")));
+ JavaPairRDD<Long,FrameBlock> out = in.mapToPair(
+ new RDDTokenizeFunction(tokenizer,
mc.getBlocksize()));
+
+ //set output and maintain lineage/output characteristics
+ sec.setRDDHandleForVariable(output.getName(), out);
+ sec.addLineageRDD(output.getName(),
params.get("target"));
+
+ // get max tokens for row upper bound
+ long numRows = tokenizer.getNumRows(mc.getRows());
+ long numCols = tokenizer.getNumCols();
+
+ sec.getDataCharacteristics(output.getName()).set(
+ numRows, numCols, mc.getBlocksize());
+
sec.getFrameObject(output.getName()).setSchema(tokenizer.getSchema());
+ }
+ else if ( opcode.equalsIgnoreCase("transformapply") )
{
//get input RDD and meta data
FrameObject fo =
sec.getFrameObject(params.get("target"));
@@ -787,6 +816,30 @@ public class ParameterizedBuiltinSPInstruction extends
ComputationSPInstruction
}
}
+ public static class RDDTokenizeFunction implements
PairFunction<Tuple2<Long, FrameBlock>, Long, FrameBlock>
+ {
+ private static final long serialVersionUID =
-8788298032616522019L;
+
+ private Tokenizer _tokenizer = null;
+ private int _blen = -1;
+
+ public RDDTokenizeFunction(Tokenizer tokenizer, int blen) {
+ _tokenizer = tokenizer;
+ _blen = blen;
+ }
+
+ @Override
+ public Tuple2<Long,FrameBlock> call(Tuple2<Long, FrameBlock> in)
+ throws Exception
+ {
+ long key = in._1();
+ FrameBlock blk = in._2();
+
+ FrameBlock fbout = _tokenizer.tokenize(blk, new
FrameBlock(_tokenizer.getSchema()));
+ return new Tuple2<>(key, fbout);
+ }
+ }
+
public static class RDDTransformApplyFunction implements
PairFunction<Tuple2<Long,FrameBlock>,Long,FrameBlock>
{
private static final long serialVersionUID =
5759813006068230916L;
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/Tokenizer.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/Tokenizer.java
new file mode 100644
index 0000000..dd4982a
--- /dev/null
+++ b/src/main/java/org/apache/sysds/runtime/transform/tokenize/Tokenizer.java
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+
+import java.io.Serializable;
+import java.util.List;
+
+public class Tokenizer implements Serializable {
+
+ private static final long serialVersionUID = 7155673772374114577L;
+ protected static final Log LOG =
LogFactory.getLog(Tokenizer.class.getName());
+
+ private final TokenizerPre tokenizerPre;
+ private final TokenizerPost tokenizerPost;
+
+ protected Tokenizer(TokenizerPre tokenizerPre, TokenizerPost
tokenizerPost) {
+
+ this.tokenizerPre = tokenizerPre;
+ this.tokenizerPost = tokenizerPost;
+ }
+
+ public Types.ValueType[] getSchema() {
+ return tokenizerPost.getOutSchema();
+ }
+
+ public long getNumRows(long inRows) {
+ return tokenizerPost.getNumRows(inRows);
+ }
+
+ public long getNumCols() {
+ return tokenizerPost.getNumCols();
+ }
+
+ public FrameBlock tokenize(FrameBlock in, FrameBlock out) {
+ // First convert to internal representation
+ List<DocumentToTokens> documentsToTokenList =
tokenizerPre.tokenizePre(in);
+ // Then convert to output representation
+ return tokenizerPost.tokenizePost(documentsToTokenList, out);
+ }
+
+ static class Token {
+ String textToken;
+ long startIndex;
+ long endIndex;
+
+ public Token(String token, long startIndex) {
+ this.textToken = token;
+ this.startIndex = startIndex;
+ this.endIndex = startIndex + token.length();
+ }
+ }
+
+ static class DocumentToTokens {
+ List<Object> keys;
+ List<Tokenizer.Token> tokens;
+
+ public DocumentToTokens(List<Object> keys, List<Tokenizer.Token>
tokens) {
+ this.keys = keys;
+ this.tokens = tokens;
+ }
+ }
+}
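
For orientation, a minimal sketch of how these classes compose, mirroring the call sequence in ParameterizedBuiltinCPInstruction above; this example is not part of the patch, and the input data and spec values are purely illustrative:

    import org.apache.sysds.common.Types;
    import org.apache.sysds.runtime.matrix.data.FrameBlock;
    import org.apache.sysds.runtime.transform.tokenize.Tokenizer;
    import org.apache.sysds.runtime.transform.tokenize.TokenizerFactory;

    public class TokenizeSketch {
        public static void main(String[] args) {
            // input frame: one id column and one text column (illustrative data)
            FrameBlock in = new FrameBlock(new Types.ValueType[] {
                Types.ValueType.STRING, Types.ValueType.STRING});
            in.appendRow(new Object[] {"doc1", "systemds tokenizes text frames"});

            // whitespace splitting with count output in long format
            // (the factory below accepts "split" and "ngram" as algo values)
            String spec = "{\"algo\": \"split\", \"out\": \"count\","
                + " \"id_cols\": [1], \"tokenize_col\": 2}";

            Tokenizer tokenizer = TokenizerFactory.createTokenizer(spec, 1000);
            FrameBlock out = tokenizer.tokenize(in, new FrameBlock(tokenizer.getSchema()));
            // expected rows of (id, token, count), e.g. (doc1, systemds, 1)
            System.out.println(out.getNumRows() + " token rows");
        }
    }
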
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerFactory.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerFactory.java
new file mode 100644
index 0000000..18c4bff
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerFactory.java
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.runtime.DMLRuntimeException;
+import org.apache.wink.json4j.JSONObject;
+import org.apache.wink.json4j.JSONArray;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class TokenizerFactory {
+
+ public static Tokenizer createTokenizer(String spec, int maxTokens) {
+ Tokenizer tokenizer = null;
+
+ try {
+ //parse transform specification
+ JSONObject jSpec = new JSONObject(spec);
+
+ // tokenization needs an algorithm (with algorithm specific params)
+ String algo = jSpec.getString("algo");
+ JSONObject algoParams = null;
+ if (jSpec.has("algo_params")) {
+ algoParams = jSpec.getJSONObject("algo_params");
+ }
+
+ // tokenization needs an output representation (with
representation specific params)
+ String out = jSpec.getString("out");
+ JSONObject outParams = null;
+ if (jSpec.has("out_params")) {
+ outParams = jSpec.getJSONObject("out_params");
+ }
+
+ // tokenization needs a text column to tokenize
+ int tokenizeCol = jSpec.getInt("tokenize_col");
+
+ // tokenization needs one or more idCols that define the document
and are replicated per token
+ List<Integer> idCols = new ArrayList<>();
+ JSONArray idColsJsonArray = jSpec.getJSONArray("id_cols");
+ for (int i=0; i < idColsJsonArray.length(); i++) {
+ idCols.add(idColsJsonArray.getInt(i));
+ }
+ // Output schema is derived from specified id cols
+ int numIdCols = idCols.size();
+
+ // get difference between long and wide format
+ boolean wideFormat = false; // long format is default
+ if (jSpec.has("format_wide")) {
+ wideFormat = jSpec.getBoolean("format_wide");
+ }
+
+ TokenizerPre tokenizerPre;
+ TokenizerPost tokenizerPost;
+
+ // Note that internal representation should be independent from
output representation
+
+ // Algorithm to transform tokens into internal token representation
+ switch (algo) {
+ case "split":
+ tokenizerPre = new TokenizerPreWhitespaceSplit(idCols,
tokenizeCol, algoParams);
+ break;
+ case "ngram":
+ tokenizerPre = new TokenizerPreNgram(idCols, tokenizeCol,
algoParams);
+ break;
+ default:
+ throw new IllegalArgumentException("Algorithm {algo=" +
algo + "} is not supported.");
+ }
+
+ // Transform tokens to output representation
+ switch (out) {
+ case "count":
+ tokenizerPost = new TokenizerPostCount(outParams,
numIdCols, maxTokens, wideFormat);
+ break;
+ case "position":
+ tokenizerPost = new TokenizerPostPosition(outParams,
numIdCols, maxTokens, wideFormat);
+ break;
+ case "hash":
+ tokenizerPost = new TokenizerPostHash(outParams,
numIdCols, maxTokens, wideFormat);
+ break;
+ default:
+ throw new IllegalArgumentException("Output representation
{out=" + out + "} is not supported.");
+ }
+
+ tokenizer = new Tokenizer(tokenizerPre, tokenizerPost);
+ }
+ catch(Exception ex) {
+ throw new DMLRuntimeException(ex);
+ }
+ return tokenizer;
+ }
+}
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPost.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPost.java
new file mode 100644
index 0000000..5f35c89
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPost.java
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+
+import java.io.Serializable;
+import java.util.List;
+
+public interface TokenizerPost extends Serializable {
+ FrameBlock tokenizePost(List<Tokenizer.DocumentToTokens> tl, FrameBlock
out);
+ Types.ValueType[] getOutSchema();
+ long getNumRows(long inRows);
+ long getNumCols();
+}
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostCount.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostCount.java
new file mode 100644
index 0000000..f1f9e81
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostCount.java
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.util.UtilFunctions;
+import org.apache.wink.json4j.JSONException;
+import org.apache.wink.json4j.JSONObject;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+public class TokenizerPostCount implements TokenizerPost{
+
+ private static final long serialVersionUID = 6382000606237705019L;
+ private final Params params;
+ private final int numIdCols;
+ private final int maxTokens;
+ private final boolean wideFormat;
+
+ static class Params implements Serializable {
+
+ private static final long serialVersionUID = 5121697674346781880L;
+
+ public boolean sort_alpha = false;
+
+ public Params(JSONObject json) throws JSONException {
+ if (json != null && json.has("sort_alpha")) {
+ this.sort_alpha = json.getBoolean("sort_alpha");
+ }
+ }
+ }
+
+ public TokenizerPostCount(JSONObject params, int numIdCols, int maxTokens,
boolean wideFormat) throws JSONException {
+ this.params = new Params(params);
+ this.numIdCols = numIdCols;
+ this.maxTokens = maxTokens;
+ this.wideFormat = wideFormat;
+ }
+
+ @Override
+ public FrameBlock tokenizePost(List<Tokenizer.DocumentToTokens> tl,
FrameBlock out) {
+ for (Tokenizer.DocumentToTokens docToToken: tl) {
+ List<Object> keys = docToToken.keys;
+ List<Tokenizer.Token> tokenList = docToToken.tokens;
+ // Creating the counts for BoW
+ Map<String, Long> tokenCounts =
tokenList.stream().collect(Collectors.groupingBy(token ->
+ token.textToken, Collectors.counting()));
+ // Remove duplicate strings
+ Stream<String> distinctTokenStream = tokenList.stream().map(token
-> token.textToken).distinct();
+ if (params.sort_alpha) {
+ // Sort alphabetically
+ distinctTokenStream = distinctTokenStream.sorted();
+ }
+ List<String> outputTokens =
distinctTokenStream.collect(Collectors.toList());
+
+ int numTokens = 0;
+ for (String token: outputTokens) {
+ if (numTokens >= maxTokens) {
+ break;
+ }
+ // Create a row per token
+ long count = tokenCounts.get(token);
+ List<Object> rowList = new ArrayList<>(keys);
+ rowList.add(token);
+ rowList.add(count);
+ Object[] row = new Object[rowList.size()];
+ rowList.toArray(row);
+ out.appendRow(row);
+ numTokens++;
+ }
+ }
+
+ return out;
+ }
+
+ @Override
+ public Types.ValueType[] getOutSchema() {
+ if (wideFormat) {
+ throw new IllegalArgumentException("Wide Format is not supported
for Count Representation.");
+ }
+ // Long format only depends on numIdCols
+ Types.ValueType[] schema = UtilFunctions.nCopies(numIdCols +
2,Types.ValueType.STRING );
+ schema[numIdCols + 1] = Types.ValueType.INT64;
+ return schema;
+ }
+
+ public long getNumRows(long inRows) {
+ if (wideFormat) {
+ return inRows;
+ } else {
+ return inRows * maxTokens;
+ }
+ }
+
+ public long getNumCols() {
+ return this.getOutSchema().length;
+ }
+}
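
The count representation above only supports the long format (the wide path throws). A hedged variant of the earlier sketch that enables the optional sort_alpha flag; it reuses the input frame `in` and the imports from the sketch after Tokenizer.java, and all values are illustrative:

    // illustrative only: alphabetically sorted bag-of-words counts per document
    String spec = "{\"algo\": \"split\", \"out\": \"count\","
        + " \"out_params\": {\"sort_alpha\": true},"
        + " \"id_cols\": [1], \"tokenize_col\": 2}";
    Tokenizer tokenizer = TokenizerFactory.createTokenizer(spec, 1000);
    // output schema: id columns and token as STRING, the count as INT64
    FrameBlock counts = tokenizer.tokenize(in, new FrameBlock(tokenizer.getSchema()));
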
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostHash.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostHash.java
new file mode 100644
index 0000000..f19bdb8
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostHash.java
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.util.UtilFunctions;
+import org.apache.wink.json4j.JSONException;
+import org.apache.wink.json4j.JSONObject;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.TreeMap;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+public class TokenizerPostHash implements TokenizerPost{
+
+ private static final long serialVersionUID = 4763889041868044668L;
+ private final Params params;
+ private final int numIdCols;
+ private final int maxTokens;
+ private final boolean wideFormat;
+
+ static class Params implements Serializable {
+
+ private static final long serialVersionUID = -256069061414241795L;
+
+ public int num_features = 1048576; // 2^20
+
+ public Params(JSONObject json) throws JSONException {
+ if (json != null && json.has("num_features")) {
+ this.num_features = json.getInt("num_features");
+ }
+ }
+ }
+
+ public TokenizerPostHash(JSONObject params, int numIdCols, int maxTokens,
boolean wideFormat) throws JSONException {
+ this.params = new Params(params);
+ this.numIdCols = numIdCols;
+ this.maxTokens = maxTokens;
+ this.wideFormat = wideFormat;
+ }
+
+ @Override
+ public FrameBlock tokenizePost(List<Tokenizer.DocumentToTokens> tl,
FrameBlock out) {
+ for (Tokenizer.DocumentToTokens docToToken: tl) {
+ List<Object> keys = docToToken.keys;
+ List<Tokenizer.Token> tokenList = docToToken.tokens;
+ // Transform to hashes
+ List<Integer> hashList = tokenList.stream().map(token ->
token.textToken.hashCode() %
+ params.num_features).collect(Collectors.toList());
+ // Counting the hashes
+ Map<Integer, Long> hashCounts =
hashList.stream().collect(Collectors.groupingBy(Function.identity(),
+ Collectors.counting()));
+ // Sorted by hash
+ Map<Integer, Long> sortedHashes = new TreeMap<>(hashCounts);
+
+ if (wideFormat) {
+ this.appendTokensWide(keys, sortedHashes, out);
+ } else {
+ this.appendTokensLong(keys, sortedHashes, out);
+ }
+ }
+
+ return out;
+ }
+
+ private void appendTokensLong(List<Object> keys, Map<Integer, Long>
sortedHashes, FrameBlock out) {
+ int numTokens = 0;
+ for (Map.Entry<Integer, Long> hashCount: sortedHashes.entrySet()) {
+ if (numTokens >= maxTokens) {
+ break;
+ }
+ // Create a row per token
+ int hash = hashCount.getKey() + 1;
+ long count = hashCount.getValue();
+ List<Object> rowList = new ArrayList<>(keys);
+ rowList.add((long) hash);
+ rowList.add(count);
+ Object[] row = new Object[rowList.size()];
+ rowList.toArray(row);
+ out.appendRow(row);
+ numTokens++;
+ }
+ }
+
+ private void appendTokensWide(List<Object> keys, Map<Integer, Long>
sortedHashes, FrameBlock out) {
+ // Create one row with keys as prefix
+ List<Object> rowList = new ArrayList<>(keys);
+
+ for (int tokenPos = 0; tokenPos < maxTokens; tokenPos++) {
+ long positionHash = sortedHashes.getOrDefault(tokenPos, 0L);
+ rowList.add(positionHash);
+ }
+ Object[] row = new Object[rowList.size()];
+ rowList.toArray(row);
+ out.appendRow(row);
+ }
+
+ @Override
+ public Types.ValueType[] getOutSchema() {
+ if (wideFormat) {
+ return getOutSchemaWide(numIdCols, maxTokens);
+ } else {
+ return getOutSchemaLong(numIdCols);
+ }
+ }
+
+ private Types.ValueType[] getOutSchemaWide(int numIdCols, int maxTokens) {
+ Types.ValueType[] schema = new Types.ValueType[numIdCols + maxTokens];
+ int i = 0;
+ for (; i < numIdCols; i++) {
+ schema[i] = Types.ValueType.STRING;
+ }
+ for (int j = 0; j < maxTokens; j++, i++) {
+ schema[i] = Types.ValueType.INT64;
+ }
+ return schema;
+ }
+
+ private Types.ValueType[] getOutSchemaLong(int numIdCols) {
+ Types.ValueType[] schema = UtilFunctions.nCopies(numIdCols +
2,Types.ValueType.STRING );
+ schema[numIdCols] = Types.ValueType.INT64;
+ schema[numIdCols+1] = Types.ValueType.INT64;
+ return schema;
+ }
+
+ public long getNumRows(long inRows) {
+ if (wideFormat) {
+ return inRows;
+ } else {
+ return inRows * maxTokens;
+ }
+ }
+
+ public long getNumCols() {
+ return this.getOutSchema().length;
+ }
+}
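
For the hash representation, a hedged sketch of the wide format (again reusing `in` from the first sketch; values are illustrative): with "format_wide": true each document becomes a single row of id columns followed by max_tokens INT64 columns, whereas the long format emits one (id_cols, hash, count) row per distinct hash value:

    // illustrative only: hashing with 5 features, one wide row per document
    String spec = "{\"algo\": \"split\", \"out\": \"hash\","
        + " \"out_params\": {\"num_features\": 5},"
        + " \"format_wide\": true, \"id_cols\": [1], \"tokenize_col\": 2}";
    Tokenizer tokenizer = TokenizerFactory.createTokenizer(spec, 5);
    // resulting schema: 1 STRING id column followed by 5 INT64 columns
    FrameBlock hashed = tokenizer.tokenize(in, new FrameBlock(tokenizer.getSchema()));
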
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostPosition.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostPosition.java
new file mode 100644
index 0000000..4451f08
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPostPosition.java
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.common.Types;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+
+import org.apache.sysds.runtime.util.UtilFunctions;
+import org.apache.wink.json4j.JSONObject;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class TokenizerPostPosition implements TokenizerPost{
+
+ private static final long serialVersionUID = 3563407270742660830L;
+ private final int numIdCols;
+ private final int maxTokens;
+ private final boolean wideFormat;
+
+ public TokenizerPostPosition(JSONObject params, int numIdCols, int
maxTokens, boolean wideFormat) {
+ // No configurable params yet
+ this.numIdCols = numIdCols;
+ this.maxTokens = maxTokens;
+ this.wideFormat = wideFormat;
+ }
+
+ @Override
+ public FrameBlock tokenizePost(List<Tokenizer.DocumentToTokens> tl,
FrameBlock out) {
+ for (Tokenizer.DocumentToTokens docToToken: tl) {
+ List<Object> keys = docToToken.keys;
+ List<Tokenizer.Token> tokenList = docToToken.tokens;
+
+ if (wideFormat) {
+ this.appendTokensWide(keys, tokenList, out);
+ } else {
+ this.appendTokensLong(keys, tokenList, out);
+ }
+ }
+
+ return out;
+ }
+
+ public void appendTokensLong(List<Object> keys, List<Tokenizer.Token>
tokenList, FrameBlock out) {
+ int numTokens = 0;
+ for (Tokenizer.Token token: tokenList) {
+ if (numTokens >= maxTokens) {
+ break;
+ }
+ // Create a row per token
+ List<Object> rowList = new ArrayList<>(keys);
+ // Convert to 1-based index for DML
+ rowList.add(token.startIndex + 1);
+ rowList.add(token.textToken);
+ Object[] row = new Object[rowList.size()];
+ rowList.toArray(row);
+ out.appendRow(row);
+ numTokens++;
+ }
+ }
+
+ public void appendTokensWide(List<Object> keys, List<Tokenizer.Token>
tokenList, FrameBlock out) {
+ // Create one row with keys as prefix
+ List<Object> rowList = new ArrayList<>(keys);
+
+ int numTokens = 0;
+ for (Tokenizer.Token token: tokenList) {
+ if (numTokens >= maxTokens) {
+ break;
+ }
+ rowList.add(token.textToken);
+ numTokens++;
+ }
+ // Remaining positions need to be filled with empty tokens
+ for (; numTokens < maxTokens; numTokens++) {
+ rowList.add("");
+ }
+ Object[] row = new Object[rowList.size()];
+ rowList.toArray(row);
+ out.appendRow(row);
+ }
+
+ @Override
+ public Types.ValueType[] getOutSchema() {
+ if (wideFormat) {
+ return getOutSchemaWide(numIdCols, maxTokens);
+ } else {
+ return getOutSchemaLong(numIdCols);
+ }
+
+ }
+
+ private Types.ValueType[] getOutSchemaWide(int numIdCols, int maxTokens) {
+ Types.ValueType[] schema = UtilFunctions.nCopies(numIdCols +
maxTokens,Types.ValueType.STRING );
+ return schema;
+ }
+
+ private Types.ValueType[] getOutSchemaLong(int numIdCols) {
+ Types.ValueType[] schema = new Types.ValueType[numIdCols + 2];
+ int i = 0;
+ for (; i < numIdCols; i++) {
+ schema[i] = Types.ValueType.STRING;
+ }
+ schema[i] = Types.ValueType.INT64;
+ schema[i+1] = Types.ValueType.STRING;
+ return schema;
+ }
+
+ public long getNumRows(long inRows) {
+ if (wideFormat) {
+ return inRows;
+ } else {
+ return inRows * maxTokens;
+ }
+ }
+
+ public long getNumCols() {
+ return this.getOutSchema().length;
+ }
+}
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPre.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPre.java
new file mode 100644
index 0000000..640bb5b
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPre.java
@@ -0,0 +1,29 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+
+import java.io.Serializable;
+import java.util.List;
+
+public interface TokenizerPre extends Serializable {
+ List<Tokenizer.DocumentToTokens> tokenizePre(FrameBlock in);
+}
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPreNgram.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPreNgram.java
new file mode 100644
index 0000000..a602c2b
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPreNgram.java
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.wink.json4j.JSONException;
+import org.apache.wink.json4j.JSONObject;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+
+public class TokenizerPreNgram implements TokenizerPre {
+
+ private static final long serialVersionUID = -6297904316677723802L;
+
+ private final TokenizerPreWhitespaceSplit tokenizerPreWhitespaceSplit;
+ private final Params params;
+
+ static class Params implements Serializable {
+
+ private static final long serialVersionUID = -6516419749810062677L;
+
+ public int minGram = 1;
+ public int maxGram = 2;
+
+ public Params(JSONObject json) throws JSONException {
+ if (json != null && json.has("min_gram")) {
+ this.minGram = json.getInt("min_gram");
+ }
+ if (json != null && json.has("max_gram")) {
+ this.maxGram = json.getInt("max_gram");
+ }
+ }
+ }
+
+ public TokenizerPreNgram(List<Integer> idCols, int tokenizeCol, JSONObject
params) throws JSONException {
+ this.tokenizerPreWhitespaceSplit = new
TokenizerPreWhitespaceSplit(idCols, tokenizeCol, params);
+ this.params = new Params(params);
+ }
+
+ public List<Tokenizer.Token> wordTokenToNgrams(Tokenizer.Token wordTokens)
{
+ List<Tokenizer.Token> ngramTokens = new ArrayList<>();
+
+ int tokenLen = wordTokens.textToken.length();
+ int startPos = params.minGram - params.maxGram;
+ int endPos = Math.max(tokenLen - params.minGram, startPos);
+
+ for (int i = startPos; i <= endPos; i++) {
+ int startSlice = Math.max(i, 0);
+ int endSlice = Math.min(i + params.maxGram, tokenLen);
+ String substring = wordTokens.textToken.substring(startSlice,
endSlice);
+ long tokenStart = wordTokens.startIndex + startSlice;
+ ngramTokens.add(new Tokenizer.Token(substring, tokenStart));
+ }
+
+ return ngramTokens;
+ }
+
+ public List<Tokenizer.Token> wordTokenListToNgrams(List<Tokenizer.Token>
wordTokens) {
+ List<Tokenizer.Token> ngramTokens = new ArrayList<>();
+
+ for (Tokenizer.Token wordToken: wordTokens) {
+ List<Tokenizer.Token> ngramTokensForWord =
wordTokenToNgrams(wordToken);
+ ngramTokens.addAll(ngramTokensForWord);
+ }
+ return ngramTokens;
+ }
+
+ @Override
+ public List<Tokenizer.DocumentToTokens> tokenizePre(FrameBlock in) {
+ List<Tokenizer.DocumentToTokens> docToWordTokens =
tokenizerPreWhitespaceSplit.tokenizePre(in);
+
+ List<Tokenizer.DocumentToTokens> docToNgramTokens = new ArrayList<>();
+ for (Tokenizer.DocumentToTokens docToTokens: docToWordTokens) {
+ List<Object> keys = docToTokens.keys;
+ List<Tokenizer.Token> wordTokens = docToTokens.tokens;
+ List<Tokenizer.Token> ngramTokens =
wordTokenListToNgrams(wordTokens);
+ docToNgramTokens.add(new Tokenizer.DocumentToTokens(keys,
ngramTokens));
+ }
+ return docToNgramTokens;
+ }
+}
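
To make the index arithmetic in wordTokenToNgrams concrete, a short trace (editor's sketch, not part of the patch) of the window bounds for the word "text" with min_gram=2 and max_gram=3; it can be run inside any main method:

    // trace of the ngram window used in TokenizerPreNgram.wordTokenToNgrams
    String token = "text";
    int minGram = 2, maxGram = 3;
    int startPos = minGram - maxGram;                          // -1
    int endPos = Math.max(token.length() - minGram, startPos); //  2
    for (int i = startPos; i <= endPos; i++) {
        int startSlice = Math.max(i, 0);
        int endSlice = Math.min(i + maxGram, token.length());
        // prints: te, tex, ext, xt (grams of length min_gram..max_gram)
        System.out.println(token.substring(startSlice, endSlice));
    }
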
diff --git
a/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPreWhitespaceSplit.java
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPreWhitespaceSplit.java
new file mode 100644
index 0000000..2653fc0
--- /dev/null
+++
b/src/main/java/org/apache/sysds/runtime/transform/tokenize/TokenizerPreWhitespaceSplit.java
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.runtime.transform.tokenize;
+
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.wink.json4j.JSONException;
+import org.apache.wink.json4j.JSONObject;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+public class TokenizerPreWhitespaceSplit implements TokenizerPre {
+
+ private static final long serialVersionUID = 539127244034913364L;
+
+ private final Params params;
+
+ private final List<Integer> idCols;
+ private final int tokenizeCol;
+
+ static class Params implements Serializable {
+
+ private static final long serialVersionUID = -4368552847660442628L;
+
+ public String regex = "\\s+"; // whitespace
+
+ public Params(JSONObject json) throws JSONException {
+ if (json != null && json.has("regex")) {
+ this.regex = json.getString("regex");
+ }
+ }
+ }
+
+ public TokenizerPreWhitespaceSplit(List<Integer> idCols, int tokenizeCol,
JSONObject params) throws JSONException {
+ this.idCols = idCols;
+ this.tokenizeCol = tokenizeCol;
+ this.params = new Params(params);
+ }
+
+ public List<Tokenizer.Token> splitToTokens(String text) {
+ List<Tokenizer.Token> tokenList = new ArrayList<>();
+ String[] textTokens = text.split(params.regex);
+ int curIndex = 0;
+ for(String textToken: textTokens) {
+ int tokenIndex = text.indexOf(textToken, curIndex);
+ curIndex = tokenIndex;
+ tokenList.add(new Tokenizer.Token(textToken, tokenIndex));
+ }
+ return tokenList;
+ }
+
+ @Override
+ public List<Tokenizer.DocumentToTokens> tokenizePre(FrameBlock in) {
+ List<Tokenizer.DocumentToTokens> documentsToTokenList = new
ArrayList<>();
+
+ Iterator<String[]> iterator = in.getStringRowIterator();
+ iterator.forEachRemaining(s -> {
+ // Convert index value to Java (0-based) from DML (1-based)
+ String text = s[tokenizeCol - 1];
+ List<Object> keys = new ArrayList<>();
+ for (Integer idCol: idCols) {
+ Object key = s[idCol - 1];
+ keys.add(key);
+ }
+
+ // Transform to Bag format internally
+ List<Tokenizer.Token> tokenList = splitToTokens(text);
+ documentsToTokenList.add(new Tokenizer.DocumentToTokens(keys,
tokenList));
+ });
+
+ return documentsToTokenList;
+ }
+}
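
Note that this pre-tokenizer also honors an optional regex in algo_params (defaulting to "\s+"), which the documentation table above does not list. A hedged example spec using it, with illustrative values:

    // illustrative only: split on commas instead of whitespace
    String spec = "{\"algo\": \"split\","
        + " \"algo_params\": {\"regex\": \",\"},"
        + " \"out\": \"position\", \"id_cols\": [1], \"tokenize_col\": 2}";
    Tokenizer tokenizer = TokenizerFactory.createTokenizer(spec, 1000);
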
diff --git
a/src/test/java/org/apache/sysds/test/functions/transform/TokenizeTest.java
b/src/test/java/org/apache/sysds/test/functions/transform/TokenizeTest.java
new file mode 100644
index 0000000..89d339b
--- /dev/null
+++ b/src/test/java/org/apache/sysds/test/functions/transform/TokenizeTest.java
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.transform;
+
+import org.apache.sysds.api.DMLScript;
+import org.apache.sysds.runtime.io.*;
+import org.junit.Test;
+import org.apache.sysds.common.Types.ExecMode;
+import org.apache.sysds.runtime.matrix.data.FrameBlock;
+import org.apache.sysds.runtime.util.DataConverter;
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.apache.sysds.test.TestUtils;
+
+
+public class TokenizeTest extends AutomatedTestBase
+{
+ private static final String TEST_DIR = "functions/transform/";
+ private static final String TEST_CLASS_DIR = TEST_DIR +
TokenizeTest.class.getSimpleName() + "/";
+
+ private static final String TEST_SPLIT_COUNT_LONG =
"tokenize/TokenizeSplitCountLong";
+ private static final String TEST_NGRAM_POS_LONG =
"tokenize/TokenizeNgramPosLong";
+ private static final String TEST_NGRAM_POS_WIDE =
"tokenize/TokenizeNgramPosWide";
+ private static final String TEST_UNI_HASH_WIDE =
"tokenize/TokenizeUniHashWide";
+
+ //dataset and transform tasks without missing values
+ private final static String DATASET =
"20news/20news_subset_untokenized.csv";
+
+ @Override
+ public void setUp() {
+ TestUtils.clearAssertionInformation();
+ addTestConfiguration(TEST_SPLIT_COUNT_LONG,
+ new TestConfiguration(TEST_CLASS_DIR, TEST_SPLIT_COUNT_LONG,
new String[] { "R" }) );
+ addTestConfiguration(TEST_NGRAM_POS_LONG,
+ new TestConfiguration(TEST_CLASS_DIR, TEST_NGRAM_POS_LONG, new
String[] { "R" }) );
+ addTestConfiguration(TEST_NGRAM_POS_WIDE,
+ new TestConfiguration(TEST_CLASS_DIR, TEST_NGRAM_POS_WIDE, new
String[] { "R" }) );
+ addTestConfiguration(TEST_UNI_HASH_WIDE,
+ new TestConfiguration(TEST_CLASS_DIR, TEST_UNI_HASH_WIDE, new
String[] { "R" }) );
+ }
+
+ @Test
+ public void testTokenizeSingleNodeSplitCountLong() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_SPLIT_COUNT_LONG,false);
+ }
+
+ @Test
+ public void testTokenizeSparkSplitCountLong() {
+
+ runTokenizeTest(ExecMode.SPARK, TEST_SPLIT_COUNT_LONG, false);
+ }
+
+ @Test
+ public void testTokenizeHybridSplitCountLong() {
+
+
+ runTokenizeTest(ExecMode.HYBRID, TEST_SPLIT_COUNT_LONG, false);
+ }
+
+ @Test
+ public void testTokenizeParReadSingleNodeSplitCountLong() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_SPLIT_COUNT_LONG, true);
+ }
+
+ @Test
+ public void testTokenizeParReadSparkSplitCountLong() {
+ runTokenizeTest(ExecMode.SPARK, TEST_SPLIT_COUNT_LONG, true);
+ }
+
+ @Test
+ public void testTokenizeParReadHybridSplitCountLong() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_SPLIT_COUNT_LONG, true);
+ }
+
+ @Test
+ public void testTokenizeSingleNodeNgramPosLong() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_NGRAM_POS_LONG,false);
+ }
+
+ // Long format on spark execution fails with: Number of non-zeros mismatch
on merge disjoint
+// @Test
+// public void testTokenizeSparkNgramPosLong() {
+// runTokenizeTest(ExecMode.SPARK, TEST_NGRAM_POS_LONG, false);
+// }
+
+ @Test
+ public void testTokenizeHybridNgramPosLong() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_NGRAM_POS_LONG, false);
+ }
+
+ @Test
+ public void testTokenizeParReadSingleNodeNgramPosLong() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_NGRAM_POS_LONG, true);
+ }
+
+ // Long format on spark execution fails with: Number of non-zeros mismatch
on merge disjoint
+// @Test
+// public void testTokenizeParReadSparkNgramPosLong() {
+// runTokenizeTest(ExecMode.SPARK, TEST_NGRAM_POS_LONG, true);
+// }
+
+ @Test
+ public void testTokenizeParReadHybridNgramPosLong() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_NGRAM_POS_LONG, true);
+ }
+
+ @Test
+ public void testTokenizeSingleNodeNgramPosWide() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_NGRAM_POS_WIDE,false);
+ }
+
+ @Test
+ public void testTokenizeSparkNgramPosWide() {
+ runTokenizeTest(ExecMode.SPARK, TEST_NGRAM_POS_WIDE, false);
+ }
+
+ @Test
+ public void testTokenizeHybridNgramPosWide() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_NGRAM_POS_WIDE, false);
+ }
+
+ @Test
+ public void testTokenizeParReadSingleNodeNgramPosWide() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_NGRAM_POS_WIDE, true);
+ }
+
+ @Test
+ public void testTokenizeParReadSparkNgramPosWide() {
+ runTokenizeTest(ExecMode.SPARK, TEST_NGRAM_POS_WIDE, true);
+ }
+
+ @Test
+ public void testTokenizeParReadHybridNgramPosWide() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_NGRAM_POS_WIDE, true);
+ }
+
+ @Test
+ public void testTokenizeSingleNodeUniHashWide() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_UNI_HASH_WIDE, false);
+ }
+
+ @Test
+ public void testTokenizeSparkUniHashWide() {
+ runTokenizeTest(ExecMode.SPARK, TEST_UNI_HASH_WIDE, false);
+ }
+
+ @Test
+ public void testTokenizeHybridUniHashWide() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_UNI_HASH_WIDE, false);
+ }
+
+ @Test
+ public void testTokenizeParReadSingleNodeUniHashWide() {
+ runTokenizeTest(ExecMode.SINGLE_NODE, TEST_UNI_HASH_WIDE, true);
+ }
+
+ @Test
+ public void testTokenizeParReadSparkUniHashWide() {
+ runTokenizeTest(ExecMode.SPARK, TEST_UNI_HASH_WIDE, true);
+ }
+
+ @Test
+ public void testTokenizeParReadHybridUniHashWide() {
+ runTokenizeTest(ExecMode.HYBRID, TEST_UNI_HASH_WIDE, true);
+ }
+
+ private void runTokenizeTest(ExecMode rt, String test_name, boolean parRead)
+ {
+ //set runtime platform
+ ExecMode rtold = rtplatform;
+ rtplatform = rt;
+
+ boolean sparkConfigOld = DMLScript.USE_LOCAL_SPARK_CONFIG;
+ if( rtplatform == ExecMode.SPARK || rtplatform == ExecMode.HYBRID)
+ DMLScript.USE_LOCAL_SPARK_CONFIG = true;
+
+ try
+ {
+ getAndLoadTestConfiguration(test_name);
+
+ String HOME = SCRIPT_DIR + TEST_DIR;
+ fullDMLScriptName = HOME + test_name + ".dml";
+ programArgs = new String[]{"-stats","-args",
+ HOME + "input/" + DATASET, HOME + test_name + ".json",
output("R") };
+
+ runTest(true, false, null, -1);
+
+ //read the tokenized output frame and print it for inspection
+ FrameReader reader2 = parRead ?
+ new FrameReaderTextCSVParallel( new FileFormatPropertiesCSV() ) :
+ new FrameReaderTextCSV( new FileFormatPropertiesCSV() );
+ FrameBlock fb2 = reader2.readFrameFromHDFS(output("R"), -1L, -1L);
+ System.out.println(DataConverter.toString(fb2));
+ }
+ catch(Exception ex) {
+ throw new RuntimeException(ex);
+ }
+ finally {
+ rtplatform = rtold;
+ DMLScript.USE_LOCAL_SPARK_CONFIG = sparkConfigOld;
+ }
+ }
+}
diff --git a/src/test/scripts/functions/transform/input/20news/20news_subset_untokenized.csv b/src/test/scripts/functions/transform/input/20news/20news_subset_untokenized.csv
new file mode 100644
index 0000000..f622607
--- /dev/null
+++ b/src/test/scripts/functions/transform/input/20news/20news_subset_untokenized.csv
@@ -0,0 +1,3 @@
+20news-bydate-test,alt.atheism,53068,From: [email protected]
(dean.kaflowitz) Subject: Re: about the bible quiz answers Organization: AT&T
Distribution: na Lines: 18 In article <[email protected]>
[email protected] (Tammy R Healy) writes: > > > #12) The 2 cheribums are
on the Ark of the Covenant. When God said make no > graven image he was
refering to idols which were created to be worshipped. > The Ark of the
Covenant wasn't wrodhipped and only the [...]
+20news-bydate-test,alt.atheism,53257,From: [email protected] (Chris Faehl)
Subject: Re: Amusing atheists and agnostics Organization: University of New
Mexico Albuquerque Lines: 88 Distribution: world NNTP-Posting-Host:
vesta.unm.edu In article <timmbake.735265296@mcl> [email protected] (
Clam Bake Timmons) writes: > > >Fallacy #1: Atheism is a faith. Lo! I hear
the FAQ beckoning once again... > >[wonderful Rule #3 deleted - you're correct
you didn't say anything >about > >a [...]
+20news-bydate-test,alt.atheism,53260,From: mathew <[email protected]>
Subject: Re: Yet more Rushdie [Re: ISLAMIC LAW] Organization: Mantis
Consultants Cambridge. UK. X-Newsreader: rusnews v1.02 Lines: 50
[email protected] (Gregg Jaeger) writes: >In article <[email protected]>
[email protected] (Robert >Beauchaine) writes: >>Bennett Neil. How BCCI
adapted the Koran rules of banking . The >>Times. August 13 1991. > > So
let's see. If some guy writes a piece with a [...]
\ No newline at end of file
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosLong.dml b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosLong.dml
new file mode 100644
index 0000000..400c4f0
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosLong.dml
@@ -0,0 +1,47 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+F1 = read($1, data_type="frame", format="csv", sep=",");
+
+# Example Preprocessing:
+F1[,4] = map(F1[,4], "x -> x.toLowerCase()");
+
+# Example to estimate max tokens based on string length
+# max_token = as.integer(max(as.matrix(map(F1[,4], "x -> x.length()"))));
+# print(max_token);
+
+max_token = 2000;
+
+# Example spec:
+# jspec = "{\"algo\": \"whitespace\",\"out\": \"count\",\"id_col\":
[2],\"tokenize_col\": 3}";
+
+jspec = read($2, data_type="scalar", value_type="string");
+
+# If slicing is performed, only the remaining columns are considered for the tokenizer spec:
+F2 = tokenize(target=F1[,2:4], spec=jspec, max_tokens=max_token);
+write(F2, $3, format="csv");
+
+# Followup spec to transform tokens
+jspec2 = "{\"ids\": true, \"recode\": [1,2,3,4]}";
+F2 = F2[,1:4];
+
+# Afterward, you can transform it into a matrix:
+[X, M] = transformencode(target=F2, spec=jspec2);
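
A quick way to eyeball the tokenizer output of the script above is to print the dimensions of F2 before encoding it. A minimal DML sketch (hypothetical addition, not part of the commit):

  # hypothetical sanity check: size of the tokenized frame F2 from above
  print("tokenized frame: " + nrow(F2) + " x " + ncol(F2));
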
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosLong.json b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosLong.json
new file mode 100644
index 0000000..b2f522f
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosLong.json
@@ -0,0 +1,12 @@
+{
+ "algo": "ngram",
+ "algo_params": {
+ "min_gram": 2,
+ "max_gram": 3,
+ "regex": "\\\\W+"
+ },
+ "out": "position",
+ "format_wide": false,
+ "id_cols": [1,2],
+ "tokenize_col": 3
+}
\ No newline at end of file
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosWide.dml b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosWide.dml
new file mode 100644
index 0000000..400c4f0
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosWide.dml
@@ -0,0 +1,47 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+F1 = read($1, data_type="frame", format="csv", sep=",");
+
+# Example Preprocessing:
+F1[,4] = map(F1[,4], "x -> x.toLowerCase()");
+
+# Example to estimate max tokens based on string length
+# max_token = as.integer(max(as.matrix(map(F1[,4], "x -> x.length()"))));
+# print(max_token);
+
+max_token = 2000;
+
+# Example spec:
+# jspec = "{\"algo\": \"whitespace\",\"out\": \"count\",\"id_col\":
[2],\"tokenize_col\": 3}";
+
+jspec = read($2, data_type="scalar", value_type="string");
+
+# If slicing is performed, only the remaining columns are considered for the tokenizer spec:
+F2 = tokenize(target=F1[,2:4], spec=jspec, max_tokens=max_token);
+write(F2, $3, format="csv");
+
+# Followup spec to transform tokens
+jspec2 = "{\"ids\": true, \"recode\": [1,2,3,4]}";
+F2 = F2[,1:4];
+
+# Afterward, you can transform it into a matrix:
+[X, M] = transformencode(target=F2, spec=jspec2);
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosWide.json b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosWide.json
new file mode 100644
index 0000000..8087389
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeNgramPosWide.json
@@ -0,0 +1,12 @@
+{
+ "algo": "ngram",
+ "algo_params": {
+ "min_gram": 2,
+ "max_gram": 3,
+ "regex": "\\\\W+"
+ },
+ "out": "position",
+ "format_wide": true,
+ "id_cols": [1,2],
+ "tokenize_col": 3
+}
\ No newline at end of file
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeSplitCountLong.dml b/src/test/scripts/functions/transform/tokenize/TokenizeSplitCountLong.dml
new file mode 100644
index 0000000..6e7226c
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeSplitCountLong.dml
@@ -0,0 +1,49 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+F1 = read($1, data_type="frame", format="csv", sep=",");
+
+# Example Preprocessing:
+F1[,4] = map(F1[,4], "x -> x.toLowerCase()");
+
+# Example to estimate max tokens based on string length
+# max_token = as.integer(max(as.matrix(map(F1[,4], "x -> x.length()"))));
+# print(max_token);
+
+max_token = 2000;
+
+# Example spec:
+# jspec = "{\"algo\": \"whitespace\",\"out\": \"count\",\"id_col\":
[2],\"tokenize_col\": 3}";
+
+jspec = read($2, data_type="scalar", value_type="string");
+
+# If slicing is performed, only the remaining columns are considered for the tokenizer spec:
+F2 = tokenize(target=F1[,2:4], spec=jspec, max_tokens=max_token);
+write(F2, $3, format="csv");
+
+# Followup spec to transform tokens
+jspec2 = "{\"ids\": true, \"recode\": [1,2]}";
+
+# Afterward, you can transform it into a matrix:
+[X, M] = transformencode(target=F2, spec=jspec2);
+# If the output format is long, table() can be applied afterwards to build a count matrix:
+X = table(X[,1], X[,2], X[,3]);
+
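
As a possible continuation of the script above (hypothetical, not part of the commit), the document-by-term count matrix X built by table() could be row-normalized into term frequencies; the rowSums() vector is broadcast element-wise across the columns of X:

  # hypothetical follow-up: term-frequency normalization of the count matrix X
  TF = X / (rowSums(X) + 1e-16);   # small epsilon guards against all-zero rows
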
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeSplitCountLong.json b/src/test/scripts/functions/transform/tokenize/TokenizeSplitCountLong.json
new file mode 100644
index 0000000..201ff17
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeSplitCountLong.json
@@ -0,0 +1,9 @@
+{
+ "algo": "split",
+ "out": "count",
+ "out_params": {
+ "sort_alpha": true
+ },
+ "id_cols": [2],
+ "tokenize_col": 3
+}
\ No newline at end of file
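
Since the spec files are read into the scripts as plain strings, the same configuration can also be inlined in DML, following the pattern of the commented "Example spec" lines in the scripts above, with the key/value pairs taken from TokenizeSplitCountLong.json. A minimal sketch (hypothetical; it assumes F1 has already been read as in the scripts):

  jspec = "{\"algo\": \"split\", \"out\": \"count\", \"out_params\": {\"sort_alpha\": true}, \"id_cols\": [2], \"tokenize_col\": 3}";
  F2 = tokenize(target=F1[,2:4], spec=jspec, max_tokens=2000);
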
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeUniHashWide.dml b/src/test/scripts/functions/transform/tokenize/TokenizeUniHashWide.dml
new file mode 100644
index 0000000..0061e22
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeUniHashWide.dml
@@ -0,0 +1,46 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+F1 = read($1, data_type="frame", format="csv", sep=",");
+
+# Example Preprocessing:
+F1[,4] = map(F1[,4], "x -> x.toLowerCase()");
+
+# Example to estimate max tokens based on string length
+# max_token = as.integer(max(as.matrix(map(F1[,4], "x -> x.length()"))));
+# print(max_token);
+
+max_token = 2000;
+
+# Example spec:
+# jspec = "{\"algo\": \"whitespace\",\"out\": \"count\",\"id_col\":
[2],\"tokenize_col\": 3}";
+
+jspec = read($2, data_type="scalar", value_type="string");
+
+# If slicing is performed, only the remaining columns are considered for the tokenizer spec:
+F2 = tokenize(target=F1[,2:4], spec=jspec, max_tokens=max_token);
+write(F2, $3, format="csv");
+
+# Follow-up spec to recode the id tokens; the hashed feature columns are left as-is
+jspec2 = "{\"ids\": true, \"recode\": [1,2]}";
+
+# Afterward, you can transform it into a matrix:
+[X, M] = transformencode(target=F2, spec=jspec2);
diff --git a/src/test/scripts/functions/transform/tokenize/TokenizeUniHashWide.json b/src/test/scripts/functions/transform/tokenize/TokenizeUniHashWide.json
new file mode 100644
index 0000000..f87e873
--- /dev/null
+++ b/src/test/scripts/functions/transform/tokenize/TokenizeUniHashWide.json
@@ -0,0 +1,14 @@
+{
+ "algo": "ngram",
+ "algo_params": {
+ "min_gram": 1,
+ "max_gram": 3
+ },
+ "out": "hash",
+ "out_params": {
+ "num_features": 128
+ },
+ "format_wide": true,
+ "id_cols": [2,1],
+ "tokenize_col": 3
+}
\ No newline at end of file
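
For the wide hash output configured above (out "hash" with num_features 128, format_wide true), the encoded matrix X from TokenizeUniHashWide.dml presumably holds the two recoded id columns followed by the hashed token counts. A minimal DML sketch of a per-document total (hypothetical; the id-columns-first layout is an assumption, not verified here):

  # hypothetical follow-up: total hashed-token count per row of X,
  # assuming columns 1-2 hold the recoded ids
  totals = rowSums(X[, 3:ncol(X)]);
  print(toString(totals));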