GitHub user takuti commented on a diff in the pull request:
https://github.com/apache/incubator-hivemall/pull/118#discussion_r142300766
--- Diff: core/src/main/java/hivemall/tools/text/NgramsUDF.java ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.tools.text;
+
+import hivemall.utils.lang.StringUtils;
+
+import org.apache.hadoop.hive.ql.exec.Description;
+import org.apache.hadoop.hive.ql.exec.UDF;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.udf.UDFType;
+import org.apache.hadoop.io.Text;
+
+import javax.annotation.Nonnegative;
+import javax.annotation.Nonnull;
+
+import java.util.ArrayList;
+import java.util.List;
+
+@Description(name = "to_ngrams", value = "_FUNC_(array<string> words, int minSize, int maxSize])"
--- End diff --
Since the function is also applicable to a list of characters (character-based
ngrams), using the word `word` in the name, e.g., `wordgram`, sounds
inappropriate to me.
How about:
- `ngrams_joined_all`
  - in contrast to the original `ngrams`, which returns the top-k most
    frequent ngrams, each "ngram" represented as a list of words
- `ngrams_between`, `ngrams_in_range`
  - our UDF returns ngrams where `min <= n <= max`
- `to_ngram_list`
  - a little bit longer than `to_ngrams` and less likely to be confusing
  - `ngrams` creates a list of named structs, but ours simply returns a
    list of ngrams
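
To make the `min <= n <= max` behavior behind these name candidates concrete, here is a minimal sketch in plain Java. It is not the code from this PR: the class name `NgramRangeSketch`, the method name `ngramsInRange`, the single-space separator, and the output order are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the UDF under review.
class NgramRangeSketch {

    // Returns every n-gram of `tokens` with minSize <= n <= maxSize,
    // each one joined into a single string (assumed separator: one space).
    static List<String> ngramsInRange(List<String> tokens, int minSize, int maxSize) {
        if (minSize < 1 || maxSize < minSize) {
            throw new IllegalArgumentException(
                "expected 1 <= minSize <= maxSize, got " + minSize + " and " + maxSize);
        }
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            // Grow the window from minSize up to maxSize, stopping at the end of the list.
            for (int n = minSize; n <= maxSize && start + n <= tokens.size(); n++) {
                result.add(String.join(" ", tokens.subList(start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tokens may be words or single characters; the logic is the same.
        System.out.println(ngramsInRange(Arrays.asList("a", "b", "c"), 1, 2));
        // prints [a, a b, b, b c, c]
    }
}
```

Because the input is just a list of strings, the same call works for character-based ngrams, which is why a name containing `word` would be misleading.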
---