Nick Lothian created SPARK-20838:
------------------------------------

             Summary: Spark ML ngram feature extractor should support ngram range like scikit
                 Key: SPARK-20838
                 URL: https://issues.apache.org/jira/browse/SPARK-20838
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.1.1
            Reporter: Nick Lothian


Currently the Spark ML ngram extractor requires a single ngram size (which defaults to 2).

This means that to tokenize text into words, bigrams, and trigrams (which is pretty 
common) you need a pipeline like this:

    tokenizer = Tokenizer(inputCol="text", outputCol="tokenized_text")
    remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="words")
    bigram = NGram(n=2, inputCol=remover.getOutputCol(), outputCol="bigrams")
    trigram = NGram(n=3, inputCol=remover.getOutputCol(), outputCol="trigrams")
    
    pipeline = Pipeline(stages=[tokenizer, remover, bigram, trigram])


That's not terrible, but the big problem is that the words, bigrams, and 
trigrams end up in separate fields, and the only way (in pyspark) to combine 
them is to explode each of the words, bigrams, and trigrams fields and then 
union them together.

In my experience this makes feature extraction slower than simply using a 
Python UDF. This seems preposterous!
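For reference, a minimal pure-Python sketch of the scikit-style ngram-range 
behaviour being requested, of the kind that could be wrapped in a Python UDF 
today (the function name and defaults are illustrative, not an existing API):

```python
def ngram_range(tokens, min_n=1, max_n=3):
    """Return all n-grams of `tokens` for n in [min_n, max_n],
    mirroring scikit-learn's ngram_range=(min_n, max_n) behaviour."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Words, bigrams, and trigrams all land in one list:
ngram_range(["spark", "ml", "rocks"], 1, 2)
# -> ['spark', 'ml', 'rocks', 'spark ml', 'ml rocks']
```

In pyspark this could be registered as a UDF returning 
`ArrayType(StringType())` and applied to the stop-word-removed column in a 
single step, avoiding the explode/union dance entirely.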



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
