[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952618#comment-15952618 ]
Cyril de Vogelaere commented on SPARK-20180:
--------------------------------------------

Yes they can, it's really not a critical issue at all. The current default pattern length also works well for most users in practice, except on very large datasets where sequences are very long. But then I suppose people would know about the parameter and set it to a large value.

However, changing it so that a default value allows unlimited pattern length would cost nothing in terms of performance; it's just an additional condition in an if. It may also be easier than always setting the highest possible value. At least, that option wouldn't hurt.

Actually, I have quite a few improvements in store for PrefixSpan, since I worked on the algorithm for my master's thesis. Notably, a very performant implementation that specializes PrefixSpan for single-item patterns, while slightly improving the performance of multi-item patterns. But I was told I needed to get familiar with contributing to Spark first ^^', which is why I'm proposing this small, non-critical improvement and implementing it.

I'm ready to push this small change anytime, it's already implemented. But the contributor wiki asks to run dev/run-tests before pushing, and it's been running for a day and a half already... Is that normal, by the way?

Also, the tests already found some errors, but I'm 99.999% sure they're not mine. They're not even from the mllib module, which is the only thing I modified... Is that normal too?
I suppose so, but I wouldn't want to waste the reviewers' time ^^'

> Unlimited max pattern length in Prefix span
> -------------------------------------------
>
>                 Key: SPARK-20180
>                 URL: https://issues.apache.org/jira/browse/SPARK-20180
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.1.0
>            Reporter: Cyril de Vogelaere
>            Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Right now, we need to use the .setMaxPatternLength() method to
> specify the maximum pattern length of a sequence. Any pattern longer than
> that won't be output.
> The current default maxPatternLength value is 10.
> This should be changed so that with input 0, all patterns of any length would
> be output. Additionally, the default value should be changed to 0, so that
> a new user could find all patterns in his dataset without looking at this
> parameter.
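
To make the proposal concrete, here is a minimal sketch of the "additional condition in an if" idea, assuming a maximum pattern length of 0 means "no limit". It is a toy, pure-Python miner for sequences of single items, not Spark's PrefixSpan implementation; the names `can_extend` and `frequent_sequences` are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def can_extend(prefix_len, max_pattern_length):
    """Return True if a prefix of this length may still grow.

    Assumption (from the proposal): 0 means "no limit".
    """
    return max_pattern_length == 0 or prefix_len < max_pattern_length


def frequent_sequences(sequences, min_count, max_pattern_length=0):
    """Toy prefix-projection miner for single-item sequences."""
    results = []

    def grow(prefix, projected):
        # projected: list of (sequence index, start offset) pairs
        if not can_extend(len(prefix), max_pattern_length):
            return
        # For each item, record its first occurrence in every suffix.
        occurrences = defaultdict(list)
        for seq_idx, start in projected:
            seen = set()
            seq = sequences[seq_idx]
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:
                    seen.add(item)
                    occurrences[item].append((seq_idx, pos + 1))
        for item in sorted(occurrences):
            support = len(occurrences[item])
            if support >= min_count:
                pattern = prefix + [item]
                results.append((pattern, support))
                grow(pattern, occurrences[item])

    grow([], [(i, 0) for i in range(len(sequences))])
    return results


if __name__ == "__main__":
    data = [[1, 2, 3], [1, 2, 3], [1, 3]]
    for pattern, support in frequent_sequences(data, min_count=2):
        print(pattern, support)
```

With the limit set to 0 the only extra work is the `max_pattern_length == 0` comparison, which is why the change should have no measurable performance cost in this sketch.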