[ https://issues.apache.org/jira/browse/SPARK-13103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-13103.
-------------------------------
    Resolution: Cannot Reproduce

I can't reproduce this:

{code}
>>> from pyspark.mllib.feature import HashingTF, IDF
>>> hashtf = HashingTF()
>>> hashtf.indexOf('的問題哦')
594182
>>> hashtf.indexOf('豪們都把')
227158
{code}

Can you try the latest master just to double check?
Your idea is a good one, Holden, but the default number of features is 2^20. That shouldn't be an issue here.

> HashTF dosn't count TF correctly
> --------------------------------
>
>                 Key: SPARK-13103
>                 URL: https://issues.apache.org/jira/browse/SPARK-13103
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.0
>         Environment: Ubuntu 14.04
>                      Python 3.4.3
>            Reporter: Louis Liu
>
> I wrote a Python program to calculate frequencies of n-gram sequences with
> HashTF, but it generates strange output: it found more occurrences of
> "一一下嗎" than of "一一下".
> HashTF gets each word's index with hash(), but the hashes of some Chinese
> words are negative.
> Ex:
> >>> hash('一一下嗎')
> -6433835193350070115
> >>> hash('一一下')
> -5938108283593463272
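
For anyone following along, here is a minimal sketch of why a negative hash() value need not produce a bad index by itself. It assumes the index is computed as the hash modulo the number of features; that is an assumption about the implementation, not something confirmed in this thread. The helper index_of below is hypothetical and only mimics that step, and the two hash values are the ones quoted in the report. With a positive divisor, Python's % operator always returns a non-negative result.

```python
# Hypothetical stand-in for the hash-to-index step; not the actual Spark code.
NUM_FEATURES = 1 << 20  # 2^20, the default number of features mentioned above


def index_of(term_hash, num_features=NUM_FEATURES):
    """Map a (possibly negative) hash value to a bucket index.

    In Python, a % b takes the sign of b, so for a positive num_features
    the result always lies in [0, num_features).
    """
    return term_hash % num_features


# Hash values quoted in the issue report (both negative).
h1 = -6433835193350070115  # reported hash('一一下嗎')
h2 = -5938108283593463272  # reported hash('一一下')

i1 = index_of(h1)
i2 = index_of(h2)
print(i1, i2)
```

Under that assumption, both indices fall inside the valid range, so negative hashes alone would not explain the wrong counts.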