[ https://issues.apache.org/jira/browse/SPARK-13103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130187#comment-15130187 ]
Louis Liu commented on SPARK-13103:
-----------------------------------

I'm sorry, you are right. The negative numbers don't matter. This code should show the problem: two different strings with different hashes both map to index 0.

>>> from pyspark.mllib.feature import HashingTF, IDF
>>> hashtf = HashingTF()
>>> hash('的問題哦')
-234244945207099392
>>> hash('豪們都把')
8689153874407194624
>>> hashtf.indexOf('的問題哦')
0
>>> hashtf.indexOf('豪們都把')
0

> HashTF dosn't count TF correctly
> --------------------------------
>
>                 Key: SPARK-13103
>                 URL: https://issues.apache.org/jira/browse/SPARK-13103
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>   Affects Versions: 1.6.0
>         Environment: Ubuntu 14.04
>                      Python 3.4.3
>            Reporter: Louis Liu
>
> I wrote a Python program to calculate frequencies of n-gram sequences with HashTF,
> but it generates strange output: it found more "一一下嗎" than "一一下".
> HashTF gets a word's index with hash(),
> but the hashes of some Chinese words are negative.
> Ex:
> >>> hash('一一下嗎')
> -6433835193350070115
> >>> hash('一一下')
> -5938108283593463272
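The collision reported in the comment can be checked without Spark. The sketch below assumes that HashingTF.indexOf reduces hash(term) modulo numFeatures and that numFeatures defaults to 1 << 20; both are assumptions about the 1.6.0 implementation, and index_of is a hypothetical stand-in, not the library function. It takes the hash values from the session above as plain integers, since string hashes can differ between Python builds and runs.

```python
# Assumed default feature count for HashingTF (2**20); not confirmed by the report.
NUM_FEATURES = 1 << 20


def index_of(hash_value, num_features=NUM_FEATURES):
    """Hypothetical stand-in for HashingTF.indexOf: hash modulo feature count."""
    # Python's % returns a non-negative result for a positive modulus,
    # so a negative hash still yields a valid index.
    return hash_value % num_features


# Hash values copied from the reported session (Python 3.4.3, Ubuntu 14.04).
reported_hashes = {
    '的問題哦': -234244945207099392,
    '豪們都把': 8689153874407194624,
}

indices = {term: index_of(h) for term, h in reported_hashes.items()}
for term, idx in indices.items():
    print(term, idx)
```

Under these assumptions both reported hashes are exact multiples of 2**20, so both terms land in bucket 0 and their counts are merged. The sign of the hash plays no part in this, which matches the correction made in the comment.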