[ https://issues.apache.org/jira/browse/SPARK-13103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130964#comment-15130964 ]
holdenk commented on SPARK-13103:
---------------------------------

I think that is expected behavior - if you're seeing too large a number of collisions, you can change the number of features with setNumFeatures.

> HashTF doesn't count TF correctly
> ---------------------------------
>
>                 Key: SPARK-13103
>                 URL: https://issues.apache.org/jira/browse/SPARK-13103
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.0
>        Environment: Ubuntu 14.04
>                     Python 3.4.3
>            Reporter: Louis Liu
>
> I wrote a Python program to calculate the frequencies of n-gram sequences with HashTF,
> but it produced a strange output: it found more occurrences of "一一下嗎" than of "一一下".
> HashTF derives a word's index from hash(), and the hashes of some Chinese words are negative.
> Ex:
> >>> hash('一一下嗎')
> -6433835193350070115
> >>> hash('一一下')
> -5938108283593463272
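For context, here is a minimal sketch of why a negative hash() is not itself the bug, assuming the Spark 1.6 PySpark implementation where HashingTF.indexOf(term) is hash(term) % numFeatures. Python's % with a positive modulus always returns a non-negative result, so a negative hash still yields a valid bucket index; the real effect is that two distinct n-grams can land in the same bucket, merging their counts. The script below reuses the two hash values quoted in the report (with Python 3 hash randomization enabled, your own hash() output will differ between runs):

# Sketch of HashingTF's bucketing, assuming the Spark 1.6 PySpark
# behavior: indexOf(term) == hash(term) % numFeatures.
# The two hash values are the ones quoted in the report above; with
# Python 3 hash randomization your own hash('一一下嗎') will differ.

def index_of(term_hash, num_features):
    # Python's % with a positive modulus is always non-negative,
    # so a negative hash still maps to a valid bucket index.
    return term_hash % num_features

h_long = -6433835193350070115   # hash('一一下嗎') from the report
h_short = -5938108283593463272  # hash('一一下') from the report

for num_features in (1 << 10, 1 << 20, 1 << 24):
    i1 = index_of(h_long, num_features)
    i2 = index_of(h_short, num_features)
    status = "COLLIDE (counts merge)" if i1 == i2 else "distinct buckets"
    print("numFeatures={0}: {1} vs {2} -> {3}".format(
        num_features, i1, i2, status))

If collisions between frequent n-grams are inflating counts, raising the table size lowers the collision probability - e.g. HashingTF(numFeatures=1 << 22) in pyspark.mllib.feature, or setNumFeatures on the spark.ml HashingTF, as suggested above.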