[ https://issues.apache.org/jira/browse/SPARK-13103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124937#comment-15124937 ]
yuhao yang commented on SPARK-13103:
------------------------------------

Thanks for finding this. I'm not sure what the historical reason is, but it is unusual that HashingTF in Python was implemented independently of the Scala version.

> HashTF dosn't count TF correctly
> --------------------------------
>
> Key: SPARK-13103
> URL: https://issues.apache.org/jira/browse/SPARK-13103
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.6.0
> Environment: Ubuntu 14.04
> Python 3.4.3
> Reporter: Louis Liu
>
> I wrote a Python program to calculate frequencies of n-gram sequences with
> HashTF, but it generates strange output: it found more "一一下嗎" than "一一下".
> HashTF gets a word's index with hash(),
> but the hashes of some Chinese words are negative.
> Ex:
> >>> hash('一一下嗎')
> -6433835193350070115
> >>> hash('一一下')
> -5938108283593463272

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
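For readers following the report, here is a minimal sketch of hash-based term-frequency indexing of the kind the reporter describes ("gets words' index with hash()"). It is not PySpark's actual implementation: `index_of` and `term_frequencies` are hypothetical names, and reducing the hash modulo the number of features is an assumption made for illustration. The sketch shows that in Python a negative hash value by itself still maps to a non-negative index, because `%` takes the sign of the divisor.

```python
from collections import Counter


def index_of(term, num_features):
    """Map a term to a bucket index (hypothetical helper, not the PySpark API).

    Python's % returns a result with the sign of the divisor, so a negative
    hash value still lands in the range [0, num_features).
    """
    return hash(term) % num_features


def term_frequencies(terms, num_features=1 << 20):
    """Count how many terms fall into each bucket (illustrative sketch only)."""
    counts = Counter()
    for term in terms:
        counts[index_of(term, num_features)] += 1
    return dict(counts)


# A small negative integer hashes to itself, which makes the modulo easy to check.
assert hash(-7) == -7
assert index_of(-7, 5) == 3

# String hashes may be negative, but the resulting indices never are.
tf = term_frequencies(["一一下", "一一下嗎", "一一下"], num_features=1 << 20)
assert all(0 <= i < (1 << 20) for i in tf)
```

Two distinct terms can still share a bucket, in which case their counts are summed. On Python 3, string hashes also differ between interpreter processes unless `PYTHONHASHSEED` is set, so indices computed with the built-in `hash()` are only comparable within one process. Whether either effect explains the output reported above is not established in this thread.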