[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292823#comment-14292823 ]
Guoqiang Li commented on SPARK-5261:
------------------------------------

[~lewuathe]
{code}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}
wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code}

> In some cases ,The value of word's vector representation is too big
> -------------------------------------------------------------------
>
>                 Key: SPARK-5261
>                 URL: https://issues.apache.org/jira/browse/SPARK-5261
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Guoqiang Li
>
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36)
> {code}
> The average absolute value of the word's vector representation is 60731.8
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(1)
> {code}
> The average absolute value of the word's vector representation is 0.13889
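
The issue reports an "average absolute value" for each run but does not show how it was measured. One plausible reading is the mean of |x| over every component of every word vector. The sketch below computes that, assuming the vectors are available as a plain Map[String, Array[Float]], such as the map returned by MLlib's Word2VecModel.getVectors. The object VectorMagnitude and the helper meanAbs are hypothetical names introduced here for illustration; they are not part of MLlib, and this may not be the exact calculation the reporter used.

```scala
// Sketch only: VectorMagnitude and meanAbs are hypothetical names, not MLlib API.
object VectorMagnitude {

  /** Mean of |x| over every component of every word vector (0.0 if there are none). */
  def meanAbs(vectors: Map[String, Array[Float]]): Double = {
    var sum = 0.0
    var count = 0L
    for (v <- vectors.valuesIterator; x <- v) {
      sum += math.abs(x.toDouble)
      count += 1
    }
    if (count == 0L) 0.0 else sum / count
  }

  def main(args: Array[String]): Unit = {
    // Toy stand-in for a trained model's word -> vector map (assumed data).
    val toy = Map(
      "spark" -> Array(0.5f, -0.25f),
      "mllib" -> Array(-1.0f, 0.25f)
    )
    println(meanAbs(toy)) // prints 0.5
  }
}
```

With a trained model, the same helper could be called as VectorMagnitude.meanAbs(model.getVectors) once for the 36-partition run and once for the 1-partition run, so that both figures are computed the same way.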