Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/8513#issuecomment-144873979
@holdenk LGTM. The reason for making the window size constant is that it
does not affect the result much given a large corpus.
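For context, extracting word2vec training pairs with a fixed context window can be sketched as follows (a minimal illustration of the windowing idea only, not the MLlib implementation; the function name and toy corpus are made up):

```python
def context_pairs(tokens, window=5):
    """Yield (center, context) skip-gram training pairs for a fixed window size."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

pairs = context_pairs(["a", "b", "c", "d"], window=1)
```

With a large corpus, the total number of pairs grows with corpus size regardless of the window, which is why a modest fixed window tends to matter little.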
---
If your project is
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/3173#discussion_r20056877
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -84,6 +84,10 @@ private[sql] abstract class SparkStrategies
GitHub user Ishiihara opened a pull request:
https://github.com/apache/spark/pull/3173
[SPARK-2213][SQL] Sort Merge Join
This PR adds a MergeJoin operator to Spark SQL. The semantics of the MergeJoin
operator are similar to Hive's sort-merge bucket join.
MergeJoin ope
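The core of a sort-merge join can be sketched in a few lines (an illustrative sketch of the general algorithm, not the operator added by this PR):

```python
def sort_merge_join(left, right):
    """Inner join of two lists of (key, value) pairs via sort-merge.

    Both sides are sorted by key, then scanned with two pointers;
    equal-key runs on each side are cross-joined.
    """
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # collect the run of equal keys on each side
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == rk:
                j2 += 1
            for a in range(i, i2):
                for b in range(j, j2):
                    out.append((lk, left[a][1], right[b][1]))
            i, j = i2, j2
    return out
```

For example, joining [(1, 'a'), (2, 'b')] with [(2, 'x'), (3, 'y')] yields [(2, 'b', 'x')]. The appeal over a hash join is that once both sides are sorted, the merge is a single streaming pass with no build-side table held in memory.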
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2723#issuecomment-60881119
@marmbrus All test failures have the same pattern:
select * from a right outer join b on condition1 join c on condition2
With the extra join
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2723#issuecomment-60861600
test this please
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2866#issuecomment-59979144
@JoshRosen Thank you.
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2866#issuecomment-59883834
@JoshRosen I have been looking into the compressed bitmap and already have a
good idea of how to use Roaring bitmap to perform the task. If this work is not
urgent, can
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2819#discussion_r18932327
--- Diff: python/pyspark/mllib/feature.py ---
@@ -95,90 +360,46 @@ class Word2Vec(object):
>>> sentence = "a b "
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2815#discussion_r18917918
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Graph.scala ---
@@ -195,6 +195,12 @@ abstract class Graph[VD: ClassTag, ED: ClassTag]
protected
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2723#issuecomment-59001540
test this please
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2758#issuecomment-58731952
this is ok to test.
---
GitHub user Ishiihara opened a pull request:
https://github.com/apache/spark/pull/2758
[SQL] Small bug in unresolved.scala
name should throw an exception with name instead of exprId.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2723#issuecomment-58448551
This depends on https://github.com/apache/spark/pull/2719
---
GitHub user Ishiihara opened a pull request:
https://github.com/apache/spark/pull/2723
[WIP][SQL][SPARK-3839] Reimplement Left/Right outer join
This is a work-in-progress PR. It reimplements Left/Right outer join
using only one hash table.
You can merge this pull request
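The single-hash-table approach for a right outer join might look roughly like this (a hedged sketch of the general technique only, not this PR's actual code): build one hash table over the left side, stream the right side, and emit a null-padded row when no match is found.

```python
def right_outer_join(left, right):
    """Right outer join of (key, value) lists using a single hash table
    built over the left (build) side; the right (probe) side is streamed."""
    table = {}
    for k, v in left:
        table.setdefault(k, []).append(v)
    out = []
    for k, v in right:
        matches = table.get(k)
        if matches:
            for lv in matches:
                out.append((k, lv, v))
        else:
            out.append((k, None, v))  # no left match: pad with null
    return out
```

A left outer join is symmetric: build over the right side and stream the left. Using a single table avoids materializing both sides in memory.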
GitHub user Ishiihara opened a pull request:
https://github.com/apache/spark/pull/2706
[SQL][Doc] Keep Spark SQL README.md up to date
@marmbrus
Update README.md to be consistent with Spark 1.1
You can merge this pull request into a Git repository by running:
$ git pull
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2356#issuecomment-58271779
retest this please
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2356#issuecomment-58271152
test this please
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2356#issuecomment-58270419
test this please
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2356#issuecomment-58252347
test this please
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2356#issuecomment-58119086
@mengxr will take care of that and other comments
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2595#issuecomment-57426079
@chouqin You can run SPARK_TESTING=1 ./bin/pyspark
python/pyspark/my_file.py to run unit tests for a certain file. In your case,
use SPARK_TESTING=1 ./bin/pyspark
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2470#issuecomment-57274044
@rxin I looked through Roaring bitmap and it is highly compressed
compared with other bitmap implementations. I will start working on this
and keep you
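To illustrate why a compressed bitmap helps when bits come in long runs, here is a toy run-length-encoded bitmap (purely illustrative; RoaringBitmap itself partitions the index space into containers of sorted arrays, bitmaps, and runs rather than using plain RLE):

```python
def rle_encode(bits):
    """Run-length encode a list of 0/1 bits as (bit, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs

def rle_contains(runs, idx):
    """Test bit idx against the run-length encoding."""
    pos = 0
    for b, n in runs:
        if idx < pos + n:
            return b == 1
        pos += n
    return False
```

A million-bit map that is all zeros except one dense block compresses to a handful of runs, which is the property that makes compressed bitmaps attractive for tracking map outputs.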
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2580#discussion_r18173898
--- Diff: core/src/main/scala/org/apache/spark/network/ManagedBuffer.scala
---
@@ -71,6 +73,14 @@ final class FileSegmentManagedBuffer(val file: File, val
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2580#discussion_r18173321
--- Diff: core/src/main/scala/org/apache/spark/network/ManagedBuffer.scala
---
@@ -71,6 +73,14 @@ final class FileSegmentManagedBuffer(val file: File, val
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2356#issuecomment-57046286
@mengxr Repartition is very slow when caching on the Python side. It takes 9
minutes to do the repartition, whereas caching in Java only takes 5s.
---
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2356#discussion_r18122597
--- Diff: python/pyspark/mllib/Word2Vec.py ---
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2356#discussion_r18122598
--- Diff: python/pyspark/mllib/Word2Vec.py ---
@@ -0,0 +1,124 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2356#discussion_r18120761
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2356#discussion_r18118109
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2356#discussion_r18117647
--- Diff: python/pyspark/mllib/Word2Vec.py ---
@@ -0,0 +1,123 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2356#discussion_r18117608
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2356#discussion_r18117604
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2356#discussion_r18117584
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2356#discussion_r18117593
--- Diff: python/pyspark/mllib/Word2Vec.py ---
@@ -0,0 +1,123 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2356#discussion_r18117490
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2394#issuecomment-57006487
@mengxr @epahomov Added some comments after quickly going through the code.
Will take a deeper look at the algorithm later.
---
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18107306
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18107318
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18107249
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/regression/StochasticGradientBoostingSuite.scala
---
@@ -0,0 +1,44 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18107150
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/regression/StochasticGradientBoostingSuite.scala
---
@@ -0,0 +1,44 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18107119
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/regression/StochasticGradientBoostingSuite.scala
---
@@ -0,0 +1,44 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18107064
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18106962
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18106585
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18106472
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18106416
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18106266
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18106192
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18106160
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18105497
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18105459
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2394#discussion_r18105434
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
---
@@ -0,0 +1,173 @@
+package
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2356#issuecomment-56869195
@mengxr PR updated to use new pickle SerDe.
---
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2533#discussion_r18027094
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
---
@@ -179,25 +178,22 @@ class
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2533#discussion_r18027054
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
---
@@ -179,25 +178,22 @@ class
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2533#discussion_r18026933
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
---
@@ -179,25 +178,22 @@ class
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2533#discussion_r18026850
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
---
@@ -149,13 +147,14 @@ class
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2533#discussion_r18026763
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
---
@@ -126,8 +124,8 @@ class
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2533#discussion_r18026654
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
---
@@ -85,16 +79,18 @@ class
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2533#discussion_r18026675
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
---
@@ -104,13 +100,15 @@ class
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2533#discussion_r18026563
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
---
@@ -85,16 +79,18 @@ class
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2533#discussion_r18026314
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
---
@@ -62,15 +62,9 @@ class
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2490#discussion_r17959656
--- Diff: docs/programming-guide.md ---
@@ -1183,6 +1188,10 @@ running on the cluster can then add to it using the
`add` method or the `+=` ope
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2490#discussion_r17959346
--- Diff: docs/programming-guide.md ---
@@ -1121,6 +1121,11 @@ than shipping a copy of it with tasks. They can be
used, for example, to give ev
large
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2494#issuecomment-56571123
@rnowling LGTM in general. Some comments on style.
---
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2494#discussion_r17930496
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -60,13 +70,16 @@ class IDF {
private object IDF
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2494#discussion_r17886499
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -30,9 +30,20 @@ import org.apache.spark.rdd.RDD
* Inverse document
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2494#issuecomment-56461303
@rnowling Please run sbt/sbt scalastyle on your local machine to clear out
style issues.
---
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2494#discussion_r17880822
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -123,7 +134,17 @@ private object IDF {
val inv = new Array[Double
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2494#issuecomment-56445953
One question: with this parameter set, it also filters out words that are
very important to some documents. Say some word occurs many times in 1
or 2 documents
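The concern above is about the document-frequency cutoff being discussed for IDF: terms below the threshold get a weight of zero even if they are highly characteristic of one or two documents. A minimal sketch of that behavior (illustrative only; the parameter name and the smoothed formula log((m + 1) / (df + 1)) are assumptions about the implementation under review):

```python
import math

def idf(doc_freqs, num_docs, min_doc_freq=0):
    """Smoothed IDF per term; terms under min_doc_freq get weight 0."""
    return [
        math.log((num_docs + 1.0) / (df + 1.0)) if df >= min_doc_freq else 0.0
        for df in doc_freqs
    ]
```

With min_doc_freq=2 and 10 documents, a term seen in only 1 document is zeroed out entirely, which is exactly the case the comment worries about.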
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2494#discussion_r17879054
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala ---
@@ -54,4 +54,38 @@ class IDFSuite extends FunSuite with LocalSparkContext
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/2494#discussion_r17878647
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -123,7 +134,17 @@ private object IDF {
val inv = new Array[Double
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2356#issuecomment-56420682
We need to modify the implementation to use the new SerDe mechanism.
Working on that now.
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2470#issuecomment-56310242
@rxin @lemire Started looking at Roaring.
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2470#issuecomment-56292536
@rxin I am definitely interested in working on adding a compressed bitmap.
What is the first step? Thanks.
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2470#issuecomment-56277955
Thanks for the reply. Another question: in hash shuffle write, the data
may be skewed across different map output files. In some cases, the reducer may
try to fetch
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2470#issuecomment-56277206
@rxin my understanding is that MapStatus is used to check whether a map
output file contains data for a certain reducer. Why do we use the actual size
instead of a boolean
GitHub user Ishiihara opened a pull request:
https://github.com/apache/spark/pull/2356
[SPARK-3486][MLlib][PySpark] PySpark support for Word2Vec
@mengxr
Added PySpark support for Word2Vec
Change list
(1) PySpark support for Word2Vec
(2) SerDe support of string
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2049#issuecomment-52724285
Good point. This reduces the need for a temp object to store the output
model. Although None is output, it is a much smaller object compared with the
vector.
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/2043#issuecomment-52704090
Looks good to me.
---
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/1871#issuecomment-52702675
@mateiz This is taken care of by https://github.com/apache/spark/pull/1932
and is already merged in master and 1.1. In that PR, the model output by each
partition is
Github user Ishiihara closed the pull request at:
https://github.com/apache/spark/pull/1871
---
GitHub user Ishiihara opened a pull request:
https://github.com/apache/spark/pull/2010
[MLlib] Remove transform(dataset: RDD[String]) from Word2Vec public API
@mengxr
Remove transform(dataset: RDD[String]) from public API.
You can merge this pull request into a Git repository
GitHub user Ishiihara opened a pull request:
https://github.com/apache/spark/pull/2003
[SPARK-2842][MLlib]Word2Vec documentation
Documentation for Word2Vec
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Ishiihara/spark Word2Vec
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/1932#discussion_r16222465
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -284,16 +284,15 @@ class Word2Vec extends Serializable with Logging
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/1932#discussion_r16222127
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -284,16 +284,15 @@ class Word2Vec extends Serializable with Logging
GitHub user Ishiihara opened a pull request:
https://github.com/apache/spark/pull/1932
[SPARK-2907][MLlib] Word2Vec performance improve
@mengxr Please review the code. Adding weights in reduceByKey soon.
Only output model entries for words that appeared in the partition before
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/1871#issuecomment-51878228
@mateiz The performance of PrimitiveKeyOpenHashMap is on par with
mutable.HashMap. For the one-partition case, PrimitiveKeyOpenHashMap is
slightly faster than using
GitHub user Ishiihara opened a pull request:
https://github.com/apache/spark/pull/1900
[MLlib] Correctly set vectorSize and alpha
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Ishiihara/spark Word2Vec-bugfix
Alternatively you
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/1871#issuecomment-51724432
@mengxr Some benchmark results
Environment: OSX 10.9, 8G memory, 2.5GHz i5 CPU, 4 threads
startingAlpha = 0.0025
vectorSize = 100
Driver memory 2g
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/1871#issuecomment-51720995
@mengxr It is about 1-2 minutes slower with vector size = 100 for
different numbers of partitions.
---
GitHub user Ishiihara opened a pull request:
https://github.com/apache/spark/pull/1871
[SPARK-2907] [MLlib] Use mutable.HashMap to represent model in Word2Vec
Change list:
1. Used mutable.HashMap to represent syn0Global and syn1Global to reduce
shuffle size.
2. Introduced
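Representing each partition's partial model as a hash map keyed by word (emitting entries only for words seen in that partition) and then summing the maps can be sketched as follows (an illustrative Python sketch of the merge step, not the PR's Scala code):

```python
def merge_models(partials):
    """Sum per-partition partial models; each partial is a dict word -> vector.

    Missing words simply contribute nothing, so each partition only ships
    entries for words it actually saw, shrinking the shuffle size.
    """
    merged = {}
    for partial in partials:
        for word, vec in partial.items():
            if word in merged:
                merged[word] = [a + b for a, b in zip(merged[word], vec)]
            else:
                merged[word] = list(vec)
    return merged
```

In Spark this element-wise sum would be the combine function handed to reduceByKey; the sketch just shows why a sparse hash-map representation beats shipping full dense matrices per partition.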
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/1790#issuecomment-51234797
@mengxr LGTM. We may need a better implementation of TopK. It is also worth
trying to change the starting alpha in each iteration.
---
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/1719#discussion_r15741135
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/1719#issuecomment-50949833
@mengxr results for 4 and 10 partitions make sense, but the result for 100
partitions doesn't.
Made changes according to review except the random seed.
Github user Ishiihara commented on a diff in the pull request:
https://github.com/apache/spark/pull/1719#discussion_r15723320
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -0,0 +1,375 @@
+/*
+* Licensed to the Apache Software
Github user Ishiihara commented on the pull request:
https://github.com/apache/spark/pull/1719#issuecomment-50904281
@mengxr Code formatting done. Working on a test case for the algorithm.
---
GitHub user Ishiihara opened a pull request:
https://github.com/apache/spark/pull/1719
[MLlib] word2vec: Distributed Representation of Words
Vector representation of words. This is a pull request regarding SPARK-2510
at https://issues.apache.org/jira/browse/SPARK-2510
You can