Hi Ognen,
Any particular reason for choosing Scalatra over options like Play or Spray?
Is Scalatra much better at serving APIs, or is it due to the similarity with
Ruby's Sinatra?
Did you try the other options and then pick Scalatra?
Thanks.
Deb
On Tue, Mar 4, 2014 at 4:50 AM, Ognen Duzlevski
Hi,
If the join keys are skewed, is there a specific optimized join available
in Spark for such use cases?
I saw that a similar feature is supported in both Scalding and Hive, and I am
testing skewJoinWithSmaller on one of the skewed datasets...
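For reference, a minimal sketch of the usual workaround I have in mind (not a built-in Spark API): salt the hot keys on the big side and replicate them on the smaller side. numSalts, hotKeys, big and small are illustrative names.

import scala.util.Random

val numSalts = 10
val hotKeys: Set[String] = Set("someHotKey")   // found by sampling key counts

val saltedBig = big.map { case (k, v) =>
  val salt = if (hotKeys.contains(k)) Random.nextInt(numSalts) else 0
  ((k, salt), v)
}
val replicatedSmall = small.flatMap { case (k, v) =>
  val salts = if (hotKeys.contains(k)) 0 until numSalts else Seq(0)
  salts.map(s => ((k, s), v))
}
// join on the salted key, then drop the salt
val joined = saltedBig.join(replicatedSmall).map { case ((k, _), v) => (k, v) }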
Hi,
I gave my spark job 16 gb of memory and it is running on 8 executors.
The job needs more memory due to ALS requirements (20M x 1M matrix)
On each node I do have 96 GB of memory and I am using 16 GB of it. I
want to increase the memory but I am not sure what the right way to do that is.
Are these the right options:
1. If there is a spark script, just do a ctrl-c from spark-shell and the
job will be killed properly.
2. For a Spark application, ctrl-c will also kill the job properly on the
cluster.
Somehow the ctrl-c option did not work for us...
Similar option works fine for
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Sun, Mar 16, 2014 at 2:59 PM, Debasish Das debasish.da...@gmail.comwrote:
Are these the right options:
1. If there is a spark script, just do a ctrl-c from spark-shell and the
job
I am getting these weird errors which I have not seen before:
[error] Server access Error: handshake alert: unrecognized_name url=
https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit
[info] Resolving
After a long time (maybe a month) I could do a fresh build for
2.0.0-mr1-cdh4.5.0...I was using the cached files in .ivy2/cache
My case is especially painful since I have to build behind a firewall...
@Sean thanks for the fix...I think we should put a test for https/firewall
compilation as
Niko,
Comparing some other components will be very useful as well...SVD++ from
GraphX vs the same algorithm in GraphLab...also mllib.recommendation.ALS
implicit/explicit compared to the collaborative filtering toolkit in
GraphLab...
To stress test, what's the biggest sparse dataset that you
, not native code. This should be exactly what that
PR I
mentioned fixes.
--
Sean Owen | Director, Data Science | London
On Sun, Mar 16, 2014 at 11:48 AM, Debasish Das
debasish.da...@gmail.com
wrote:
Thanks Sean...let me get the latest code..do you know which PR
Hi,
What's the splittable compression format that works with Spark right now?
We are looking into bzip2 / lzo / gzip...gzip is not splittable so it's not a
good option...Within bzip2/lzo I am confused.
Thanks.
Deb
Classes are serialized and sent to all the workers as akka messages...for
singletons and case classes I am not sure if they are Java-serialized or
Kryo-serialized by default.
But your own classes, if serialized by Kryo, will definitely be much more
efficient...there is a comparison that Matei did for all
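A minimal sketch of what I mean by registering your own classes with Kryo (MyRecord and MyRegistrator are illustrative names):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class MyRecord(id: Long, features: Array[Double])

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyRecord])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")   // use the fully qualified class name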
I think multiplying by ratings is a heuristic that worked on rating-related
problems like the Netflix dataset or any other ratings datasets, but the scope
of NMF is much broader than that.
@Sean please correct me in case you don't agree...
Definitely it's good to add all the rating dataset related
Hi Matei,
How can I run multiple Spark workers per node? I am running an 8-core, 10-node
cluster but I do have 8 more cores on each node...So having 2 workers per
node will definitely help my use case.
Thanks.
Deb
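For the standalone cluster, the settings I am thinking of go into spark-env.sh on every worker node (values are only an illustration for a 16-core, 96 GB box):

SPARK_WORKER_INSTANCES=2   # two worker daemons per node
SPARK_WORKER_CORES=8       # cores each worker hands out
SPARK_WORKER_MEMORY=40g    # memory each worker hands out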
On Wed, Apr 2, 2014 at 3:58 PM, Matei Zaharia matei.zaha...@gmail.comwrote:
Hey
://www.sigmoidanalytics.com
@mayur_rustagi
On Wed, Apr 2, 2014 at 7:19 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi Matei,
How can I run multiple Spark workers per node? I am running an 8-core, 10-node
cluster but I do have 8 more cores on each node...So having 2 workers
per node
:
spark.storage.blockManagerSlaveTimeoutMs
to a higher value. In your case it's setting this to 45000 or 45 seconds.
On Fri, Apr 4, 2014 at 5:52 PM, Debasish Das debasish.da...@gmail.comwrote:
Hi,
In my ALS runs I am noticing messages that complain about heart beats:
14/04/04 20:43:09 WARN
From the documentation this is what I understood:
1. spark.worker.timeout: Number of seconds after which the standalone
deploy master considers a worker lost if it receives no heartbeats.
default: 60
I increased it to be 600
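For reference, a minimal sketch of how I am setting these (spark-defaults.conf style; the values are just what I am experimenting with):

spark.worker.timeout                      600
spark.storage.blockManagerSlaveTimeoutMs  600000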
It was pointed out before that if there is GC overload and the worker
to GC...Persisting the factors to disk at each iteration
will resolve this issue, with a runtime loss of course...
I also have another issue...I run with executor memory as 24 GB but I see
18.4 GB in the executor UI...is that expected?
On Sat, Apr 5, 2014 at 8:16 AM, Debasish Das debasish.da
When you say Spark is one of the forerunners for our technology choice,
what are the other options you are looking into ?
I started cross validation runs on a 40-core, 160 GB Spark job using a
script...I woke up in the morning and none of the jobs had crashed! And the
project just came out of incubation.
MLlib has a decision tree...there is an RF PR which is not active now...take
that and swap the tree builder with the fast tree builder that's in
MLlib...search for the Spark JIRA...the code is based on Google's PLANET
paper...
I am sure people in devlist are already working on it...send an email to
). Additionally, the
lack of multi-class classification limits its applicability.
Also, RF requires random features per tree node to be effective (not
just bootstrap samples), and MLLib decision tree doesn't support that.
On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das
debasish.da...@gmail.com
support
that.
On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das
debasish.da...@gmail.com wrote:
MLlib has a decision tree...there is an RF PR which is not active
now...take that and swap the tree builder with the fast tree builder
that's in MLlib...search for the Spark JIRA...the code is based
, if I did
want to run that example, where would I find the file in question?
It would be great if this were documented, perhaps in the source code.
I'll add a JIRA.
Thanks,
Diana
On Mon, Apr 28, 2014 at 1:41 PM, Debasish Das debasish.da...@gmail.comwrote:
Diana,
Here are the parameters
Hi,
For each line that we read as a text line from HDFS, we have a schema...if
there is an API that takes the schema as a List[Symbol] and maps each token
to the Symbol, it will be helpful...
One solution is to keep data on HDFS as Avro/Protobuf serialized objects,
but I am not sure if that works on HBase.
Hello Prof. Lin,
Awesome news ! I am curious if you have any benchmarks comparing C++ MPI
with Scala Spark liblinear implementations...
Is Spark LIBLINEAR Apache licensed, or are there any specific restrictions
on using it?
Except using native blas libraries (which each user has to manage by
Hi,
How do I load native BLAS libraries on Mac ?
I am getting the following errors while running LR and SVM with SGD:
14/05/07 10:48:13 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
14/05/07 10:48:13 WARN BLAS: Failed to load implementation from:
is compatible with Apache.
Best,
Xiangrui
On Sun, May 11, 2014 at 10:29 AM, Debasish Das debasish.da...@gmail.com
wrote:
Hello Prof. Lin,
Awesome news ! I am curious if you have any benchmarks comparing C++ MPI
with Scala Spark liblinear implementations...
Is Spark Liblinear apache licensed
Hi,
For each line that we read as a text line from HDFS, we have a schema...if
there is an API that takes the schema as a List[Symbol] and maps each token
to the Symbol, it will be helpful...
Do RDDs provide a schema view of the dataset on HDFS?
Thanks.
Deb
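Since a plain RDD does not carry a schema, a minimal sketch of the helper I have in mind (the schema and the comma delimiter are illustrative):

val schema = List('user, 'product, 'rating)

val records = sc.textFile("hdfs:///path/to/data").map { line =>
  // zip the schema with the tokens to get a Map[Symbol, String] per record
  schema.zip(line.split(",")).toMap
}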
There are examples to run them in BinaryClassification.scala in
org.apache.spark.examples...
On Wed, May 14, 2014 at 1:36 PM, yxzhao yxz...@ualr.edu wrote:
Hello,
I found the classification algorithms SVM and LogisticRegression implemented
in the following directory. How do I run them?
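Something like the following should run them from the examples jar; the exact flags are from memory, so check the options printed by the example itself:

./bin/run-example mllib.BinaryClassification --algorithm SVM --numIterations 100 data/mllib/sample_binary_classification_data.txt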
Xiangrui,
Could you point to the JIRA related to tree aggregate ? ...sounds like the
allreduce idea...
I would definitely like to try it on our dataset...
Makoto,
I did run a pretty big sparse dataset (20M rows, 3M sparse features) and I
got 100 iterations of SGD running in 200 seconds...10
600s for Spark vs 5s for Redshift...The numbers look much different from
the amplab benchmark...
https://amplab.cs.berkeley.edu/benchmark/
Is it SSDs or something like that helping Redshift, or is the whole dataset
in memory when you run the query? Could you publish the query?
Also after
Hi,
The Databricks demo at Spark Summit was amazing...what's the frontend stack
used, specifically for rendering multiple reactive charts on the same DOM? Looks
like that's an emerging pattern for correlating different data APIs...
Thanks
Deb
I compiled spark 1.0.1 with 2.3.0cdh5.0.2 today...
No issues with mvn compilation but my sbt build keeps failing on the sql
module...
I just saw that my scala is at 2.11.0 (with brew update)...not sure if
that's why the sbt compilation is failing...retrying..
On Sat, Jul 19, 2014 at 6:16 PM,
Yup...the scala version 2.11.0 caused it...with 2.10.4, I could compile
1.0.1 and HEAD both for 2.3.0cdh5.0.2
On Sat, Jul 19, 2014 at 8:14 PM, Debasish Das debasish.da...@gmail.com
wrote:
I compiled spark 1.0.1 with 2.3.0cdh5.0.2 today...
No issues with mvn compilation but my sbt build
Hi,
We have been using standalone Spark for the last 6 months and I used to run
application jars fine on the Spark cluster with the following command.
java -cp
:/app/data/spark_deploy/conf:/app/data/spark_deploy/lib/spark-assembly-1.0.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.5.0.jar:./app.jar
-Xms2g -Xmx2g
I found the issue...
If you use spark git and generate the assembly jar then
org.apache.hadoop.io.Writable.class is packaged with it
If you use the assembly jar that ships with CDH in
Hi Aureliano,
Will it be possible for you to give the test-case ? You can add it to JIRA
as well as an attachment I guess...
I am preparing the PR for an ADMM-based QuadraticMinimizer...In my MATLAB
experiments with scaling the rank to 1000 and beyond (which is too high for
ALS but gives a good
Dennis,
If it is PLSA with least squares loss, then the QuadraticMinimizer that we
open sourced should be able to solve it for a modest number of topics (up to 1000 I
believe)...if we integrate a CG solver for equality constraints (Nocedal's KNITRO paper
is the reference) the topic size can be increased to much larger than
Hi,
I have set up the SPARK_LOCAL_DIRS option in spark-env.sh so that Spark can
use more shuffle space...
Does Spark clean up all the shuffle files once the runs are done? It seems to
me that the shuffle files are not cleaned...
Do I need to set this variable: spark.cleaner.ttl ?
Right now we are
Actually I faced it yesterday...
I had to put it in spark-env.sh and take it out of spark-defaults.conf on
1.0.1...Note that this setting should be visible on all workers...
After that I validated that SPARK_LOCAL_DIRS was indeed getting used for
shuffling...
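For reference, a minimal sketch of the spark-env.sh entry (paths are illustrative; use a comma-separated list of large local disks on every worker):

SPARK_LOCAL_DIRS=/data1/spark/tmp,/data2/spark/tmp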
On Thu, Aug 14, 2014 at 10:27
Hi,
For our large ALS runs, we are considering using sc.setCheckpointDir so
that the intermediate factors are written to HDFS and the lineage is
broken...
Is there a comparison which shows the performance degradation due to these
options ? If not I will be happy to add experiments with it...
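A minimal sketch of what I mean (the HDFS path and RDD name are illustrative):

// break the lineage by checkpointing to HDFS
sc.setCheckpointDir("hdfs:///user/spark/checkpoints")
factorRdd.checkpoint()   // factorRdd stands in for the intermediate factors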
DB,
Did you compare softmax regression with one-vs-all and find that softmax
is better?
One-vs-all can be implemented as a wrapper over the binary classifiers that we
have in MLlib...I am curious whether softmax multinomial is better in most cases
or whether it is worthwhile to add a one-vs-all version of MLOR.
Hi,
Are there any experiments detailing the performance hit due to HDFS
checkpoint in ALS ?
As we scale to large ranks with more ratings, I believe we have to cut the
RDD lineage to safeguard against the lineage issue...
Thanks.
Deb
Hi Brandon,
Looks very cool...will try it out for ad-hoc analysis of our datasets and
provide more feedback...
Could you please give a bit more detail about the differences between the Spindle
architecture and the Hue + Spark integration (Python stack) and the Ooyala
Jobserver?
Does Spindle allow
Hi Wei,
Sparkler code was not available for benchmarking and so I picked up
Jellyfish which uses SGD and if you look at the paper, the ideas are very
similar to sparkler paper but Jellyfish is on shared memory and uses C code
while sparkler was built on top of spark...Jellyfish used some
Breeze author David also has a GitHub project on CUDA bindings in
Scala...do you prefer using Java or Scala?
On Aug 27, 2014 2:05 PM, Frank van Lankvelt f.vanlankv...@onehippo.com
wrote:
you could try looking at ScalaCL[1], it's targeting OpenCL rather than
CUDA, but that might be close
Hi Reza,
Have you compared with the brute force algorithm for similarity computation
with something like the following in Spark ?
https://github.com/echen/scaldingale
I am adding cosine similarity computation but I do want to compute all-pair
similarities...
Note that the data is sparse for
/apache/spark/pull/1778
Your question wasn't entirely clear - does this answer it?
Best,
Reza
On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi Reza,
Have you compared with the brute force algorithm for similarity
computation with something like the following
that
you don't have to redo your code. Your call if you need it before a week.
Reza
On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das debasish.da...@gmail.com
wrote:
Ohh cool...all-pairs brute force is also part of this PR? Let me pull
it in and test on our dataset...
Thanks.
Deb
On Fri, Sep 5
Also for tall and wide matrices (rows ~60M, columns ~10M), I am considering running a
matrix factorization to reduce the dimension to say ~60M x 50 and then running
all-pair similarity...
Did you also try similar ideas and see positive results?
On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das debasish.da
) if your goal is to find batches of similar points instead of all
pairs above a threshold.
On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das debasish.da...@gmail.com
wrote:
Also for tall and wide (rows ~60M, columns 10M), I am considering running
a matrix factorization to reduce the dimension
, calling dimsum with gamma as PositiveInfinity turns it
into the usual brute force algorithm for cosine similarity, there is no
sampling. This is by design.
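For reference, a minimal sketch of both modes on a RowMatrix (method names as in the merged API; rows is an RDD[Vector] with one row per user):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)
val exact  = mat.columnSimilarities()      // brute force, all pairs
val approx = mat.columnSimilarities(0.1)   // DIMSUM sampling with threshold 0.1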
On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das debasish.da...@gmail.com
wrote:
I looked at the code: similarColumns(Double.posInf
Durin,
I have integrated ECOS with Spark, which uses SuiteSparse under the hood for
linear equation solves...I have exposed only the QP solver API in Spark
since I was comparing IP with proximal algorithms, but we can expose the
SuiteSparse API as well...JNI is used to load up LDL, AMD and the ECOS
version of LDL and
AMD, which are LGPL...
Let me know.
Thanks.
Deb
On Sep 8, 2014 7:04 AM, Debasish Das debasish.da...@gmail.com wrote:
Durin,
I have integrated ECOS with Spark, which uses SuiteSparse under the hood
for linear equation solves...I have exposed only the QP solver API in
Spark
in a distributed way.
-Xiangrui
On Mon, Sep 8, 2014 at 7:12 AM, Debasish Das debasish.da...@gmail.com
wrote:
Xiangrui,
Should I open up a JIRA for this ?
A distributed LP/SOCP solver through ECOS/LDL/AMD?
I can open source it with a GPL license in the Spark code, as that's what our
legal
similarity in a future PR, probably
still for 1.2
On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das debasish.da...@gmail.com
wrote:
Awesome...Let me try it out...
Any plans for putting other similarity measures in the future (Jaccard is
something that will be useful)? I guess it makes sense to add some
one. For dense matrices with say, 1m
columns this won't be computationally feasible and you'll want to start
sampling with dimsum.
It would be helpful to have a loadRowMatrix function, I would use it.
Best,
Reza
On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das debasish.da...@gmail.com
wrote
, Debasish Das debasish.da...@gmail.com
wrote:
Cool...can I add loadRowMatrix in your PR ?
Thanks.
Deb
On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote:
Hi Deb,
Did you mean to message me instead of Xiangrui?
For TS matrices, dimsum with positiveinfinity
We dump fairly big libsvm to compare against liblinear/libsvm...the
following code dumps out libsvm format from SparseVector...
def toLibSvm(features: SparseVector): String = {
  // libsvm uses 1-based indices in "index:value" pairs separated by spaces
  val indices = features.indices.map(_ + 1)
  val values = features.values
  indices.zip(values).map { case (i, v) => s"$i:$v" }.mkString(" ")
}
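Illustrative usage, assuming data is an RDD[LabeledPoint] with sparse features and the output path is a placeholder:

val lines = data.map { p =>
  s"${p.label} ${toLibSvm(p.features.asInstanceOf[SparseVector])}"
}
lines.saveAsTextFile("hdfs:///tmp/libsvm-dump")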
, you can un-normalize the cosine similarities to get the
dot product, and then compute the other similarity measures from the dot
product.
Best,
Reza
On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi Reza,
In similarColumns, it seems with cosine similarity
sc.parallelize(model.weights.toArray, blocks).top(k) will get that right ?
For logistic you might want both positive and negative features...so just
pass it through a filter on abs and then pick top(k)
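A minimal sketch of that filter-on-abs idea, keeping the feature index (k and blocks are illustrative):

val k = 20
val topWeights = sc.parallelize(model.weights.toArray.zipWithIndex, blocks)
  .top(k)(Ordering.by { case (w, _) => math.abs(w) })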
On Thu, Sep 18, 2014 at 10:30 AM, Sameer Tilak ssti...@live.com wrote:
Hi All,
I am able to
the PR. We can add jaccard and other similarity measures in
later PRs.
In the meantime, you can un-normalize the cosine similarities to get the
dot product, and then compute the other similarity measures from the dot
product.
Best,
Reza
On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das
. The PR will updated
today.
Best,
Reza
On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi Reza,
Have you tested whether different runs of the algorithm produce different
similarities (basically whether the algorithm is deterministic)?
This number does not look like
Hi,
I am building a dictionary of RDD[(String, Long)] and after the dictionary
is built and cached, I find the key "almonds" at value 5187 using:
rdd.filter { case (product, index) => product == "almonds" }.collect
Output:
Debug product almonds index 5187
Now I take the same dictionary and write it out as:
- zipWithIndex is being used to assign IDs. From a
recent JIRA discussion I understand this is not deterministic within a
partition so the index can be different when the RDD is reevaluated. If you
need it fixed, persist the zipped RDD on disk or in memory.
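A minimal sketch of that, in my setting (products is the RDD the dictionary is built from):

import org.apache.spark.storage.StorageLevel

val dict = products.zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK)
dict.count()   // materialize once so later lookups and the written-out copy agree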
On Sep 20, 2014 8:10 PM, Debasish Das
clear from docs...
On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das debasish.da...@gmail.com
wrote:
I did not persist / cache it as I assumed zipWithIndex would preserve
order...
There is also zipWithUniqueId...I am trying that...If that also shows the
same issue, we should make it clear
affected, but may not happen to have
exhibited in your test.
On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das debasish.da...@gmail.com
wrote:
Some more debugging revealed that, as Sean said, I have to keep the
dictionaries
persisted till I am done with the RDD manipulation.
Thanks Sean
The HBase region server needs to be balanced...you might have some skewness in
row keys and one region server is under pressure...try finding that key and
replicating it using a random salt.
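A minimal sketch of the salting I mean (numBuckets is illustrative; prefix the hot row key with a bucket so writes spread across region servers):

val numBuckets = 16
def saltedKey(rowKey: String): String = {
  val bucket = (rowKey.hashCode & Integer.MAX_VALUE) % numBuckets
  f"$bucket%02d-$rowKey"
}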
On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi Ted,
It converts RDD[Edge] to
On Wed, Sep 24, 2014 at 8:56 AM, Debasish Das debasish.da...@gmail.com
wrote:
The HBase region server needs to be balanced...you might have some skewness
in row keys and one region server is under pressure...try finding that key
and replicating it using a random salt
On Wed, Sep 24, 2014 at 8:51 AM
If the tree is too big, build it on GraphX...but it will need thorough
analysis so that the partitions are well balanced...
On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg andy.tw...@gmail.com wrote:
Hi Boromir,
Assuming the tree fits in memory, and what you want to do is parallelize
the
Only fit the data in memory where you want to run the iterative
algorithm
For map-reduce operations, it's better not to cache if you have a memory
crunch...
Also schedule the persist and unpersist such that you utilize the RAM
well...
On Tue, Sep 30, 2014 at 4:34 PM, Liquan Pei
Can't you extend a class in place of an object, which can be generic?
class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] {
  def zero(initial: Seq[B]): Seq[B] = Seq.empty[B]
  def addInPlace(a: Seq[B], b: Seq[B]): Seq[B] = a ++ b
}
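Illustrative usage with the Spark 1.x accumulator API, passing the param explicitly:

import org.apache.spark.AccumulatorParam   // needed by the class above

val acc = sc.accumulator(Seq.empty[Int])(new GenericAccumulator[Int])
sc.parallelize(1 to 10).foreach(x => acc += Seq(x))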
On Wed, Oct 1, 2014 at 3:38 AM, Johan Stenberg johanstenber...@gmail.com
wrote:
Just realized that, of course, objects can't be generic, but how do I
create a
If the missing values are 0, then you can also look into implicit
formulation...
On Tue, Sep 30, 2014 at 12:05 PM, Xiangrui Meng men...@gmail.com wrote:
We don't handle missing value imputation in the current version of
MLlib. In future releases, we can store feature information in the
Hi,
We write the output of models and other information as Parquet files and
later we let data APIs run SQL queries on the columnar data...
Spark SQL is used to dump the data in Parquet format and now we are
considering whether to use Spark SQL or Impala to read it back...
I came across this
Another rule of thumb: definitely cache the RDD over which you need
to do iterative analysis...
For the rest of them, only cache if you have a lot of free memory!
On Mon, Oct 6, 2014 at 2:39 PM, Sean Owen so...@cloudera.com wrote:
I think you mean that data2 is a function of data1 in the
I have faced this in the past and I had to add the profile -Phadoop-2.3...
mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -DskipTests install
On Wed, Oct 8, 2014 at 1:40 PM, Chuang Liu liuchuan...@gmail.com wrote:
Hi:
I tried to build Spark (1.1.0) with hadoop 2.4.0, and ran a simple
Awesome news Matei !
Congratulations to the databricks team and all the community members...
On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi folks,
I interrupt your regularly scheduled user / dev list to bring you some
pretty cool news for the project, which
Hi,
Is someone working on a project integrating the Oryx model serving layer
with Spark? Models will be built using either streaming data / batch data
in HDFS and cross-validated with MLlib APIs, but the model serving layer
will give API endpoints like Oryx
and read the models maybe from
the architecture. It has all the things you are thinking
about:)
Thanks,
Jayant
On Sat, Oct 18, 2014 at 8:49 AM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
Is someone working on a project on integrating Oryx model serving layer
with Spark ? Models will be built using either Streaming
Hi Martin,
This problem is Ax = B where A is your matrix [2 1 3 ... 1; 1 0 3 ...;]
and x is what you want to find...B is 0 in this case...For MLlib, B is normally
the label...basically create a LabeledPoint where the label is always 0...
Use MLlib's linear regression and solve the following
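A minimal sketch of that, with a couple of illustrative rows of A:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val rows = Seq(Array(2.0, 1.0, 3.0, 1.0), Array(1.0, 0.0, 3.0, 2.0))
val data = sc.parallelize(rows.map(r => LabeledPoint(0.0, Vectors.dense(r))))
val model = LinearRegressionWithSGD.train(data, 100)   // 100 iterations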
If the SVM is not already migrated to BFGS, that's the first thing you
should try...Basically, following the LBFGS Logistic Regression, come up with
an LBFGS-based linear SVM...
About integrating TRON in MLlib, David already has a version of TRON in
Breeze but someone needs to validate it for linear SVM.
@dbtsai for the condition number what did you use? Diagonal preconditioning of
the inverse of the B matrix? But then the B matrix keeps changing...did you
condition it after every few iterations?
Will it be possible to put that code in Breeze since it will be very useful
to condition other solvers as
:33 PM, Chih-Jen Lin cj...@csie.ntu.edu.tw wrote:
Debasish Das writes:
If the SVM is not already migrated to BFGS, that's the first thing you
should
try...Basically following LBFGS Logistic Regression come up with LBFGS
based
linear SVM...
About integrating TRON in mllib, David
Hi,
I just built the master today and I was testing the IR metrics (MAP and
prec@k) on Movielens data to establish a baseline...
I am getting a weird error which I have not seen before:
MASTER=spark://TUSCA09LMLVT00C.local:7077 ./bin/run-example
mllib.MovieLensALS --kryo --lambda 0.065
userFeatures.lookup(user).head to
work ?
On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote:
Was the user present in training? We can put a check there and return
NaN if the user is not included in the model. -Xiangrui
On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da
if the user is not included in the model. -Xiangrui
On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
I am testing MatrixFactorizationModel.predict(user: Int, product: Int)
but
the code fails on userFeatures.lookup(user).head
In computeRmse
Hi,
I am doing a flatMap followed by mapPartitions to do some blocked
operation...flatMap is shuffling data but this shuffle is strictly
shuffling to disk and not over the network right ?
Thanks.
Deb
only if output RDD is expected to be
partitioned by some key.
RDD[X].flatMap(X => RDD[Y])
If it has to shuffle it should be local.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das
Hi,
If I look inside the Algebird Monoid implementation it uses
java.io.Serializable...
But when we use CMS/HLL in examples.streaming.TwitterAlgebirdCMS, I don't
see a KryoRegistrator for the CMS and HLL monoids...
In these examples will we run with Kryo serialization on CMS and HLL, or
will they be Java
groupByKey does not run a combiner so be careful about the
performance...groupByKey does shuffle even for local groups...
reduceByKey and aggregateByKey do run a combiner, but if you want a
separate function for each key, you can have a key-to-closure map that you
can broadcast and use it in
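A minimal sketch of that broadcast idea (keys, functions and the pairs RDD are illustrative); the key is carried inside the value so the reduce function can look up the right closure:

val fnByKey: Map[String, (Double, Double) => Double] = Map(
  "clicks"  -> ((a, b) => a + b),
  "latency" -> ((a, b) => math.max(a, b)))
val bcFns = sc.broadcast(fnByKey)

val reduced = pairs                              // pairs: RDD[(String, Double)]
  .map { case (k, v) => (k, (k, v)) }
  .reduceByKey { case ((k, a), (_, b)) => (k, bcFns.value(k)(a, b)) }
  .mapValues(_._2)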
Use zipWithIndex but cache the data before you run zipWithIndex...that way
your ordering will be consistent (unless the bug has been fixed where you
don't have to cache the data)...
Normally these operations are used for dictionary building and so I am
hoping you can cache the dictionary of
I run my Spark on YARN jobs as:
HADOOP_CONF_DIR=/etc/hadoop/conf/ /app/data/v606014/dist/bin/spark-submit
--master yarn --jars test-job.jar --executor-cores 4 --num-executors 10
--executor-memory 16g --driver-memory 4g --class TestClass test.jar
It uses HADOOP_CONF_DIR to schedule executors and
I have used Breeze fine with the Scala shell:
scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.
rdd.top collects it on the master...
If you want top-k per key, run map / mapPartitions and use a bounded
priority queue, then reduceByKey the queues.
I experimented with top-k from Algebird and a bounded priority queue wrapped
over java.util.PriorityQueue (the Spark default)...the BPQ is faster.
Code example is here:
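Separately, a minimal sketch of the same pattern (not the linked example), using a capped sorted Seq as a simple stand-in for the bounded priority queue; k and pairs are illustrative:

val k = 10
def add(acc: Seq[Double], v: Double): Seq[Double] =
  (acc :+ v).sorted(Ordering[Double].reverse).take(k)
def merge(a: Seq[Double], b: Seq[Double]): Seq[Double] =
  (a ++ b).sorted(Ordering[Double].reverse).take(k)

// combiner-friendly: per-partition top-k first, then merge across partitions
val topKPerKey = pairs                      // pairs: RDD[(String, Double)]
  .aggregateByKey(Seq.empty[Double])(add, merge)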
Apriori can be thought of as post-processing on the product similarity
graph...I call it product similarity, but for each product you build a node
which keeps the distinct users visiting the product, and two product nodes are
connected by an edge if the intersection is > 0...you are assuming that if no one
user
Hi Bui,
Please use BFGS-based solvers...For BFGS you don't have to specify a step
size since the line search will find sufficient decrease each time...
For regularization you still have to do a grid search...it's not possible to
automate that, but on master you will find nice ways to automate grid
If you have a tall and skinny matrix of m users and n products, column
similarity will give you an n x n matrix (product x product matrix)...this
is also called the product correlation matrix...it can be cosine, Pearson or
other kinds of correlations...Note that if the entry is unobserved (user
Joanary did
Hi Dib,
For our use case I want my Spark job1 to read from HDFS/cache and write to
Kafka queues. Similarly, Spark job2 should read from Kafka queues and write
to Kafka queues.
Is writing to Kafka queues from a Spark job supported in your code?
Thanks
Deb
On Jan 15, 2015 11:21 PM, Akhil Das
...
Neither Play nor Spray is being used in Spark right now...so it brings in new
dependencies and we already know about the Akka conflicts...the Thrift server, on
the other hand, is already integrated for JDBC access.
On Tue, Feb 10, 2015 at 3:43 PM, Debasish Das debasish.da...@gmail.com
wrote:
Also I wanted
Hi,
I am running brute force similarity from RowMatrix on a job with a 5M x 1.5M
sparse matrix with 800M entries. With 200M entries the job runs fine, but
with 800M I am getting exceptions like 'too many open files' and 'no space
left on device'...
Seems like I need more nodes or should use dimsum sampling?
that the key would be filtered.
And then after, run a flatMap or something to make Option[B] into B.
On Thu, Feb 19, 2015 at 2:21 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
Before I send out the keys for the network shuffle, in reduceByKey after
map + combine are done, I would like