[
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243456#comment-14243456
]
Debasish Das commented on SPARK-2426:
-
[~akopich] I got good MAP results
[
https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241535#comment-14241535
]
Debasish Das commented on SPARK-4675:
-
There are a few issues:
1. Batch API for topK
[
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242031#comment-14242031
]
Debasish Das commented on SPARK-4823:
-
I am considering coming up with a baseline
[
https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242034#comment-14242034
]
Debasish Das commented on SPARK-4675:
-
[~josephkb] how do we validate that low
Hi,
It seems there are multiple places where we would like to compute row
similarity (accurate or approximate similarities)
Basically through RowMatrix columnSimilarities we can compute column
similarities of a tall skinny matrix
Similarly we should have an API in RowMatrix called
of a matrix A (i.e. computing
AA^T, which is expensive).
There is a JIRA to track handling (1) and (2) more efficiently than
computing all pairs: https://issues.apache.org/jira/browse/SPARK-3066
On Wed, Dec 10, 2014 at 2:44 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
It seems
If you have a tall x skinny matrix of m users and n products, column
similarity will give you an n x n matrix (product x product matrix)...this
is also called the product correlation matrix...it can be cosine, Pearson or
other kinds of correlations...Note that if the entry is unobserved (user
Joanary did
Hi Bui,
Please use BFGS-based solvers...For BFGS you don't have to specify a step
size since the line search will find sufficient decrease each time...
For regularization you still have to do a grid search...it's not possible to
automate that, but on master you will find nice ways to automate grid
Apriori can be thought of as post-processing on a product similarity graph...I
call it product similarity, but for each product you build a node which
keeps the distinct users visiting the product, and two product nodes are
connected by an edge if the intersection is > 0...you are assuming if no one
user
rdd.top collects it on the master...
If you want topk for a key, run map / mapPartitions and use a bounded
priority queue, and reduceByKey the queues.
I experimented with topk from Algebird and a bounded priority queue wrapped
over a Java priority queue (the Spark default)...BPQ is faster
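The map + reduceByKey-the-queues pattern above can be sketched with plain Scala collections standing in for the RDD. `BoundedPQ` and `TopKPerKey` below are hypothetical illustrative names, and `BoundedPQ` is a minimal stand-in for a bounded priority queue (Spark's own lives in `org.apache.spark.util.BoundedPriorityQueue`), not the Algebird or JPriorityQueue variants mentioned:

```scala
import scala.collection.mutable

// Minimal bounded priority queue: keeps only the k largest elements seen.
class BoundedPQ(k: Int) {
  // min-heap, so the smallest of the kept elements is evicted first
  private val heap = mutable.PriorityQueue.empty[Double](Ordering[Double].reverse)
  def add(x: Double): this.type = {
    if (heap.size < k) heap.enqueue(x)
    else if (x > heap.head) { heap.dequeue(); heap.enqueue(x) }
    this
  }
  def merge(other: BoundedPQ): this.type = { other.heap.foreach(add); this }
  def toSortedSeq: Seq[Double] = heap.toSeq.sorted(Ordering[Double].reverse)
}

object TopKPerKey {
  // One queue per record here for clarity; mapPartitions would build
  // one queue per (key, partition) instead.
  def topK(data: Seq[(Int, Double)], k: Int): Map[Int, Seq[Double]] = {
    data
      .map { case (key, v) => (key, new BoundedPQ(k).add(v)) } // map side
      .groupBy(_._1)                                           // stand-in for the shuffle
      .map { case (key, qs) =>                                 // reduce side: merge queues
        (key, qs.map(_._2).reduce(_ merge _).toSortedSeq)
      }
  }
}
```

Because bounded top-k queues merge associatively, the reduce side gives the exact global top-k per key while only ever shuffling k values per key per partition.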
Code example is here:
I have used breeze fine with scala shell:
scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.
with Jellyfish code (http://i.stanford.edu/hazy/victor/Hogwild/), will
reproduce the failure...
https://issues.apache.org/jira/browse/SPARK-4231
The failed job I will debug more and figure out the real cause. If needed I
will open up new JIRAs.
On Sun, Nov 23, 2014 at 9:50 AM, Debasish Das
-1 from me...same FetchFailed issue as what Hector saw...
I am running Netflix dataset and dumping out recommendation for all users.
It shuffles around 100 GB data on disk to run a reduceByKey per user on
utils.BoundedPriorityQueue...The code runs fine with MovieLens1m dataset...
I gave Spark 10
[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222024#comment-14222024
]
Debasish Das commented on SPARK-1405:
-
We need a larger dataset as well where topics
[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222024#comment-14222024
]
Debasish Das edited comment on SPARK-1405 at 11/22/14 4:22 PM
[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222027#comment-14222027
]
Debasish Das commented on SPARK-1405:
-
[~pedrorodriguez] did you write the metric
[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222089#comment-14222089
]
Debasish Das commented on SPARK-1405:
-
NIPS dataset is common for PLSA and additive
[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222108#comment-14222108
]
Debasish Das commented on SPARK-1405:
-
@sparks that will be awesome...I should be fine
[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222108#comment-14222108
]
Debasish Das edited comment on SPARK-1405 at 11/22/14 6:40 PM
[
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221379#comment-14221379
]
Debasish Das commented on SPARK-3066:
-
I did experiments on MovieLens dataset
[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221505#comment-14221505
]
Debasish Das edited comment on SPARK-1405 at 11/21/14 10:28 PM
[
https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219799#comment-14219799
]
Debasish Das commented on SPARK-4231:
-
[~srowen] I added batch predict APIs for user
[
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218667#comment-14218667
]
Debasish Das edited comment on SPARK-3066 at 11/19/14 10:59 PM
[
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218667#comment-14218667
]
Debasish Das commented on SPARK-3066:
-
@mengxr as per our discussions, I added APIs
[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218845#comment-14218845
]
Debasish Das commented on SPARK-1405:
-
I would like to compare the LSA formulations
[
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218891#comment-14218891
]
Debasish Das commented on SPARK-2426:
-
With the MAP measures being added
[
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218891#comment-14218891
]
Debasish Das edited comment on SPARK-2426 at 11/20/14 2:13 AM
[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218941#comment-14218941
]
Debasish Das commented on SPARK-1405:
-
For LSA you can find references on the PR
and appears in test, we can simply
ignore it. -Xiangrui
On Tue, Nov 18, 2014 at 6:59 AM, Debasish Das debasish.da...@gmail.com
wrote:
Sean,
I thought sampleByKey (stratified sampling) in 1.1 was designed to solve
the problem that randomSplit can't sample by key...
Xiangrui,
What's
groupByKey does not run a combiner, so be careful about the
performance...groupByKey does a shuffle even for local groups...
reduceByKey and aggregateByKey do run a combiner, but if you want a
separate function for each key, you can have a key-to-closure map that you
can broadcast and use it in
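The key-to-closure idea can be sketched locally with Scala collections simulating the combiner (the `PerKeyCombine` object and `perKeyOp` map are illustrative assumptions, not from the thread): each "partition" pre-aggregates with the key's own function, then the partial results are merged, mirroring aggregateByKey's map-side and reduce-side behavior.

```scala
object PerKeyCombine {
  // Per-key reduce functions, as would be broadcast to executors.
  val perKeyOp: Map[String, (Double, Double) => Double] = Map(
    "sum" -> ((a: Double, b: Double) => a + b),
    "max" -> ((a: Double, b: Double) => math.max(a, b))
  )

  // Combiner-style aggregation: each "partition" pre-aggregates locally
  // (the combiner), then the partials are merged, as reduceByKey /
  // aggregateByKey would after a shuffle.
  def aggregate(partitions: Seq[Seq[(String, Double)]]): Map[String, Double] = {
    val partials = partitions.map { part =>
      part.groupBy(_._1).map { case (k, vs) =>
        k -> vs.map(_._2).reduce(perKeyOp(k)) // map-side combine, key's own op
      }
    }
    partials.flatten.groupBy(_._1).map { case (k, vs) =>
      k -> vs.map(_._2).reduce(perKeyOp(k))   // reduce-side merge
    }
  }
}
```

This only works when each per-key function is associative and commutative, which is also what reduceByKey requires.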
Use zipWithIndex but cache the data before you run zipWithIndex...that way
your ordering will be consistent (unless the bug has been fixed where you
don't have to cache the data)...
Normally these operations are used for dictionary building and so I am
hoping you can cache the dictionary of
I run my Spark on YARN jobs as:
HADOOP_CONF_DIR=/etc/hadoop/conf/ /app/data/v606014/dist/bin/spark-submit
--master yarn --jars test-job.jar --executor-cores 4 --num-executors 10
--executor-memory 16g --driver-memory 4g --class TestClass test.jar
It uses HADOOP_CONF_DIR to schedule executors and
Andrew,
I put up 1.1.1 branch and I am getting shuffle failures while doing flatMap
followed by groupBy...My cluster memory is less than the memory I need and
therefore flatMap does around 400 GB of shuffle...memory is around 120 GB...
14/11/13 23:10:49 WARN TaskSetManager: Lost task 22.1 in
Hi,
I have a rdd whose key is a userId and value is (movieId, rating)...
I want to sample 80% of the (movieId,rating) that each userId has seen for
train, rest is for test...
val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}
val keyedRatings = indexedRating.map{x =>
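A local sketch of the per-user 80/20 split being asked for, with plain Scala collections in place of the RDD (the `PerUserSplit` name and exact-count shuffling are assumptions; `sampleByKey` would instead Bernoulli-sample each record at the given fraction):

```scala
import scala.util.Random

object PerUserSplit {
  // Split each user's (movieId, rating) list into ~80% train / 20% test,
  // shuffling per user so the held-out items are random.
  def splitExact(ratings: Seq[(Int, (Int, Double))], trainFrac: Double, seed: Long)
      : (Seq[(Int, (Int, Double))], Seq[(Int, (Int, Double))]) = {
    val rng = new Random(seed)
    val parts = ratings.groupBy(_._1).values.map { rs =>
      val shuffled = rng.shuffle(rs)
      shuffled.splitAt(math.round(trainFrac * rs.size).toInt) // (train, test)
    }
    (parts.flatMap(_._1).toSeq, parts.flatMap(_._2).toSeq)
  }
}
```

On a real RDD the same effect comes from keying by userId, grouping, and splitting inside a mapPartitions, or from sampleByKey with a per-key fraction of 0.8.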
only if output RDD is expected to be
partitioned by some key.
RDD[X].flatMap(X => RDD[Y])
If it has to shuffle it should be local.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das
Hi,
If I look inside algebird Monoid implementation it uses
java.io.Serializable...
But when we use CMS/HLL in examples.streaming.TwitterAlgebirdCMS, I don't
see a KryoRegistrator for CMS and HLL monoid...
In these examples, will we run with Kryo serialization on CMS and HLL, or
will they be Java
[
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209936#comment-14209936
]
Debasish Das commented on SPARK-3066:
-
On our internal datasets, flatMap is slow...I
Hi,
I am noticing the first step for Spark jobs does a TimSort in 1.2
branch...and there is some time spent doing the TimSort...Is this assigning
the RDD blocks to different nodes based on a sort order ?
Could someone please point to a JIRA about this change so that I can read
more about it ?
Hi,
I am doing a flatMap followed by mapPartitions to do some blocked
operation...flatMap is shuffling data but this shuffle is strictly
shuffling to disk and not over the network right ?
Thanks.
Deb
[
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207298#comment-14207298
]
Debasish Das commented on SPARK-3066:
-
[~mengxr] I am testing recommendAllUsers
/SPARK-3066
The easiest case is when one side is small. If both sides are large,
this is a super-expensive operation. We can do block-wise cross
product and then find top-k for each user.
Best,
Xiangrui
On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das debasish.da...@gmail.com
wrote
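The block-wise cross product with a running per-user top-k that Xiangrui describes can be sketched locally like this (all names are illustrative; a real implementation would block both factor RDDs and never materialize the full user x item score matrix):

```scala
object BlockTopK {
  // userFactors: userId -> factor vector; itemFactors: itemId -> factor vector.
  // Items are processed in blocks; per user we keep only a running top-k of
  // scores, so memory stays O(#users * k) instead of O(#users * #items).
  def recommend(userFactors: Map[Int, Array[Double]],
                itemFactors: Map[Int, Array[Double]],
                k: Int, blockSize: Int): Map[Int, Seq[(Int, Double)]] = {
    def dot(a: Array[Double], b: Array[Double]) =
      a.zip(b).map { case (x, y) => x * y }.sum
    val itemBlocks = itemFactors.toSeq.grouped(blockSize)
    itemBlocks.foldLeft(Map.empty[Int, Seq[(Int, Double)]]) { (acc, block) =>
      userFactors.map { case (u, uf) =>
        val scored = block.map { case (i, vf) => (i, dot(uf, vf)) }
        val merged = (acc.getOrElse(u, Seq.empty) ++ scored)
          .sortBy(-_._2).take(k) // keep only the running top-k per user
        u -> merged
      }
    }
  }
}
```

Top-k is mergeable across blocks, so the running result equals the exact top-k over all items regardless of block order.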
[
https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200373#comment-14200373
]
Debasish Das commented on SPARK-4231:
-
[~coderxiang] [~mengxr] [~srowen]
I looked
[
https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14200426#comment-14200426
]
Debasish Das commented on SPARK-4231:
-
[~srowen] I need a standard metric to report
+1
The app to track PRs based on component is a great idea...
On Thu, Nov 6, 2014 at 8:47 AM, Sean McNamara sean.mcnam...@webtrends.com
wrote:
+1
Sean
On Nov 5, 2014, at 6:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hi all,
I wanted to share a discussion we've been having on
userFeatures.lookup(user).head to
work ?
On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote:
Was user presented in training? We can put a check there and return
NaN if the user is not included in the model. -Xiangrui
On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da
if the user is not included in the model. -Xiangrui
On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
I am testing MatrixFactorizationModel.predict(user: Int, product: Int)
but
the code fails on userFeatures.lookup(user).head
In computeRmse
userFeatures.lookup(user).head to
work ?
On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote:
Was user presented in training? We can put a check there and return
NaN if the user is not included in the model. -Xiangrui
On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da
if the user is not included in the model. -Xiangrui
On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
I am testing MatrixFactorizationModel.predict(user: Int, product: Int)
but
the code fails on userFeatures.lookup(user).head
In computeRmse
Debasish Das created SPARK-4231:
---
Summary: Add RankingMetrics to examples.MovieLensALS
Key: SPARK-4231
URL: https://issues.apache.org/jira/browse/SPARK-4231
Project: Spark
Issue Type
Hi,
I just built the master today and I was testing the IR metrics (MAP and
prec@k) on Movielens data to establish a baseline...
I am getting a weird error which I have not seen before:
MASTER=spark://TUSCA09LMLVT00C.local:7077 ./bin/run-example
mllib.MovieLensALS --kryo --lambda 0.065
[
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Debasish Das updated SPARK-2426:
Affects Version/s: (was: 1.0.0)
1.2.0
Quadratic Minimization for MLlib
[
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Debasish Das updated SPARK-2426:
Affects Version/s: (was: 1.2.0)
1.3.0
Quadratic Minimization for MLlib
:24 PM, Sean Owen so...@cloudera.com wrote:
MAP is effectively an average over all k from 1 to min(#
recommendations, # items rated) Getting first recommendations right is
more important than the last.
On Thu, Oct 30, 2014 at 10:21 PM, Debasish Das
debasish.da...@gmail.com
wrote
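The MAP definition Sean gives can be made concrete with a small sketch (the `RankingEval` name is illustrative; this version normalizes by min(#recommended, #relevant) as in the message above, while Spark's later RankingMetrics normalizes by the relevant-set size):

```scala
object RankingEval {
  // Average precision for one user: precision@k summed at each position k
  // where a relevant item appears, normalized by min(#recommended, #relevant).
  // Early hits contribute more, so getting first recommendations right matters most.
  def averagePrecision(recommended: Seq[Int], relevant: Set[Int]): Double = {
    if (relevant.isEmpty) return 0.0
    var hits = 0
    var sum = 0.0
    for ((item, idx) <- recommended.zipWithIndex) {
      if (relevant.contains(item)) {
        hits += 1
        sum += hits.toDouble / (idx + 1) // precision at this position
      }
    }
    sum / math.min(recommended.size, relevant.size)
  }

  // MAP: mean of the per-user average precisions.
  def meanAveragePrecision(users: Seq[(Seq[Int], Set[Int])]): Double =
    users.map { case (rec, rel) => averagePrecision(rec, rel) }.sum / users.size
}
```

For example, recommending (1, 2, 3) against relevant set {1, 3} gives precision 1/1 at position 1 and 2/3 at position 3, so AP = (1 + 2/3) / 2 = 5/6.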
Hi,
I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but
the code fails on userFeatures.lookup(user).head
In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been
called and in all the test-cases that API has been used...
I can perhaps refactor my code to
[
https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191502#comment-14191502
]
Debasish Das commented on SPARK-3987:
-
Nope...standard ALS...same as netflix params
[
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191551#comment-14191551
]
Debasish Das commented on SPARK-2426:
-
[~mengxr] The matlab comparison scripts
[
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191551#comment-14191551
]
Debasish Das edited comment on SPARK-2426 at 10/31/14 8:04 AM
[
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095232#comment-14095232
]
Debasish Das edited comment on SPARK-2426 at 10/31/14 4:20 PM
[
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191997#comment-14191997
]
Debasish Das commented on SPARK-2426:
-
Matlab comparisons of MOSEK, ECOS, PDCO
[
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14192935#comment-14192935
]
Debasish Das commented on SPARK-2426:
-
Refactored QuadraticMinimizer and NNLS from
[
https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Debasish Das reopened SPARK-3987:
-
I can send you a further list of failures...this is one more example...I
strongly suggest moving
[
https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191245#comment-14191245
]
Debasish Das commented on SPARK-3987:
-
NNLS iters 36 result
[
https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191277#comment-14191277
]
Debasish Das commented on SPARK-3987:
-
Were there more changes than the step size in your
[
https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14191387#comment-14191387
]
Debasish Das commented on SPARK-3987:
-
[~mengxr] this came out of an internal dataset
wonder if it is possible to extend the DIMSUM idea to computing top K
matrix multiply between the user and item factor matrices, as opposed to
all-pairs similarity of one matrix?
On Thu, Oct 30, 2014 at 5:28 AM, Debasish Das debasish.da...@gmail.com
wrote:
Is there an example of how to use
any of the topic modeling
algorithms as well...
Is there a better place for it other than mllib examples ?
On Thu, Oct 30, 2014 at 8:13 AM, Debasish Das debasish.da...@gmail.com
wrote:
I thought topK will save us...for each user we have 1 x rank...now our movie
factor is an RDD...we pick topK movie
Hi,
In the current factorization flow, we cross validate on the test dataset
using the RMSE number but there are some other measures which are worth
looking into.
If we consider the problem as a regression problem and the ratings 1-5 are
considered as 5 classes, it is possible to generate a
, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
In the current factorization flow, we cross validate on the test dataset
using the RMSE number but there are some other measures which are worth
looking into.
If we consider the problem as a regression problem and the ratings 1-5
to examples.MovielensALS. ROC
should be good to add as well. -Xiangrui
On Wed, Oct 29, 2014 at 11:23 AM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
In the current factorization flow, we cross validate on the test dataset
using the RMSE number but there are some other measures which are worth
:33 PM, Chih-Jen Lin cj...@csie.ntu.edu.tw wrote:
Debasish Das writes:
If the SVM is not already migrated to BFGS, that's the first thing you
should
try...Basically following LBFGS Logistic Regression come up with LBFGS
based
linear SVM...
About integrating TRON in mllib, David
If the SVM is not already migrated to BFGS, that's the first thing you
should try...Basically following LBFGS Logistic Regression come up with
LBFGS based linear SVM...
About integrating TRON in mllib, David already has a version of TRON in
breeze but someone needs to validate it for linear SVM
@dbtsai for the condition number what did you use? Diagonal preconditioning of
the inverse of the B matrix? But then the B matrix keeps on changing...did you
condition it after every few iterations?
Will it be possible to put that code in Breeze since it will be very useful
to condition other solvers as
[
https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180001#comment-14180001
]
Debasish Das commented on SPARK-3987:
-
I will test it but this is how I called NNLS
[
https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180314#comment-14180314
]
Debasish Das commented on SPARK-3987:
-
[~coderxiang] changing to 1e-6 to 1e-7 fixes
Hi Martin,
This problem is Ax = B where A is your matrix [2 1 3 ... 1; 1 0 3 ...;]
and x is what you want to find...B is 0 in this case...For mllib normally
this is the label...basically create a LabeledPoint where the label is 0 always...
Use mllib's linear regression and solve the following
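The least-squares problem set up above (min ||Ax - B||^2) can be sketched locally via the normal equations, which is roughly what a linear-regression solver does on the reduced problem; the `LeastSquares` name and the plain Gaussian elimination are illustrative, not MLlib's API:

```scala
object LeastSquares {
  // Solve min ||Ax - b||^2 via the normal equations A^T A x = A^T b,
  // with a tiny Gaussian elimination (no pivoting; fine for small
  // well-posed systems).
  def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val m = a.length
    val n = a(0).length
    // Form A^T A (n x n) and A^T b (n)
    val ata = Array.tabulate(n, n)((i, j) => (0 until m).map(r => a(r)(i) * a(r)(j)).sum)
    val atb = Array.tabulate(n)(i => (0 until m).map(r => a(r)(i) * b(r)).sum)
    // Forward elimination
    for (p <- 0 until n; r <- p + 1 until n) {
      val f = ata(r)(p) / ata(p)(p)
      for (c <- p until n) ata(r)(c) -= f * ata(p)(c)
      atb(r) -= f * atb(p)
    }
    // Back substitution
    val x = new Array[Double](n)
    for (i <- n - 1 to 0 by -1) {
      x(i) = (atb(i) - (i + 1 until n).map(j => ata(i)(j) * x(j)).sum) / ata(i)(i)
    }
    x
  }
}
```

With B = 0 everywhere, as in the message, the unregularized minimizer is trivially x = 0; the interesting solutions appear once constraints or regularization are added.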
the architecture. It has all the things you are thinking
about:)
Thanks,
Jayant
On Sat, Oct 18, 2014 at 8:49 AM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
Is someone working on a project on integrating Oryx model serving layer
with Spark ? Models will be built using either Streaming
[
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176358#comment-14176358
]
Debasish Das commented on SPARK-2426:
-
[~mengxr] I thought more on it and one
wrote:
Oryx 2 seems to be geared for Spark
https://github.com/OryxProject/oryx
2014-10-18 11:46 GMT-04:00 Debasish Das debasish.da...@gmail.com:
Hi,
Is someone working on a project on integrating Oryx model serving
layer
with Spark ? Models will be built using either
Hi,
Is someone working on a project on integrating Oryx model serving layer
with Spark ? Models will be built using either Streaming data / Batch data
in HDFS and cross validated with mllib APIs but the model serving layer
will give API endpoints like Oryx
and read the models may be from
[
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175167#comment-14175167
]
Debasish Das commented on SPARK-2426:
-
1. [~mengxr] Our legal was clear that Stanford
Hi,
I am validating the proximal algorithm for positive and bound constrained
ALS and I came across the bug detailed in the JIRA while running ALS with
NNLS:
https://issues.apache.org/jira/browse/SPARK-3987
ADMM based proximal algorithm came up with correct result...
Thanks.
Deb
Debasish Das created SPARK-3987:
---
Summary: NNLS generates incorrect result
Key: SPARK-3987
URL: https://issues.apache.org/jira/browse/SPARK-3987
Project: Spark
Issue Type: Bug
in a different implementation and it
has worked fine.
Now I have to go hunt for how the QR decomposition is exposed in BLAS...
Looks like it's GEQRF, which JBLAS helpfully exposes. Debasish, you could try
it for fun at least.
On Oct 15, 2014 8:06 PM, Debasish Das debasish.da...@gmail.com wrote:
But do
Just checked, QR is exposed by netlib: import org.netlib.lapack.Dgeqrf
For the equality and bound version, I will use QR...it will be faster than
the LU that I am using through jblas.solveSymmetric...
On Thu, Oct 16, 2014 at 8:34 AM, Debasish Das debasish.da...@gmail.com
wrote:
@xiangrui
Hi,
If I take the Movielens data and run the default ALS with regularization as
0.0, I am hitting exception from LAPACK that the gram matrix is not
positive definite. This is on the master branch.
This is how I run it :
./bin/spark-submit --total-executor-cores 1 --master spark://
, 2014 at 5:01 PM, Liquan Pei liquan...@gmail.com wrote:
Hi Debaish,
I think ||r_ij - w_i' h_j||^2 is positive semi-definite.
Thanks,
Liquan
On Wed, Oct 15, 2014 at 4:57 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
If I take the Movielens data and run the default ALS with regularization
Awesome news Matei !
Congratulations to the databricks team and all the community members...
On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi folks,
I interrupt your regularly scheduled user / dev list to bring you some
pretty cool news for the project, which
I have faced this in the past and I had to add the profile -Phadoop-2.3...
mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -DskipTests install
On Wed, Oct 8, 2014 at 1:40 PM, Chuang Liu liuchuan...@gmail.com wrote:
Hi:
I tried to build Spark (1.1.0) with hadoop 2.4.0, and ran a simple
Hi,
I have added some changes to ALS tests and I am re-running tests as:
mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn
-DwildcardSuites=org.apache.spark.mllib.recommendation.ALSSuite test
I have some INFO logs in the code which I want to see on my console. They
work fine if I add
=ERROR
log4j.logger.org.apache.zookeeper=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.I0Itec.zkclient=WARN
On Tue, Oct 7, 2014 at 7:42 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
I have added some changes to ALS tests and I am re-running tests as:
mvn
Another rule of thumb: definitely cache the RDD over which you need
to do iterative analysis...
For the rest of them, only cache if you have a lot of free memory!
On Mon, Oct 6, 2014 at 2:39 PM, Sean Owen so...@cloudera.com wrote:
I think you mean that data2 is a function of data1 in the
Hi,
We write the output of models and other information as parquet files and
later we let data APIs run SQL queries on the columnar data...
SparkSQL is used to dump the data in parquet format and now we are
considering whether using SparkSQL or Impala to read it back...
I came across this
Can't you extend a class in place of an object, which can be generic?
class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] {
  def zero(init: Seq[B]): Seq[B] = Seq.empty // required by AccumulatorParam
  def addInPlace(r1: Seq[B], r2: Seq[B]): Seq[B] = r1 ++ r2
}
On Wed, Oct 1, 2014 at 3:38 AM, Johan Stenberg johanstenber...@gmail.com
wrote:
Just realized that, of course, objects can't be generic, but how do I
create a
If the missing values are 0, then you can also look into implicit
formulation...
On Tue, Sep 30, 2014 at 12:05 PM, Xiangrui Meng men...@gmail.com wrote:
We don't handle missing value imputation in the current version of
MLlib. In future releases, we can store feature information in the
Hi,
Inside mllib I am running tests using:
mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn install
The local tests run fine but the cluster tests are failing...
LBFGSClusterSuite:
- task size should be small *** FAILED ***
org.apache.spark.SparkException: Job aborted due to stage
I have done mvn clean several times...
Consistently all the mllib tests that are using
LocalClusterSparkContext.scala, they fail !
If the tree is too big, build it on GraphX...but it will need thorough
analysis so that the partitions are well balanced...
On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg andy.tw...@gmail.com wrote:
Hi Boromir,
Assuming the tree fits in memory, and what you want to do is parallelize
the
Only fit the data in memory where you want to run the iterative
algorithm
For map-reduce operations, it's better not to cache if you have a memory
crunch...
Also schedule the persist and unpersist such that you utilize the RAM
well...
On Tue, Sep 30, 2014 at 4:34 PM, Liquan Pei
You should look into Evan Spark's talk from Spark Summit 2014
http://spark-summit.org/2014/talk/model-search-at-scale
I am not sure if some of it is already open sourced through MLBase...
On Mon, Sep 29, 2014 at 7:45 PM, Lochana Menikarachchi locha...@gmail.com
wrote:
Hi,
Is there anyone
HBase regionservers need to be balanced...you might have some skewness in
row keys and one regionserver is under pressure...try finding that key and
replicate it using a random salt
On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi Ted,
It converts RDD[Edge] to