Re: sparse x sparse matrix multiplication

2014-11-05 Thread Wei Tan
I think Xiangrui's ALS code implements certain aspects of it. You may want to check it out. Best regards, Wei
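For the sparse x sparse multiplication itself, a minimal sketch of the plain RDD approach: represent both matrices as coordinate lists, join on the shared inner dimension k, multiply matching entries, and sum per output cell. The names and sample entries below are made up for illustration; this is not an MLlib API:

  import org.apache.spark.SparkContext._   // pair-RDD operations on older Spark versions

  // A holds entries (i, k, a_ik); B holds entries (k, j, b_kj)
  val A = sc.parallelize(Seq((0, 1, 2.0), (1, 0, 3.0)))
  val B = sc.parallelize(Seq((1, 2, 4.0), (0, 0, 5.0)))

  val C = A.map { case (i, k, v) => (k, (i, v)) }
    .join(B.map { case (k, j, w) => (k, (j, w)) })          // pair up entries that share k
    .map { case (_, ((i, v), (j, w))) => ((i, j), v * w) }
    .reduceByKey(_ + _)                                     // c_ij = sum over k of a_ik * b_kj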

Re: CUDA in spark, especially in MLlib?

2014-08-28 Thread Wei Tan
Thank you Debasish. I am fine with either Scala or Java. I would like to get a quick evaluation of the performance gain, e.g., ALS on GPU. I would like to try whichever library does the business :) Best regards, Wei

Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Wei Tan
Thank you all. Actually I was looking at JCUDA. Function-wise this may be a perfect solution to offload computation to the GPU. We will see how the performance turns out, especially with the Java binding. Best regards, Wei
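For anyone trying the same route, a rough sketch of calling the JCublas binding from Scala for a dense single-precision GEMM, loosely adapted from the JCuda samples. It assumes the jcuda/jcublas jars are on the classpath and a CUDA-capable GPU is available; matrices are column-major float arrays:

  import jcuda.{Pointer, Sizeof}
  import jcuda.jcublas.JCublas

  // C = A * B for two n x n matrices, computed on the GPU via cuBLAS
  def gemm(n: Int, hA: Array[Float], hB: Array[Float]): Array[Float] = {
    val hC = new Array[Float](n * n)
    JCublas.cublasInit()
    val dA, dB, dC = new Pointer()
    JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dA)
    JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dB)
    JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dC)
    JCublas.cublasSetVector(n * n, Sizeof.FLOAT, Pointer.to(hA), 1, dA, 1)
    JCublas.cublasSetVector(n * n, Sizeof.FLOAT, Pointer.to(hB), 1, dB, 1)
    JCublas.cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n)
    JCublas.cublasGetVector(n * n, Sizeof.FLOAT, dC, 1, Pointer.to(hC), 1)
    JCublas.cublasFree(dA); JCublas.cublasFree(dB); JCublas.cublasFree(dC)
    JCublas.cublasShutdown()
    hC
  }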

CUDA in spark, especially in MLlib?

2014-08-26 Thread Wei Tan
Hi I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no active projects on Scala+CUDA. Python is supported by CUDA though. Any suggestion on whether this idea makes any sense? Best regards, Wei

Re: MLLib: implementing ALS with distributed matrix

2014-08-17 Thread Wei Tan
... Any idea on which method is better? Thanks! Wei
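Whichever way the factor matrix is distributed (broadcast vs. join/block exchange), the local building block is the same per-user ridge solve. A hedged sketch using Breeze, with the regularization deliberately simplified (lambda not scaled by the number of ratings); Y is the item-factor matrix and userRatings is the list of (itemIndex, rating) pairs for one user:

  import breeze.linalg.{DenseMatrix, DenseVector}

  // Solve (sum_i y_i y_i^T + lambda I) x_u = sum_i y_i r_ui over the items i the user rated
  def solveUser(Y: DenseMatrix[Double],
                userRatings: Seq[(Int, Double)],
                lambda: Double): DenseVector[Double] = {
    val k = Y.cols
    val A = DenseMatrix.zeros[Double](k, k)
    val b = DenseVector.zeros[Double](k)
    for ((item, r) <- userRatings) {
      val yi = Y(item, ::).t      // factor vector of one rated item
      A += yi * yi.t              // accumulate the outer product
      b += yi * r
    }
    (A + DenseMatrix.eye[Double](k) * lambda) \ b   // normal-equation solve
  }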

Re: MLLib: implementing ALS with distributed matrix

2014-08-17 Thread Wei Tan
Hi Deb, thanks for sharing your results. Please find my comments inline in blue. Best regards, Wei

RE: executor-cores vs. num-executors

2014-07-16 Thread Wei Tan
Thanks for sharing your experience. I had the same experience -- multiple moderate JVMs beat a single huge JVM. Beyond the minor JVM startup overhead, is it always better to have multiple JVMs rather than a single one? Best regards, Wei
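To make the comparison concrete, this is the kind of sizing being discussed, in YARN syntax (the class and jar names are placeholders; adjust numbers to your cluster):

  spark-submit --master yarn \
    --num-executors 6 --executor-cores 4 --executor-memory 8g \
    --class com.example.MyApp myapp.jar

  # versus one huge executor per node, e.g. --executor-cores 24 --executor-memory 48g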

parallel stages?

2014-07-15 Thread Wei Tan
... the two reduceByKey stages run in parallel given sufficient capacity? Best regards, Wei
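A minimal sketch of what "in parallel" usually means in practice: two independent actions fired from the driver on separate threads, so their jobs (and hence the two reduceByKey stages) can overlap when the cluster has spare cores. rdd1 and rdd2 are placeholder pair RDDs (e.g. RDD[(String, Int)]), and setting spark.scheduler.mode=FAIR helps the two jobs share resources:

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration

  val f1 = Future { rdd1.reduceByKey(_ + _).count() }   // job 1
  val f2 = Future { rdd2.reduceByKey(_ + _).count() }   // job 2, submitted concurrently
  Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)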

Re: parallel stages?

2014-07-15 Thread Wei Tan

Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Wei Tan
Just curious: how about using Scala to drive the workflow? I guess if you use other tools (Oozie, etc.) you lose the advantage of reading from an RDD -- you have to read from HDFS. Best regards, Wei
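A hedged sketch of what "Scala as the workflow driver" can look like: the intermediate RDD is reused directly between steps instead of being written to and re-read from HDFS. parseLine, day and isError are hypothetical helpers standing in for real parsing code:

  val parsed = sc.textFile("hdfs:///input/logs").map(parseLine).cache()
  val dailyCounts = parsed.map(e => (e.day, 1)).reduceByKey(_ + _)   // step 1
  val errorCount  = parsed.filter(_.isError).count()                 // step 2, same in-memory data
  dailyCounts.saveAsTextFile("hdfs:///output/daily-counts")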

Re: rdd.cache() is not faster?

2014-06-18 Thread Wei Tan
... cache? I will try more workers so that each JVM has a smaller heap. Best regards, Wei
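For reference, a minimal sketch of where cache() is and is not expected to help (the path is a placeholder):

  val data = sc.textFile("hdfs:///some/large/file").cache()
  data.count()   // first action computes AND populates the cache, so it is not faster
  data.count()   // later actions read cached blocks, assuming the partitions fit in the heaps
  // The Storage tab of the web UI shows what fraction was actually cached versus recomputed.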

rdd.cache() is not faster?

2014-06-17 Thread Wei Tan

Re: long GC pause during file.cache()

2014-06-16 Thread Wei Tan
Thank you all for the advice, including (1) using the CMS GC, (2) using multiple worker instances, and (3) using Tachyon. I will try (1) and (2) first and report back what I find. I will also try JDK 7 with the G1 GC. Best regards, Wei
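For reference, roughly what (1) and (2) look like with Spark 1.x settings (the values are illustrative):

  # (1) CMS on the executors, plus GC logging to see the pauses
  spark-submit --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:+PrintGCDetails" ...

  # (2) several smaller standalone workers per machine, in conf/spark-env.sh
  SPARK_WORKER_INSTANCES=4
  SPARK_WORKER_MEMORY=24g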

Re: long GC pause during file.cache()

2014-06-16 Thread Wei Tan
BTW: nowadays a single machine with huge RAM (200 GB to 1 TB) is really common, and with virtualization you lose some performance. It would be ideal to see some best practices on how to use Spark on these state-of-the-art machines... Best regards, Wei

long GC pause during file.cache()

2014-06-14 Thread Wei Tan
... org.apache.spark.deploy.SparkSubmit spark-shell --class org.apache.spark.repl.Main ... Best regards, Wei

Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Wei Tan
... </version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
  </dependency>
Best regards, Wei

best practice: write and debug Spark application in scala-ide and maven

2014-06-06 Thread Wei Tan
... application (like wordcount) and debug it quickly against a remote Spark instance? Thanks! Wei
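One pattern that seems to fit this (a sketch; the host, jar path and input are made up): build the jar with mvn package, point the driver at the remote master, and ship the jar via setJars so the remote executors can load your classes. The driver itself then runs, and can be stepped through, inside Eclipse:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("WordCount")
    .setMaster("spark://remote-master:7077")               // remote standalone master
    .setJars(Seq("target/wordcount-0.0.1-SNAPSHOT.jar"))   // what mvn package produced
  val sc = new SparkContext(conf)

  sc.textFile("hdfs:///input/words.txt")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .take(10)
    .foreach(println)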

Re: reuse hadoop code in Spark

2014-06-05 Thread Wei Tan
... to run it in Hadoop. It is fairly complex and relies on a lot of utility Java classes I wrote. Can I reuse the map function in Java and port it into Spark? Best regards, Wei

reuse hadoop code in Spark

2014-06-04 Thread Wei Tan
Hello, I am trying to use Spark in the following scenario: I have code written for Hadoop and am now trying to migrate it to Spark. The mappers and reducers are fairly complex, so I wonder if I can reuse the map() functions I already wrote in Hadoop (Java) and use Spark to chain them, mixing the Java ...
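To make the question concrete, a hedged sketch of the usual pattern (MyLegacyParser and keyOf are placeholders for the existing utility classes): keep the Java logic, drop the Mapper/Reducer plumbing, and let Spark drive it. The utilities referenced inside the closures do need to be serializable or reachable as static methods:

  val lines  = sc.textFile("hdfs:///input/data")
  val mapped = lines.map(line => MyLegacyParser.parse(line))          // reuse the old map() body
  val result = mapped.map(rec => (keyOf(rec), 1)).reduceByKey(_ + _)  // then chain a reduce step
  result.saveAsTextFile("hdfs:///output/result")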