Re: Mllib Logistic Regression performance relative to Mahout

2016-02-28 Thread Yashwanth Kumar
Hi, If your features are numeric, try feature scaling before feeding them to Spark Logistic Regression. It might improve the rate (%) you are getting. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-Logistic-Regression-performance-relative-to-Mahout-tp26346p26358.html
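
To make the feature scaling suggestion concrete, here is a minimal plain-Python sketch of standardization (zero mean, unit variance per column). It does not use Spark; the function name `standardize`, the sample `features` values, and the use of the population standard deviation are assumptions for illustration only, and a library scaler may use a different convention.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))  # transpose rows into columns
    means = [sum(c) / n for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / n)
        for c, m in zip(cols, means)
    ]
    # A constant column has std 0; map it to 0.0 instead of dividing by zero.
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical data: two numeric features on very different scales.
features = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
scaled = standardize(features)
```

After scaling, both columns cover the same range, so neither feature dominates the optimizer purely because of its units.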

Re: Spark Integration Patterns

2016-02-28 Thread Yashwanth Kumar
Hi, To connect to Spark from a remote location and submit jobs, you can try Spark Job Server. It has been open sourced now. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Integration-Patterns-tp26354p26357.html Sent from the Apache Spark User List

Re: How to unpersist a DStream in Spark Streaming

2015-11-04 Thread Yashwanth Kumar
Hi, A DStream (Discretized Stream) is made up of multiple RDDs. You can unpersist each RDD by accessing the individual RDDs using dstreamrdd.foreachRDD { rdd => rdd.unpersist() }

Re: SparkR vs R

2015-09-22 Thread Yashwanth Kumar
Hi, 1. The main difference between SparkR and R is that SparkR can handle big data. Yes, you can use other core libraries inside SparkR (but not algorithms like lm(), glm(), kmeans()). 2. Yes, core R libraries will not be distributed. You can use functions from these libraries which are applicable for mapper

Re: Partitions on RDDs

2015-09-22 Thread Yashwanth Kumar
Hi, In the first RDD transformation (e.g. reading from a file with sc.textFile("path", partitions)), the number of partitions you specify is carried over to further transformations and actions on this RDD, unless a later step shuffles or repartitions the data. In a few places, repartitioning your RDD will give an added advantage. Repartition is usually done during

Re: Slow Performance with Apache Spark Gradient Boosted Tree training runs

2015-09-22 Thread Yashwanth Kumar
Hi vkutsenko, Can you just give partitions to the input labeled RDD, like: data = MLUtils.loadLibSVMFile(jsc.sc(), "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD().repartition(5); Here, I used 5, since you have 5 cores. Also, for further benchmarking and performance tuning:

Re: MLlib inconsistent documentation

2015-09-22 Thread Yashwanth Kumar
Hi, I guess the double values are the number of visits rather than a visit flag (a count should obviously be more useful than a visit flag, i.e. 1/0). This is based on the assumption that, while doing matrix factorisation, ratings trained using implicit feedback cannot be binary, as that gives poor feature values. In