Re: Run Apache Spark on Mini Cluster

2014-05-21 Thread Krishna Sankar
It depends on what stack you want to run. A quick cut: - Worker Machines (DataNode, HBase Region Servers, Spark Worker Nodes) - Dual 6-core CPU - 64 to 128 GB RAM - 3 x 3 TB disks (JBOD) - Master Node (NameNode, HBase Master, Spark Master) - Dual 6-core CPU - 64

Re: How to Run Machine Learning Examples

2014-05-22 Thread Krishna Sankar
I couldn't find the classification.SVM class. - Most probably the command is something of the order of: - bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification examples/target/scala-*/spark-examples-*.jar --algorithm SVM train.csv - For more details

Re: K-nearest neighbors search in Spark

2014-05-27 Thread Krishna Sankar
Carter, Just as a quick and simple starting point for Spark (caveats - lots of improvements required for scaling, graceful and efficient handling of RDDs et al): import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import scala.collection.immutable.ListMap import
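To make that starting point concrete, here is a minimal brute-force sketch along the same lines - the Array[Double] point representation, the toy data, and the single-query shape are illustrative assumptions, and the same scaling caveats apply:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object KNNSketch {
  // Euclidean distance between two points of equal dimension.
  def distance(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("knn-sketch").setMaster("local[*]"))
    val points = sc.parallelize(Seq(
      Array(1.0, 1.0), Array(2.0, 2.0), Array(9.0, 9.0), Array(1.5, 1.1)))
    val query = Array(1.0, 1.2)
    val k = 2
    // Score every point against the query, sort by distance, take the k nearest.
    val neighbors = points
      .map(p => (distance(query, p), p))
      .sortByKey(ascending = true)
      .take(k)
    neighbors.foreach { case (d, p) => println(f"dist=$d%.3f point=${p.mkString(",")}") }
    sc.stop()
  }
}
```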

Re: Why Scala?

2014-05-29 Thread Krishna Sankar
Nicholas, Good question. Couple of thoughts from my practical experience: - Coming from R, Scala feels more natural than other languages. The functional succinctness of Scala is more suited for Data Science than other languages. In short, Scala-Spark makes sense, for Data Science,

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Krishna Sankar
One reason could be that the keys are in a different region. Need to create the keys in us-east-1-North

Re: Spark Usecase

2014-06-04 Thread Krishna Sankar
Shahab, Interesting question. A couple of points (based on the information from your e-mail): 1. One can support the use case in Spark as a set of transformations on a WIP RDD over a span of time and the final transformation outputting to a processed RDD - Spark streaming would be
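For the streaming direction, a hedged sketch of the windowing idea - hold the work-in-progress records over a span of time, then write out the processed result per slide. The socket source, window/slide durations, and output path are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WipWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wip-window").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical input feed
    val wip = lines.window(Seconds(600), Seconds(60))    // 10-min WIP span, sliding every minute
    wip.saveAsTextFiles("hdfs:///data/processed/wip")    // the "final transformation" output

    ssc.start()
    ssc.awaitTermination()
  }
}
```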

Re: How to compile a Spark project in Scala IDE for Eclipse?

2014-06-08 Thread Krishna Sankar
Project -> Properties -> Java Build Path -> Add External Jars. Add the /spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar. Cheers K/ On Sun, Jun 8, 2014 at 8:06 AM, Carter gyz...@hotmail.com wrote: Hi All, I just downloaded the Scala IDE for Eclipse. After I created a Spark project and

Re: problem starting the history server on EC2

2014-06-10 Thread Krishna Sankar
Yep, it gives tons of errors. I was able to make it work with sudo. Looks like an ownership issue. Cheers k/ On Tue, Jun 10, 2014 at 6:29 PM, zhen z...@latrobe.edu.au wrote: I created a Spark 1.0 cluster on EC2 using the provided scripts. However, I do not seem to be able to start the history

Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Hi, Would appreciate insights and wisdom on a problem we are working on: 1. Context: - Given a csv file like: - d1,c1,a1 - d1,c1,a2 - d1,c2,a1 - d1,c1,a1 - d2,c1,a3 - d2,c2,a1 - d3,c1,a1 - d3,c3,a1 - d3,c2,a1 - d3,c3,a2

Re: Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Answered one of my questions (#5): val pairs = new PairRDDFunctions(RDD) works fine locally. Now I can do groupByKey et al. Am not sure if it is scalable and memory efficient for millions of records. Cheers k/ On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar ksanka...@gmail.com wrote: Hi

Re: Multi-dimensional Uniques over large dataset

2014-06-14 Thread Krishna Sankar
And got the first cut: val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size)) gives the total and unique counts. The question: is it scalable and efficient? Would appreciate insights. Cheers k/ On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar ksanka...@gmail.com wrote

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Krishna Sankar
Ian, Yep, HLL is an appropriate mechanism. The countApproxDistinctByKey is a wrapper around the com.clearspring.analytics.stream.cardinality.HyperLogLogPlus. Cheers k/ On Sun, Jun 15, 2014 at 4:50 PM, Ian O'Connell i...@ianoconnell.com wrote: Depending on your requirements when doing hourly
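For reference, a minimal usage sketch of that API - the pair data is an illustrative assumption shaped like the (dimension-key, attribute) pairs from the uniques thread, and 0.05 is the requested relative accuracy:

```scala
// Assumes an existing SparkContext sc; toy pairs shaped like the uniques thread.
val pairs = sc.parallelize(Seq(("d1,c1", "a1"), ("d1,c1", "a2"), ("d1,c1", "a1")))
// HLL-backed approximate distinct count per key, without materializing groups.
val approxUniques = pairs.countApproxDistinctByKey(relativeSD = 0.05)
approxUniques.collect().foreach { case (key, distinct) =>
  println(s"$key ~ $distinct uniques")
}
```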

Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread Krishna Sankar
Mahesh, - One direction could be: create a Parquet schema, convert and save the records to HDFS. - This might help https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala Cheers k/ On Tue, Jun 17, 2014 at 12:52 PM,
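A hedged sketch of that direction, using the SchemaRDD API that was current in the Spark 1.0 era of this thread - the Event case class, the input DStream, and the output path are illustrative assumptions:

```scala
import org.apache.spark.sql.SQLContext

// The case class drives the Parquet schema here.
case class Event(id: Long, payload: String)

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext sc
import sqlContext.createSchemaRDD    // implicit RDD[Product] -> SchemaRDD (Spark 1.0 idiom)

stream.foreachRDD { rdd =>           // stream: an assumed DStream[String]
  val events = rdd.map { line =>
    val f = line.split(",")
    Event(f(0).toLong, f(1))
  }
  // One Parquet directory per batch, keyed by timestamp.
  events.saveAsParquetFile(s"hdfs:///data/events/${System.currentTimeMillis}")
}
```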

Re: Spark Processing Large Data Stuck

2014-06-21 Thread Krishna Sankar
Hi, - I have seen similar behavior before. As far as I can tell, the root cause is the out of memory error - verified this by monitoring the memory. - I had a 30 GB file and was running on a single machine with 16GB. So I knew it would fail. - But instead of raising an

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Krishna Sankar
Konstantin, 1. You need to install the Hadoop rpms on all nodes. If it is Hadoop 2, the nodes would have HDFS and YARN. 2. Then you need to install Spark on all nodes. I haven't had experience with HDP, but the tech preview might have installed Spark as well. 3. In the end, one should

Re: Requirements for Spark cluster

2014-07-09 Thread Krishna Sankar
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in all the nodes irrespective of Hadoop/YARN. Cheers k/ On Tue, Jul 8, 2014 at 6:24 PM, Robert James srobertja...@gmail.com wrote: I have a Spark app which runs well on local master. I'm now ready to put it on a

Re: Need help on spark Hbase

2014-07-15 Thread Krishna Sankar
One vector to check is the HBase libraries in the --jars, as in: spark-submit --class <your class> --master <master url> --jars

Re: Out of any idea

2014-07-19 Thread Krishna Sankar
Probably you have - if not, try a very simple app in the docker container and make sure it works. Sometimes resource contention/allocation can get in the way. This happened to me in the YARN container. Also try single worker thread. Cheers k/ On Sat, Jul 19, 2014 at 2:39 PM, boci

MLlib 1.2 New Interesting Features

2014-09-27 Thread Krishna Sankar
Guys, - Need help in terms of the interesting features coming up in MLlib 1.2. - I have a 2-part, ~3 hr hands-on tutorial at the Big Data Tech Con - The Hitchhiker's Guide to Machine Learning with Python and Apache Spark[2] - At minimum, it would be good to take the last 30 min

Re: MLlib 1.2 New Interesting Features

2014-09-29 Thread Krishna Sankar
binary packages and documentation can be easily found on spark.apache.org, which is important for making a hands-on tutorial. Best, Xiangrui On Sat, Sep 27, 2014 at 12:15 PM, Krishna Sankar ksanka...@gmail.com wrote: Guys, Need help in terms of the interesting features coming up in MLlib

Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
to be 0.1 or 0.01? Best, Burak - On Wednesday, October 1, 2014, Krishna Sankar ksanka...@gmail.com wrote (Subject: MLlib Linear Regression Mismatch): Guys, Obviously I am doing something wrong. Maybe 4 points are too small

Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread Krishna Sankar
Hi, I am sure you can use the -Pspark-ganglia-lgpl switch to enable Ganglia. This step only adds the support for Hadoop, Yarn, Hive et al in the Spark executable. No need to run it if one is not using them. Cheers k/ On Thu, Oct 2, 2014 at 12:29 PM, danilopds danilob...@gmail.com wrote: Hi

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Krishna Sankar
Well done guys. MapReduce sort at that time was a good feat and Spark now has raised the bar with the ability to sort a PB. Like some of the folks in the list, a summary of what worked (and didn't) as well as the monitoring practices would be good. Cheers k/ P.S: What are you folks planning next ?

Re: Spark or MR, Scala or Java?

2014-11-22 Thread Krishna Sankar
Adding to already interesting answers: - Is there any case where MR is better than Spark? I don't know in which cases I should use Spark vs. MR. When is MR faster than Spark? - Many. MR would be better (am not saying faster ;o)) for - Very large datasets, - Multistage

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
Good point. On the positive side, whether we choose the most efficient mechanism in Scala might not be as important, as the Spark framework mediates the distributed computation. Even if there is some declarative part in Spark, we can still choose an inefficient computation path that is not

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
A very timely article http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/ Cheers k/ P.S: Now reply to ALL. On Sun, Nov 23, 2014 at 7:16 PM, Krishna Sankar ksanka...@gmail.com wrote: Good point. On the positive side, whether we choose the most efficient mechanism in Scala might

Re: Calling ALS-MlLib from desktop application/ Training ALS

2014-12-13 Thread Krishna Sankar
a) There is no absolute RMSE - it depends on the domain. Also, RMSE is the error based on what you have seen so far, a snapshot of a slice of the domain. b) My suggestion is put the system in place, see what happens when users interact with the system and then you can think of reducing the RMSE as

Re: spark-shell working in scala-2.11

2015-01-28 Thread Krishna Sankar
Stephen, Scala 2.11 worked fine for me. Did the dev change and then compiled. Not using it in production, but I go back and forth between 2.10 and 2.11. Cheers k/ On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman stephen.haber...@gmail.com wrote: Hey, I recently compiled Spark master against

[no subject]

2015-01-10 Thread Krishna Sankar
Guys, registerTempTable("Employees") gives me the error Exception in thread "main" scala.ScalaReflectionException: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath

Re: DeepLearning and Spark ?

2015-01-09 Thread Krishna Sankar
I am also looking at this domain. We could potentially use the broadcast capability in Spark to distribute the parameters. Haven't thought thru yet. Cheers k/ On Fri, Jan 9, 2015 at 2:56 PM, Andrei faithlessfri...@gmail.com wrote: Does it makes sense to use Spark's actor system (e.g. via
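A speculative sketch of that broadcast idea - ship the parameter vector once per executor and score partitions locally. The data, parameter size, and linear scoring step are illustrative assumptions:

```scala
// Assumes an existing SparkContext sc.
val weights: Array[Double] = Array.fill(4)(0.1)  // model parameters to distribute
val bcWeights = sc.broadcast(weights)            // one copy per executor, not per task

val data = sc.parallelize(Seq(
  Array(1.0, 0.0, 2.0, 1.0),
  Array(0.5, 1.5, 0.0, 2.0)))
// Workers read the broadcast value locally; a simple dot-product scoring step.
val scores = data.map(x => x.zip(bcWeights.value).map { case (xi, wi) => xi * wi }.sum)
scores.collect().foreach(println)
```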

Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread Krishna Sankar
Interestingly Google Chrome translates the materials. Cheers k/ On Tue, Jan 6, 2015 at 7:26 PM, Boromir Widas vcsub...@gmail.com wrote: I do not understand Chinese but the diagrams on that page are very helpful. On Tue, Jan 6, 2015 at 9:46 PM, eric wong win19...@gmail.com wrote: A good

Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-03 Thread Krishna Sankar
Alec, Good questions. Suggestions: 1. Refactor the problem into layers viz. DFS, Data Store, DB, SQL Layer, Cache, Queue, App Server, App (Interface), App (backend ML) et al. 2. Then slot-in the appropriate technologies - may be even multiple technologies for the same layer and

Re: IPython notebook command for spark need to be updated?

2015-03-20 Thread Krishna Sankar
Yep the command-option is gone. No big deal, just add the '%pylab inline' command as part of your notebook. Cheers k/ On Fri, Mar 20, 2015 at 3:45 PM, cong yue yuecong1...@gmail.com wrote: Hello : I tried ipython notebook with the following command in my enviroment.

Re: General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Krishna Sankar
Without knowing the data size, computation and storage requirements ... : - Dual 6 or 8 core machines, 256 GB memory each, 12-15 TB per machine. Probably 5-10 machines. - Don't go for the most exotic machines; otoh don't go for the cheapest ones either. - Find a sweet spot with your

Re: Movie Recommendation tutorial

2015-02-23 Thread Krishna Sankar
1. The RMSE varies a little bit between the versions. 2. Partitioned the training, validation, test set like so: - training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6) - validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 6 and (x[3] % 10) < 8) - test =

Re: Movie Recommendation tutorial

2015-02-24 Thread Krishna Sankar
to set lambda to 0.1. -Xiangrui On Mon, Feb 23, 2015 at 3:06 PM, Krishna Sankar ksanka...@gmail.com wrote: The RMSE varies a little bit between the versions. Partitioned the training, validation, test set like so: training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6) validation

Re: randomSplit instead of a huge map reduce ?

2015-02-21 Thread Krishna Sankar
- Divide and conquer with reduceByKey (like Ashish mentioned, each pair being the key) would work - looks like a MapReduce-with-combiners problem. I think reduceByKey would use combiners while aggregateByKey wouldn't. - Could we optimize this further by using combineByKey directly?
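As a concrete version of the combineByKey question, a hedged sketch computing per-key distinct counts with Sets as the combiners - the pair shape mirrors the uniques thread and is an illustrative assumption:

```scala
// Assumes an existing SparkContext sc.
val pairs = sc.parallelize(Seq(("d1,c1", "a1"), ("d1,c1", "a2"), ("d1,c1", "a1")))
val uniquesPerKey = pairs.combineByKey(
  (v: String) => Set(v),                     // createCombiner: start a set per key
  (acc: Set[String], v: String) => acc + v,  // mergeValue: map-side combine
  (a: Set[String], b: Set[String]) => a ++ b // mergeCombiners: merge after shuffle
).mapValues(_.size)
uniquesPerKey.collect().foreach(println)
```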

Re: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-23 Thread Krishna Sankar
Do you have commons-csv-1.1-bin.jar in your path somewhere? I had to download and add this. Cheers k/ On Wed, Apr 22, 2015 at 11:01 AM, Mohammed Omer beancinemat...@gmail.com wrote: Afternoon all, I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via: `mvn -Pyarn

Re: Dataset announcement

2015-04-15 Thread Krishna Sankar
Thanks Olivier. Good work. Interesting in more than one way - including training, benchmarking, testing new releases et al. One quick question - do you plan to make it available as an S3 bucket? Cheers k/ On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle oliv...@chapelle.cc wrote: Dear Spark

import pyspark.sql.Row gives error in 1.4.1

2015-07-02 Thread Krishna Sankar
Error - ImportError: No module named Row. Cheers and enjoy the long weekend. k/
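(The likely cause: Row is a class inside the pyspark.sql module, not a module itself, so "import pyspark.sql.Row" fails while "from pyspark.sql import Row" works.)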

Re: making dataframe for different types using spark-csv

2015-07-01 Thread Krishna Sankar
- use .cast(...).alias('...') after the DataFrame is read. - sql.functions.udf for any domain-specific conversions. Cheers k/ On Wed, Jul 1, 2015 at 11:03 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi experts! I am using spark-csv to load csv data into a dataframe. By default it
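A minimal sketch of the cast/alias suggestion - the spark-csv read options, the input file, the column names, and the target types are illustrative assumptions:

```scala
import org.apache.spark.sql.functions.col

// Assumes an existing SQLContext sqlContext; spark-csv reads everything as strings.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("cars.csv")  // hypothetical input file

// Cast the string columns to the wanted types, keeping readable names.
val typed = df.select(
  col("id").cast("int").alias("id"),
  col("price").cast("double").alias("price"),
  col("name"))
```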

Re: Kmeans Labeled Point RDD

2015-05-21 Thread Krishna Sankar
You can predict and then zip it with the points RDD to get approximately the same thing as a LabeledPoint. Cheers k/ On Thu, May 21, 2015 at 6:19 PM, anneywarlord anneywarl...@gmail.com wrote: Hello, New to Spark. I wanted to know if it is possible to use a Labeled Point RDD in org.apache.spark.mllib.clustering.KMeans.
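A hedged sketch of the predict-then-zip idea - k, the iteration count, and the toy points are illustrative assumptions:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assumes an existing SparkContext sc.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
val model = KMeans.train(points, k = 2, maxIterations = 20)

// Pair each cluster id with its point: roughly a LabeledPoint-shaped view.
val labeled = model.predict(points).zip(points)  // RDD[(Int, Vector)]
labeled.collect().foreach(println)
```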

Re: SparkSQL built in functions

2015-06-29 Thread Krishna Sankar
Interesting. Looking at the definitions, sql.functions.pow is defined only for (col,col). Just as an experiment, create a column with value 2 and see if that works. Cheers k/ On Mon, Jun 29, 2015 at 1:34 PM, Bob Corsaro rcors...@gmail.com wrote: 1.4 and I did set the second parameter. The DSL
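In Scala form, that experiment would look like this hedged sketch - the column name and toy data are illustrative assumptions:

```scala
import org.apache.spark.sql.functions.{col, lit, pow}

// Assumes an existing SQLContext sqlContext and SparkContext sc.
import sqlContext.implicits._
val df = sc.parallelize(Seq(2.0, 3.0, 4.0).map(Tuple1.apply)).toDF("x")

// If this build only defines pow(Column, Column), wrap the scalar in lit(...).
val squared = df.select(pow(col("x"), lit(2)).alias("x_squared"))
squared.show()
```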

Re: Spark MLib v/s SparkR

2015-08-05 Thread Krishna Sankar
A few points to consider: a) SparkR gives the union of R_in_a_single_machine and the distributed_computing_of_Spark: b) It also gives the ability to wrangle with data in R, that is in the Spark eco system c) Coming to MLlib, the question is MLlib and R (not MLlib or R) - depending on the scale,

Re: Sum elements of an iterator inside an RDD

2015-07-11 Thread Krishna Sankar
Looks like reduceByKey() should work here. Cheers k/ On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna leonida.gianfa...@gmail.com wrote: Thanks a lot oubrik, I got your point; my consideration is that sum() should already be a built-in function for iterators in Python. Anyway I tried
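For comparison, a Scala sketch of the reduceByKey suggestion - summing per key at the framework level rather than summing an iterator by hand; the word-count shape of the data is an illustrative assumption:

```scala
// Assumes an existing SparkContext sc.
val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1))    // key each element
  .reduceByKey(_ + _)  // partial sums are combined map-side, then merged per key
counts.collect().foreach(println)
```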

Re: HDP 2.3 support for Spark 1.5.x

2015-09-28 Thread Krishna Sankar
finished. 2015-09-23 1:31 GMT+02:00 Zhan Zhang <zzh...@hortonworks.com>: Hi Krishna, For the time being, you can download from upstream, and it should be running OK for HDP 2.3. For HDP-specific problems, you

Thanks For a Job Well Done !!!

2016-06-18 Thread Krishna Sankar
Hi all, Just wanted to thank all for the Dataset API - most of the time we see only bugs in these lists ;o). - Putting some context, this weekend I was updating the SQL chapters of my book - it had all the ugliness of SchemaRDD, registerTempTable, take(10).foreach(println) and

Re: Is Spark right for us?

2016-03-06 Thread Krishna Sankar
Good question. It comes down to computational complexity, computational scale, and data volume. 1. If you can store the data in a single server or a small cluster of db servers (say MySQL), then HDFS/Spark might be overkill 2. If you can run the computation/process the data on a single

Re: Spark 1.6.2 version displayed as 1.6.1

2016-07-25 Thread Krishna Sankar
This intrigued me as well. - Just to be sure, I downloaded the 1.6.2 code and recompiled. - spark-shell and pyspark both show 1.6.2 as expected. Cheers On Mon, Jul 25, 2016 at 1:45 AM, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote: Another possible explanation is that by

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Krishna Sankar
Thanks Nick. I also ran into this issue. VG, One workaround is to drop the NaN from predictions (df.na.drop()) and then use the dataset for the evaluator. In real life, probably detect the NaN and recommend most popular on some window. HTH. Cheers On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath
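A hedged sketch of that workaround - model (a fitted ml ALS model), the test DataFrame, and the column names are assumptions from the thread:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Drop rows whose prediction is NaN (cold-start users/items absent from the
// training split) before handing the dataset to the evaluator.
val predictions = model.transform(test)
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions.na.drop(Seq("prediction")))
println(s"RMSE without NaN predictions: $rmse")
```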

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi, Looks like the test dataset has different sizes for X & Y. Possible steps: 1. What is the test-data size? - If it is 15,909, check the prediction variable vector - it is now 29,471, should be 15,909 - If you expect it to be 29,471, then the X matrix is not right.

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
that does the data split and the datasets where they are allocated to. Cheers On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar <ksanka...@gmail.com> wrote: Hi, Looks like the test dataset has different sizes for X & Y. Possible steps: 1. W