Re: Multi-dimensional Uniques over large dataset

2014-06-14 Thread Krishna Sankar
And got the first cut: val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size)) gives the total and the unique count. The question: is it scalable and efficient? Would appreciate insights. Cheers k/ On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar ksanka...@gmail.com wrote:
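The first cut above (the archive stripped the `>` from `=>`) can be mirrored with plain Scala collections; a minimal sketch with illustrative names, where groupBy plays the role of groupByKey:

```scala
// Sketch of the thread's first cut: per key, the total number of
// values and the number of distinct values. Plain collections stand
// in for the RDD; this is not Spark API.
object DistinctCounts {
  def totalsAndUniques[K, V](pairs: Seq[(K, V)]): Map[K, (Int, Int)] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      val values = kvs.map(_._2)
      (k, (values.size, values.toSet.size)) // (total, distinct)
    }

  def main(args: Array[String]): Unit =
    println(totalsAndUniques(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3))))
}
```

Like groupByKey, this materializes every value for a key at once, which is why the OOM thread below moves toward fold-style aggregation.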

Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-14 Thread Xiangrui Meng
1. examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala contains example code that shows how to set regParam. 2. A static method with more than 3 parameters becomes hard to remember and hard to maintain. Please use LogisticRegressionWithSGD's default constructor

SparkSQL registerAsTable - No TypeTag available Error

2014-06-14 Thread premdass
Hi, I am trying to run the Spark SQL example provided at https://spark.apache.org/docs/latest/sql-programming-guide.html as a standalone program. When I try to compile the program, I get the error below: Done updating. Compiling 1 Scala source to

Accumulable with huge accumulated value?

2014-06-14 Thread Nilesh Chakraborty
Hey all! I have got an iterative problem. I'm trying to find something similar to Hadoop's MultipleOutputs [1] in Spark 1.0. I need to build up a couple of large dense vectors (may contain billions of elements - 2 billion doubles = at least 16GB) by adding partial vector chunks to it. This can be
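The chunk-merging step the post describes can be sketched without Spark: an addInPlace-style elementwise sum, which is what an Accumulable's merge would perform (names here are illustrative, not Spark's API):

```scala
// Elementwise merge of a partial vector chunk into an accumulated
// dense vector, as an Accumulable-style addInPlace would do.
// At 2 billion doubles (~16 GB) the accumulated value must fit on
// the driver, which is the crux of the question in the thread.
object VectorMerge {
  def addInPlace(acc: Array[Double], chunk: Array[Double]): Array[Double] = {
    require(acc.length == chunk.length, "chunks must match the vector length")
    var i = 0
    while (i < acc.length) { acc(i) += chunk(i); i += 1 }
    acc
  }
}
```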

GroupByKey results in OOM - Any other alternative

2014-06-14 Thread Vivek YS
Hi, For the last couple of days I have been trying hard to get around this problem. Please share any insights on solving it. Problem: there is a huge list of (key, value) pairs. I want to transform this to (key, distinct values) and then eventually to (key, distinct values count) On

Re: GroupByKey results in OOM - Any other alternative

2014-06-14 Thread Sean Owen
Grouping by key is always problematic since a key might have a huge number of values. You can do a little better than grouping *all* values and *then* finding distinct values by using foldByKey, putting values into a Set. At least you end up with only distinct values in memory. (You don't need two
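Sean's suggestion can be sketched with plain collections; the spirit of it on an RDD would be pairs.mapValues(Set(_)).foldByKey(Set.empty[V])(_ ++ _), but the names below are illustrative, not Spark API:

```scala
// Fold values into a Set per key, so duplicate values for a key are
// never all held in memory together, unlike a full groupByKey.
object DistinctByKey {
  def distinctValues[K, V](pairs: Seq[(K, V)]): Map[K, Set[V]] =
    pairs.foldLeft(Map.empty[K, Set[V]]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, Set.empty[V]) + v)
    }

  // The distinct count then follows without a second pass over raw values.
  def distinctCounts[K, V](pairs: Seq[(K, V)]): Map[K, Int] =
    distinctValues(pairs).map { case (k, vs) => (k, vs.size) }
}
```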

DStream are not processed after upgrade to Spark 1.0

2014-06-14 Thread Chang Lim
Hi All, I have some streaming code in Java that works on 0.9.1. After upgrading to 1.0 (with fixes for minor API changes) the DStream does not seem to be executing. The tasks get killed within 1 second by the worker. Any idea what is causing it? The worker log file is not logging my debug statements. The

Is shuffle stable?

2014-06-14 Thread Daniel Darabos
What I mean is, let's say I run this: sc.parallelize(Seq(0->3, 0->2, 0->1), 3).partitionBy(new HashPartitioner(3)).collect Will the result always be Array((0,3), (0,2), (0,1))? Or could I possibly get a different order? I'm pretty sure the shuffle files are taken in the order of the source

Re: guidance on simple unit testing with Spark

2014-06-14 Thread Gerard Maas
On Jun 14, 2014 4:05 AM, Matei Zaharia matei.zaha...@gmail.com wrote: You need to factor your program so that it’s not just a main(). This is not a Spark-specific issue, it’s about how you’d unit test any program in general. In this case, your main() creates a SparkContext, so you
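Matei's advice to factor the logic out of main() can be sketched like this (all names are illustrative, not the poster's actual program):

```scala
// Keep the transformation as a plain function over data, so a unit
// test can exercise it without a SparkContext; only main() owns the
// context in a real application.
object LineStats {
  def lineLengths(lines: Seq[String]): Seq[Int] = lines.map(_.length)

  def main(args: Array[String]): Unit = {
    // A real app would create a SparkContext here and apply
    // lineLengths-style logic to RDD data.
    println(lineLengths(Seq("spark", "unit", "testing")))
  }
}
```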

Re: SparkSQL registerAsTable - No TypeTag available Error

2014-06-14 Thread Michael Armbrust
Can you maybe attach the full Scala file? On Sat, Jun 14, 2014 at 5:03 AM, premdass premdas...@yahoo.co.in wrote: Hi, I am trying to run the Spark SQL example provided at https://spark.apache.org/docs/latest/sql-programming-guide.html as a standalone program. When I try to

Re: SparkSQL registerAsTable - No TypeTag available Error

2014-06-14 Thread Michael Armbrust
Actually, are you defining Person as an inner class? You might be running into this: http://stackoverflow.com/questions/18866866/why-there-is-no-typetag-available-in-nested-instantiations-when-interpreted-by On Sat, Jun 14, 2014 at 1:51 PM, Michael Armbrust mich...@databricks.com wrote: Can

Failing to run standalone streaming app: IOException; classNotFoundException; and more

2014-06-14 Thread pns
Hi, I'm attempting to run the following simple standalone app on Mac OS and Spark 1.0 using sbt:
val sparkConf = new SparkConf().setAppName("ProcessEvents").setMaster("local[*]").setSparkHome("/Users/me/Downloads/spark")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val lines =

Re: Is shuffle stable?

2014-06-14 Thread Matei Zaharia
The order is not guaranteed actually, only which keys end up in each partition. Reducers may fetch data from map tasks in an arbitrary order, depending on which ones are available first. If you’d like a specific order, you should sort each partition. Here you might be getting it because each

Re: spark master UI does not keep detailed application history

2014-06-14 Thread wxhsdp
Hi Zhen, I met the same problem on EC2: application details cannot be accessed, but I can read stdout and stderr. The problem has not been solved yet.

Re: Is shuffle stable?

2014-06-14 Thread Daniel Darabos
Thanks Matei! In the example all three items have the same key, so they go to the same partition: scala> sc.parallelize(Seq(0->3, 0->2, 0->1), 3).partitionBy(new HashPartitioner(3)).glom.collect Array(Array((0,3), (0,2), (0,1)), Array(), Array()) I guess the apparent stability is just due to the
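What the partitioner does guarantee is the key-to-partition mapping; a sketch of a HashPartitioner-style assignment (value order within a partition remains unspecified), with illustrative names:

```scala
// The key -> partition mapping a hash partitioner provides: a
// non-negative modulus of the key's hashCode. All pairs with key 0
// therefore land in partition 0 of 3, matching the glom output above.
object HashPartitionSketch {
  def partition(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod // keep it non-negative
  }
}
```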

long GC pause during file.cache()

2014-06-14 Thread Wei Tan
Hi, I have a single node (192G RAM) standalone Spark, with memory configuration like this in spark-env.sh: SPARK_WORKER_MEMORY=180g SPARK_MEM=180g. In spark-shell I have a program like this: val file = sc.textFile("/localpath") // file size is 40G file.cache() val output = file.map(line =>

Re: GroupByKey results in OOM - Any other alternative

2014-06-14 Thread Vivek YS
Thanks for the input. I will give foldByKey a shot. The way I am doing it: the data is partitioned hourly, so I compute distinct values hourly. Then I use unionRDD to merge them and compute distinct on the overall data. Is there a way to know which (key, value) pair is resulting in the OOM? Is
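The hourly-then-overall scheme described here can be sketched with sets, so the union step merges already-distinct values instead of raw pairs (illustrative names, not Spark API):

```scala
// Merge per-hour distinct-value maps into an overall map; the union
// of Sets keeps each (key, value) at most once, so the final
// distinct count never has to re-group the raw pairs.
object HourlyDistinct {
  def overall[K, V](hourly: Seq[Map[K, Set[V]]]): Map[K, Set[V]] =
    hourly.foldLeft(Map.empty[K, Set[V]]) { (acc, hour) =>
      hour.foldLeft(acc) { case (a, (k, vs)) =>
        a.updated(k, a.getOrElse(k, Set.empty[V]) ++ vs)
      }
    }
}
```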