And got the first cut:
val res = pairs.groupByKey().map(x => (x._1, x._2.size, x._2.toSet.size))
which gives the total count and the unique count per key.
The question: is it scalable and efficient? Would appreciate insights.
Cheers
k/
On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar ksanka...@gmail.com
wrote:
1.
examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala
contains example code that shows how to set regParam.
2. A static method with more than 3 parameters becomes hard to
remember and hard to maintain. Please use LogisticRegressionWithSGD's
default constructor
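A minimal sketch of that advice, following the pattern in the BinaryClassification example: use the no-arg constructor and configure the optimizer with setters instead of a many-argument static train(...) call. The toy training data and local master here are my own assumptions, not from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val sc = new SparkContext(new SparkConf().setAppName("lr-sketch").setMaster("local[*]"))

// Toy data for illustration only; in practice load your own LabeledPoints.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -2.0))
))

// Default constructor, then set regParam (and anything else) on the optimizer.
val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(50)
  .setRegParam(0.01)

val model = lr.run(training)
```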
Hi,
I am trying to run the spark sql example provided on the example
https://spark.apache.org/docs/latest/sql-programming-guide.html as a
standalone program.
When I try to compile the program, I am getting the below error
Done updating.
Compiling 1 Scala source to
Hey all!
I have got an iterative problem. I'm trying to find something similar to
Hadoop's MultipleOutputs [1] in Spark 1.0. I need to build up a couple of
large dense vectors (may contain billions of elements - 2 billion doubles =
at least 16GB) by adding partial vector chunks to it. This can be
Hi,
For last couple of days I have been trying hard to get around this
problem. Please share any insights on solving this problem.
Problem :
There is a huge list of (key, value) pairs. I want to transform this to
(key, distinct values) and then eventually to (key, distinct values count)
Grouping by key is always problematic since a key might have a huge number
of values. You can do a little better than grouping *all* values and *then*
finding distinct values by using foldByKey, putting values into a Set. At
least you end up with only distinct values in memory. (You don't need two
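A minimal sketch of that suggestion (toy data and local master are my assumptions): wrap each value in a single-element Set, then fold the sets together per key with foldByKey, so only distinct values are ever held for a key.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("distinct-per-key").setMaster("local[*]"))

val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)))

// Fold single-element Sets together per key; duplicates disappear as
// they are merged, on the map side as well as after the shuffle.
val distinctPerKey = pairs
  .mapValues(Set(_))
  .foldByKey(Set.empty[Int])(_ ++ _)

// (key, distinct values count)
val counts = distinctPerKey.mapValues(_.size)
```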
Hi All,
I've some Streaming code in Java that works on 0.9.1. After upgrading to
1.0 (with fix to minor API changes) DStream does not seem to be executing.
The tasks got killed in 1 second by the worker. Any idea what is causing
it?
The worker log file is not logging my debug statements. The
What I mean is, let's say I run this:
sc.parallelize(Seq(0 -> 3, 0 -> 2, 0 -> 1), 3).partitionBy(new HashPartitioner(3)).collect
Will the result always be Array((0,3), (0,2), (0,1))? Or could I
possibly get a different order?
I'm pretty sure the shuffle files are taken in the order of the source
On Jun 14, 2014 4:05 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
You need to factor your program so that it’s not just a main(). This is
not a Spark-specific issue, it’s about how you’d unit test any program in
general. In this case, your main() creates a SparkContext, so you
Can you maybe attach the full scala file?
On Sat, Jun 14, 2014 at 5:03 AM, premdass premdas...@yahoo.co.in wrote:
Hi,
I am trying to run the spark sql example provided on the example
https://spark.apache.org/docs/latest/sql-programming-guide.html as a
standalone program.
When I try to
Actually, are you defining Person as an inner class?
You might be running into this:
http://stackoverflow.com/questions/18866866/why-there-is-no-typetag-available-in-nested-instantiations-when-interpreted-by
On Sat, Jun 14, 2014 at 1:51 PM, Michael Armbrust mich...@databricks.com
wrote:
Can
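The pitfall in that linked answer can be seen with plain Scala reflection, which is what Spark SQL's schema inference relies on: a TypeTag can be materialized for a top-level (or object-nested) case class, but not for one defined inside a method. A sketch, with Person as a stand-in name:

```scala
import scala.reflect.runtime.universe._

// Defined at the top level, so the compiler can materialize a TypeTag for it.
case class Person(name: String, age: Int)

// Requires an implicit TypeTag[T], exactly as Spark SQL's
// createSchemaRDD does for the case class.
def tagOf[T: TypeTag]: TypeTag[T] = typeTag[T]

val tag = tagOf[Person] // compiles; a method-local case class would not
```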
Hi, I'm attempting to run the following simple standalone app on Mac OS and
Spark 1.0 using sbt:
val sparkConf = new SparkConf().setAppName("ProcessEvents")
  .setMaster("local[*]")
  .setSparkHome("/Users/me/Downloads/spark")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val lines =
The order is not guaranteed actually, only which keys end up in each partition.
Reducers may fetch data from map tasks in an arbitrary order, depending on
which ones are available first. If you’d like a specific order, you should sort
each partition. Here you might be getting it because each
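A minimal sketch of sorting each partition explicitly, as suggested (toy data and local master are my assumptions):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("sorted-partitions").setMaster("local[*]"))

val pairs = sc.parallelize(Seq(0 -> 3, 0 -> 2, 0 -> 1), 3)
  .partitionBy(new HashPartitioner(3))

// The fetch order from map tasks is arbitrary, so impose an order
// explicitly by sorting each partition's contents.
val sorted = pairs.mapPartitions(
  iter => iter.toArray.sortBy(_._2).iterator,
  preservesPartitioning = true)
```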
Hi Zhen,
I met the same problem in EC2: the application details cannot be accessed,
but I can read stdout and stderr. The problem has not been solved yet.
Thanks Matei!
In the example all three items have the same key, so they go to the same
partition:
scala> sc.parallelize(Seq(0 -> 3, 0 -> 2, 0 -> 1), 3).partitionBy(new
HashPartitioner(3)).glom.collect
Array(Array((0,3), (0,2), (0,1)), Array(), Array())
I guess the apparent stability is just due to the
Hi,
I have a single node (192G RAM) stand-alone spark, with memory
configuration like this in spark-env.sh
SPARK_WORKER_MEMORY=180g
SPARK_MEM=180g
In spark-shell I have a program like this:
val file = sc.textFile("/localpath") // file size is 40G
file.cache()
val output = file.map(line =>
Thanks for the input. I will give foldByKey a shot.
The way I am doing it: the data is partitioned hourly, so I am computing
distinct values hourly. Then I use unionRDD to merge them and compute
distinct on the overall data.
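A sketch of that hourly-union approach (the hourly RDDs and their contents here are hypothetical stand-ins): de-duplicate within each hour, union, then de-duplicate again before counting, since the same (key, value) pair can occur in several hours.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hourly-union").setMaster("local[*]"))

// Hypothetical hourly (key, value) data; in practice these come from
// the hourly partitions.
val hour1 = sc.parallelize(Seq(("a", 1), ("a", 2))).distinct()
val hour2 = sc.parallelize(Seq(("a", 2), ("b", 3))).distinct()

// Union, de-duplicate across hours, then count distinct values per key.
val counts = sc.union(Seq(hour1, hour2))
  .distinct()
  .map { case (k, _) => (k, 1L) }
  .reduceByKey(_ + _)
```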
Is there a way to know which (key, value) pair is resulting in the OOM?
Is