Hello,
I want to count the number of elements in a DStream, like RDD.count().
Since there is no method that does exactly that on a DStream, I thought of using
DStream.count together with an accumulator.
How do I use DStream.count() to count the number of elements in a DStream?
How do I create a shared variable in
Do you see anything suspicious in the logs? How did you run the application?
On Thu, Aug 7, 2014 at 10:02 PM, XiaoQinyu xiaoqinyu_sp...@outlook.com
wrote:
Hi~
I run a Spark Streaming app to receive data from Flume events. When I run in
standalone mode, Spark Streaming can receive the Flume events
I think the eigenvalues and eigenvectors you are talking about are those of
M^T*M or M*M^T, if we write M = U*s*V^T as the SVD. What I want is to get the
eigenvectors and eigenvalues of M itself. Is this my misunderstanding of
linear algebra, or of the API?
M^{*} M = V \Sigma^{*} U^{*} U \Sigma V^{*} = V (\Sigma^{*} \Sigma) V^{*}
I'd appreciate if anyone could confirm whether this is a bug or intended
behavior of Spark.
Thanks,
Milos
They are two different RDDs. Spark doesn't guarantee that the first
partition of RDD1 and the first partition of RDD2 will stay on the
same worker node. If that were the case and you had 1000
single-partition RDDs, the first worker would carry a very heavy load.
-Xiangrui
On Thu, Aug 7, 2014 at 2:20
Do you mean that you want a continuously updated count as more
events/records are received in the DStream (remember, DStream is a
continuous stream of data)? Assuming that is what you want, you can use a
global counter
var globalCount = 0L
dstream.count().foreachRDD(rdd => { globalCount += rdd.first() })
You can also use the updateStateByKey interface to store this shared variable (see
the sketch below). As for the count, you can use foreachRDD to run counts on each
RDD and then store the result as another RDD, or fold it into updateStateByKey.
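A rough sketch of that updateStateByKey approach (a sketch only; the socket source,
batch interval, and the constant "total" key are illustrative, not from this thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val conf = new SparkConf().setAppName("RunningCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("checkpoint")   // updateStateByKey requires a checkpoint directory

val lines = ssc.socketTextStream("localhost", 9999)

// Key every record with a constant so all per-batch counts merge into one state entry.
val perBatch = lines.map(_ => ("total", 1L)).reduceByKey(_ + _)

// State update: add this batch's count to the running total.
val runningTotal = perBatch.updateStateByKey[Long] {
  (newCounts: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + newCounts.sum)
}
runningTotal.print()

ssc.start()
ssc.awaitTermination()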
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
My Spark program just hangs there after Mesos sees many task failures. I read an
earlier post
https://groups.google.com/forum/#!msg/spark-users/RThPAN-5zX8/vuxXp27P5-MJ
which says that we should set Mesos's failover_timeout; otherwise Mesos would
use a rather long default failover timeout.
Hi Vinay,
First of all, you should probably migrate to Spark SQL, as Shark is no longer
actively supported.
The 100x benefit comes from in-memory caching and the DAG execution engine; since
you are not able to cache, the performance can be quite low.
Alternatives you can explore:
1. Use Parquet as storage, which will push down
Thanks for your answer.
But the same problem appears if you start from one common RDD:
val partitioner = new HashPartitioner(10)
val dummyJob = sc.parallelize(0 until 10).map(x => (x, x))
dummyJob.partitionBy(partitioner).foreach { case (ind, x) =>
println("Dummy1 - Id = " +
I'm running spark 1.0.0 on EMR. I'm able to access the master web UI but not
the worker web UIs or the application detail UI (Server not found).
I added the following inbound rule to the ElasticMapreduce-slave security
group but it didn't help:
Type = All TCP
Port range = 0 - 65535
Source = My
Hi All,
I am running a customized label propagation using Pregel. After a few
iterations, the program becomes slow and wastes a lot of time in mapPartitions
(at GraphImpl.scala:184 or VertexRDD.scala:318, or VertexRDD.scala:323). And
the amount of shuffle write reaches 15GB, while the size of
I would like to use Spark (and Spark Streaming) to do some processing on time
series. I have text files with many lines where each line contains a
timestamp and values associated with this timestamp. Each timestamp is
unique. Timestamps are ordered. I am considering them as keys. The lines in
my
Hi Mayur,
I cannot use Spark SQL in this case because many of the aggregations are not
supported yet. Hence I migrated back to Shark, as all those aggregation
functions are supported there.
Hi,
I have a similar problem. I need matrix operations such as dot product,
cross product, transpose, and matrix multiplication to be performed on Spark.
Does Spark have a built-in API to support these?
I see a matrix factorization implementation in MLlib.
On Fri, Aug 8, 2014 at 12:38 PM, yaochunnan [via
Hello,
Thanks a lot! I installed Maven 3.2.2 and the build worked with Maven.
But I also got the prebuilt version to run, so I will be using the prebuilt
version. Is there any downside to using the prebuilt version?
Also could you tell me what I would need to do if I had to build it without
When I run a Spark program with Mesos, if all slaves are down, the program just
hangs there.
I could check the Mesos UI or logs to know that all slaves are down.
Is there a way to detect this situation in Spark programs automatically?
Or is there a way to detect this situation programmatically?
Try to add the following jars to the classpath
Could be some issue with the way you access it.
If you are able to see http://master-public-ip:8080, then ideally the
application UI (if you haven't changed the default) will be available at
http://master-public-ip:4040.
Similarly, you can see the worker UIs at http://worker-public-ip:8081.
The SVD does not in general give you eigenvalues of its input.
Are you just trying to access the U and V matrices? They are also
returned in the API. But they are not the eigenvectors of M, as you
note.
I don't think MLlib has anything to help with the general eigenvector problem.
Maybe you can
By the way, for anyone using elasticsearch-hadoop, there is a fix for this
here: https://github.com/elasticsearch/elasticsearch-hadoop/issues/239
Ryan - using the nightly snapshot build of 2.1.0.BUILD-SNAPSHOT fixed this
for me.
On Thu, Aug 7, 2014 at 3:58 PM, Nick Pentreath
Hi Team,
Do we have Job ACLs for Spark similar to Hadoop Job ACLs,
where I can restrict who can submit jobs to the Spark Master service?
In our Hadoop cluster we enabled Job ACLs by using job queues,
restricting the default queues, and using the Fair Scheduler for managing the
Thanks for your answers. I added some lines to my code and it went through,
but I get an error message for my compute-cost call now...
scala> val WSSSE = model.computeCost(train)
14/08/08 15:48:42 WARN BlockManagerMasterActor: Removing BlockManager
BlockManagerId(driver, 192.168.0.33, 49242, 0)
Thank you. I will look into setting up a Hadoop HDFS node.
Thanks. I tried this option, but still got the same error.
I want to keep track of the events processed in a batch.
How come 'globalCount' works for a DStream? I think a similar construct won't
work for an RDD; that's why there are accumulators.
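To make the distinction concrete, a minimal sketch of the two constructs I am
comparing (assuming a StreamingContext `ssc` and an input DStream `events`; both
names are illustrative):

var globalCount = 0L
events.count().foreachRDD { rdd =>
  globalCount += rdd.first()      // updating a plain var inside the foreachRDD function
}

val acc = ssc.sparkContext.accumulator(0L)
events.foreachRDD { rdd =>
  rdd.foreach(_ => acc += 1L)     // incrementing an accumulator inside rdd.foreach
}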
On Fri, Aug 8, 2014 at 12:52 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Do you mean that you want a
Is it possible to keep the events in memory rather than pushing them out to
the file system?
Ali
Thanks Andrew. Actually my job did not use any data in .lzo format. Here is
the program itself:
import org.apache.spark._
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
object Test {
def main(args: Array[String]) {
val
(-incubator, +user)
It's a method of KMeansModel, not KMeans. On first glance it looks
like model should be a KMeansModel, but Scala says it's not. The
problem is...
val model = new KMeans()
.setInitializationMode("k-means||")
.setK(2)
.setMaxIterations(2)
.setEpsilon(1e-4)
.setRuns(1)
.run(train)
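For reference, a sketch of the intended chained usage (assuming `train` is an
RDD[Vector]): run() returns a KMeansModel, and computeCost is defined on the model.

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

val model: KMeansModel = new KMeans()
  .setInitializationMode("k-means||")
  .setK(2)
  .setMaxIterations(2)
  .setEpsilon(1e-4)
  .setRuns(1)
  .run(train)

val WSSSE = model.computeCost(train)   // within-set sum of squared errors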
Ah, that makes perfect sense. Thanks for the concise explanation!
On Thu, Aug 7, 2014 at 9:14 PM, Xiangrui Meng men...@gmail.com wrote:
ratings.map { case Rating(u, m, r) => {
  val pred = model.predict(u, m)
  (r - pred) * (r - pred)
}
}.mean()
The code doesn't work because the
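If the issue is that the model cannot be used inside a closure, one common
workaround is to score all pairs with the RDD-based predict instead. A sketch
(assuming `model` is an MLlib ALS MatrixFactorizationModel and `ratings` is an
RDD[Rating]):

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.Rating

// Predict every (user, product) pair in one distributed call.
val userProducts = ratings.map { case Rating(u, m, _) => (u, m) }
val predictions = model.predict(userProducts)
  .map { case Rating(u, m, pred) => ((u, m), pred) }

// Join predictions back to the observed ratings and compute the MSE.
val mse = ratings.map { case Rating(u, m, r) => ((u, m), r) }
  .join(predictions)
  .map { case (_, (r, pred)) => (r - pred) * (r - pred) }
  .mean()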
Hi,
I am using a single-node Spark cluster on HDFS. When I was going through
the SparkPageRank.scala code, I came across the following line:
val lines = ctx.textFile(args(0), 1)
where args(0) is the path of the input file on HDFS, and the second
argument is the minimum number of Hadoop splits
Hi There
I ran into a problem and can’t find a solution.
I was running bin/pyspark ../python/wordcount.py
The wordcount.py is here:
import sys
from operator import add
from pyspark import SparkContext
datafile = '/mnt/data/m1.txt'
sc =
Do you know how I might do a percentile then? I can't figure out how to order
my data and count it so that I can calculate the percentile.
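One possible approach, sketched below (a sketch only, assuming the values of
interest are in an RDD[Double] named `values`; p is the desired percentile):

import org.apache.spark.SparkContext._

val p = 0.95
val indexed = values.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
val n = indexed.count()
val rank = math.max(math.ceil(p * n).toLong - 1, 0L)   // index of the p-th percentile
val percentile = indexed.lookup(rank).head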
Also, I tried that code and I keep getting this error:
<console>:26: error: overloaded method value max with alternatives:
(x$1: Double,x$2: Double)Double and
(x$1: Float,x$2: Float)Float and
(x$1: Long,x$2: Long)Long and
(x$1: Int,x$2: Int)Int
cannot be applied to (String, Int.type)
I was using sbt package when I got this error. Then I switched to using sbt
assembly and that solved the issue. To run sbt assembly, you need to have
a file called plugins.sbt in the project/ directory under the project root, and it
has the following line:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
Thanks for posting the solution! You can also append `% provided` to
the `spark-mllib` dependency line and remove `spark-core` (because
spark-mllib already depends on spark-core) to make the assembly jar
smaller. -Xiangrui
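A sketch of what the resulting build.sbt might look like (the Scala and Spark
versions here are illustrative, not from this thread):

name := "my-mllib-app"

scalaVersion := "2.10.4"

// spark-mllib pulls in spark-core, and "provided" keeps it out of the assembly jar.
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.0.2" % "provided"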
On Fri, Aug 8, 2014 at 10:05 AM, SK skrishna...@gmail.com wrote:
i was
Just realized that my DStream was being read in as a stream of Strings. I'm
trying to use socketTextStream, but the .toInt method doesn't seem to be
working. Is there another way to get a numerical stream from a socket?
Is it possible to create custom transformations in Spark? For example, data
security transforms such as encrypt and decrypt. Ideally it's something one
would like to reuse across Spark Streaming, Spark SQL, and core Spark.
@Miles, eigen-decomposition of an asymmetric matrix doesn't always give
real-valued solutions, and it doesn't have the nice properties that a
symmetric matrix has. Usually you want to symmetrize your asymmetric
matrix in some way, e.g. see
I currently have a 4-node Spark setup, 1 master and 3 workers running in
Spark standalone mode. I am currently stress testing a Spark application I
wrote that reads data from Kafka and puts it into Redshift. I'm pretty happy
with the performance (reading about 6k messages per second out of Kafka)
So I think I have a better idea of the problem now.
The environment is YARN client mode, and IIRC PySpark doesn't run in YARN
cluster mode.
So my client is heavily loaded, which causes it to lose a lot of executors,
which might be part of the problem.
Btw, any plans for supporting PySpark in YARN cluster mode?
Same here, Ravi. See my post on a similar thread.
Are you running in YARN client mode?
On Aug 7, 2014 2:56 PM, rpandya r...@iecommerce.com wrote:
I'm running into a problem with executors failing, and it's not clear what's
causing it. Any suggestions on how to diagnose or fix it would be
Hi,
I am able to run my HQL query in yarn-cluster mode when connecting to the
default Hive metastore defined in hive-site.xml.
However, if I want to switch to a different database, like:
hql("use other-database")
it only works in yarn-client mode, but fails in yarn-cluster mode with the
Figured it out! Just mapped it to a .toInt version of itself.
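In case it helps anyone else, roughly this (a sketch; the socket host/port and the
per-batch max are illustrative, assuming a StreamingContext `ssc`):

// Parse the text stream into integers before aggregating.
val nums = ssc.socketTextStream("localhost", 9999).map(_.trim.toInt)
val maxPerBatch = nums.reduce((a, b) => math.max(a, b))
maxPerBatch.print()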
I'm processing about 10GB of tab-delimited raw data with a few fields (page
and user id, along with the timestamp when the user viewed the page) using a
40-node cluster, and using Spark SQL to compute the number of unique visitors
per page at various intervals. I'm currently just reading the data as
Hi Avishek,
As of Spark 1.0, PySpark does in fact run on YARN.
-Sandy
On Fri, Aug 8, 2014 at 12:47 PM, Avishek Saha avishek.s...@gmail.com
wrote:
So I think I have a better idea of the problem now.
The environment is YARN client and IIRC PySpark doesn't run on YARN
cluster.
So my client
You mean YARN cluster mode, right?
Also, my jobs run through all their stages just fine. But the entire
job crashes when I do a saveAsTextFile.
On 8 August 2014 13:24, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Avishek,
As of Spark 1.0, PySpark does in fact run on YARN.
-Sandy
On Fri, Aug 8,
Hey, I was reading the Berkeley paper Spark: Cluster Computing with Working
Sets and came across a sentence which is bothering me. Currently I am
trying to run a Python script on Spark which executes a parallel k-means
... my problem is ...
after the algorithm finishes working with the dataset (ca.
Btw, I get this for Spark 1.0.2.
I guess YARN cluster mode is still not supported for PySpark.
-
Error: Cluster deploy mode is currently not supported for python.
Run with --help for usage help or --verbose for debug
For the reduce/aggregate question, the driver collects results in
sequence. We now use tree aggregation in MLlib to reduce the driver's
load:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L89
It is faster than a flat aggregate when there are many
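A minimal sketch of the difference (assuming a recent Spark where treeAggregate is
available on RDD itself; the data, partition count, and depth are illustrative):

val data = sc.parallelize(1 to 1000000, 200)

// Flat aggregate: the driver merges every partition's result itself.
val flatSum = data.aggregate(0L)(_ + _, _ + _)

// Tree aggregate: partial results are combined on executors in a few rounds,
// so the driver only merges a handful of values at the end.
val treeSum = data.treeAggregate(0L)(_ + _, _ + _, depth = 2)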
Hi Avishek,
I'm running on a manual cluster setup, and all the code is Scala. The load
averages don't seem high when I see these failures (about 12 on a 16-core
machine).
Ravi
near bottom:
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
On Fri, Aug 8, 2014 at 2:00 AM, chutium teng@gmail.com wrote:
try to add following jars in classpath
What exactly do you mean by YARN cluster? Do you mean running Spark
against a YARN cluster in general, or specifically in yarn-cluster mode,
where the driver runs inside a Spark application master?
Also, what error are you seeing in your executors?
-Sandy
On Fri, Aug 8, 2014 at 2:00 PM,
I recently moved my Spark installation from one Linux user to another,
i.e. changed the folder and the ownership of the files. That was everything; no
other settings were changed and no different machines were used.
However, it now suddenly takes three minutes until all executors show up in the
Spark shell
Generally the adjacency matrix is undirected (symmetric) for a social network, so
you can get the eigenvectors from the computed SVD result.
A = U D V^T
The first column of U is the eigenvector corresponding to the first (largest)
value of D.
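A small sketch of reading the decomposition off MLlib (assuming Spark 1.x and a
symmetric matrix stored row-wise; the tiny example matrix is illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(2.0, 1.0, 0.0),
  Vectors.dense(1.0, 2.0, 1.0),
  Vectors.dense(0.0, 1.0, 2.0)))       // small symmetric example

val mat = new RowMatrix(rows)
val svd = mat.computeSVD(3, computeU = true)

// svd.s holds the singular values (largest first); for a symmetric input the
// singular vectors agree with the eigenvectors up to sign.
println(svd.s)
println(svd.V)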
xj @ Tokyo
On Sat, Aug 9, 2014 at 4:08 AM, Li Pu
Hi,
Can anyone point me to where to find the SQL dialect for Spark SQL? Unlike
HQL, there are a lot of tasks involved in creating and querying tables, which
is very cumbersome. If we have to fire multiple queries against tens or
hundreds of tables, then it is very difficult at this point. Given Spark
Hi TD,
I tried some different Maven setups these days, and now I can at least get
something when running mvn test. However, it seems like ScalaTest cannot
find the test cases specified in the test suite.
Here is the output I get:
You can always define an arbitrary RDD-to-RDD function and use it from both
Spark and Spark Streaming. For example,
def myTransformation(rdd: RDD[X]): RDD[Y] = { ... }
In Spark you can obviously apply it to an RDD. In Spark Streaming, you can
apply it to the RDDs of a DStream by
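A minimal sketch of that pattern (assuming DStream.transform; the RDD[String]
types, the `lines` DStream, and the reverse stand-in for a real encrypt step are
illustrative):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

def encryptRecords(rdd: RDD[String]): RDD[String] =
  rdd.map(record => record.reverse)          // stand-in for a real encrypt function

// Plain Spark: apply the function to an RDD directly.
val secured: RDD[String] = encryptRecords(sc.textFile("hdfs:///input"))

// Spark Streaming: apply the same function to each RDD of a DStream.
val securedStream: DStream[String] = lines.transform(encryptRecords _)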
In Spark Streaming, StreamingContext.fileStream gives a FileInputDStream.
Within each batch interval, it launches map tasks for the new files
detected during that interval. It appears that the way Spark computes the
number of map tasks is based on the block size of the files.
Below is the quote from
Our prototype application reads a 20GB dataset from HDFS (nearly 180
partitions), groups it by key, sorts by rank, and writes it out to HDFS in that
order. The job runs against two nodes (16GB and 24 cores per node available to
the job). I noticed that the execution plan results in two sortByKey
stages,