It seems possible that you are running out of memory unrolling a single
partition of the RDD. This is something that can cause your executor to
OOM, especially if the cache is close to being full, so the executor doesn't
have much free memory left. How large are your executors? At the time of
BTW - the reason the workaround could help is that when persisting
to DISK_ONLY, we explicitly avoid materializing the RDD partition in
memory... we just pass it through to disk.
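For example, a minimal sketch of that workaround in the shell (the rdd below is a hypothetical stand-in for whatever you are caching):

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.DISK_ONLY)  // partitions are written straight to disk instead of being unrolled in memory
rdd.count()                          // the first action materializes the on-disk copy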
On Mon, Aug 4, 2014 at 1:10 AM, Patrick Wendell pwend...@gmail.com wrote:
It seems possible that you are
For Hortonworks, I believe it should work to just link against the
corresponding upstream version, i.e. just set the Hadoop version to 2.4.0.
Does that work?
- Patrick
On Mon, Aug 4, 2014 at 12:13 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid
wrote:
Hi,
Not sure whose issue this is, but if
Are you directly caching files from Hadoop or are you doing some
transformation on them first? If you are doing a groupBy or some type of
transformation, then you could be causing data skew that way.
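For illustration, a minimal sketch (the input path and key extraction are hypothetical) of the kind of transformation that can leave one cached partition much larger than the rest:

val raw = sc.textFile("hdfs:///data/input")                              // hypothetical path
val grouped = raw.map(line => (line.split(",")(0), line)).groupByKey()   // hypothetical key
grouped.cache()   // if one key dominates, the partition holding it can be far larger than the others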
On Sun, Aug 3, 2014 at 1:19 PM, iramaraju iramar...@gmail.com wrote:
I am running spark 1.0.0,
That failed since it defaulted the versions for yarn and hadoop
I’ll give it a try with just 2.4.0 for both yarn and hadoop…
Thanks,
Ron
On Aug 4, 2014, at 9:44 AM, Patrick Wendell pwend...@gmail.com wrote:
Can you try building without any of the special `hadoop.version` flags and
just
I meant yarn and hadoop defaulted to 1.0.4 so the yarn build fails since 1.0.4
doesn’t exist for yarn...
Thanks,
Ron
On Aug 4, 2014, at 10:01 AM, Ron's Yahoo! zlgonza...@yahoo.com wrote:
That failed since it defaulted the versions for yarn and hadoop
I’ll give it a try with just 2.4.0 for
Thanks Patrick.
But why am I getting a Bad Digest error when I am saving a large amount of
data to S3?
Loss was due to org.apache.hadoop.fs.s3.S3Exception
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
S3 PUT failed for
I don’t think there is an hwx profile, but there probably should be.
- Steve
From: Patrick Wendell pwend...@gmail.com
Date: Monday, August 4, 2014 at 10:08
To: Ron's Yahoo! zlgonza...@yahoo.com
Cc: Ron's Yahoo! zlgonza...@yahoo.com.invalid, Steve Nunez
snu...@hortonworks.com,
The profile does set it automatically:
https://github.com/apache/spark/blob/master/pom.xml#L1086
yarn.version should default to hadoop.version.
It shouldn't hurt, and should work, to set it to any other specific
version. If one HDP version works and another doesn't, are you sure
the repo has the
What would such a profile do though? In general building for a
specific vendor version means setting hadoop.version and/or
yarn.version. Any hard-coded value is unlikely to match what a
particular user needs. Setting protobuf versions and so on is already
done by the generic profiles.
In a
Hmm. Fair enough. I hadn't given that answer much thought and on
reflection think you're right in that a profile would just be a bad hack.
On 8/4/14, 10:35, Sean Owen so...@cloudera.com wrote:
What would such a profile do though? In general building for a
specific vendor version means setting
I've got a 40-node CDH 5.1 cluster and am attempting to run a simple Spark app that
processes about 10-15GB of raw data, but I keep running into this error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Each node has 8 cores and 2GB of memory. I notice the heap size on the
executors is set to
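For reference, a minimal sketch of raising the executor heap through SparkConf (the value is only illustrative; on a 2GB node you have to leave headroom for the OS and non-heap JVM memory):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")                          // hypothetical app name
  .set("spark.executor.memory", "1536m")         // illustrative value, not a recommendation
val sc = new SparkContext(conf)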
I am trying to create an asynchronous thread using the Java executor service and
launching the JavaSparkContext in this thread. But it is failing with exit
code 0 (zero). I basically want to submit the Spark job in one thread and
continue doing something else after submitting. Any help on this? Thanks.
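A minimal sketch of that pattern (in Scala rather than Java, with a hypothetical job); the usual catch is that if the main thread returns before the background job finishes, the JVM terminates with code 0:

import java.util.concurrent.{Callable, Executors}
import org.apache.spark.{SparkConf, SparkContext}

val pool = Executors.newSingleThreadExecutor()
val future = pool.submit(new Callable[Long] {
  def call(): Long = {
    val sc = new SparkContext(new SparkConf().setAppName("background-job"))  // hypothetical app
    try sc.parallelize(1 to 1000).count() finally sc.stop()
  }
})
// ... do other work here ...
val result = future.get()   // block before main() returns, otherwise the driver JVM may exit early
pool.shutdown()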
Cool thanks!
On Monday, August 4, 2014 8:58 AM, kriskalish k...@kalish.net wrote:
Hey Ron,
It was pretty much exactly as Sean had depicted. I just needed to provide
count an anonymous function to tell it which elements to count. Since I
wanted to count them all, the function is simply
After playing around with mapPartition I think this does exactly what I
want. I can pass in a function to mapPartition that looks like this:
def f1(iter: Iterator[String]): Iterator[MyIndex] = {
  val idx: MyIndex = new MyIndex()
  while (iter.hasNext) { idx.add(iter.next()) }  // hypothetical add(): feed each element into the index
  Iterator(idx)
}
I am using the latest Spark master and additionally, I am loading these jars:
- spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar
- twitter4j-core-4.0.2.jar
- twitter4j-stream-4.0.2.jar
My simple test program that I execute in the shell looks as follows:
import org.apache.spark.streaming._
Hi all,
I have set up 2 nodes (master and slave1) in standalone mode. Tried running
the SparkPi example and it's working fine. However, when I move on to wordcount
it's giving me the error below:
14/08/04 21:40:33 INFO storage.MemoryStore: ensureFreeSpace(32856) called
with curMem=0, maxMem=311387750
One key thing I forgot to mention is that I changed the avro version to 1.7.7
to get AVRO-1476.
I took a closer look at the jars, and what I noticed is that the assembly jars
that work do not have the org.apache.avro.mapreduce package packaged into the
assembly. For spark-1.0.1,
(- incubator list, + user list)
(Answer copied from original posting at
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-app-throwing-java-lang-OutOfMemoryError-GC-overhead-limit/m-p/16396#U16396
-- let's follow up in one place. If it's not specific to CDH, this is a
good place
At 2014-08-04 20:52:26 +0800, Bin wubin_phi...@126.com wrote:
I wonder how Spark parameters, e.g., the degree of parallelism, affect Pregel
performance? Specifically, sendMessage, mergeMessage, and vertexProgram?
I have tried label propagation on a graph with 300,000 edges, and I found that no
I am (not) seeing this also... No items in the storage UI page, using 1.0
with HDFS...
Hi all,
Could you check with `sc.getExecutorStorageStatus` to see if the blocks are
in fact present? This returns a list of StorageStatus objects, and you can
check whether each status' `blocks` is non-empty. If the blocks do exist,
then this is likely a bug in the UI. There have been a couple of
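A minimal sketch of that check from the driver or shell (field names as in the 1.0-era StorageStatus class):

sc.getExecutorStorageStatus.foreach { status =>
  // 'blocks' maps block ids to their status; an empty map means nothing is cached on that executor
  println(status.blockManagerId + ": " + status.blocks.size + " cached blocks")
}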
Good idea Andrew... Using this feature allowed me to debug that my app
wasn't caching properly -- the UI is working as designed for me in 1.0. It
might be a good idea to say "no cached blocks" instead of an empty page...
just a thought...
On Mon, Aug 4, 2014 at 1:17 PM, Andrew Or-2 [via Apache
This looks pretty comprehensive to me. A few quick suggestions:
- On the VM part: we've actually been avoiding this in all the Databricks
training efforts because the VM itself can be annoying to install and it makes
it harder for people to really use Spark for development (they can learn it,
What is the checkpoint interval of the updateStateByKey DStream?
Did you modify it? Also, do you have a simple program and the
step-by-step process by which I can reproduce the issue? If not, can
you give me the full DEBUG level logs of the program before and after
restart?
TD
On Mon, Aug 4,
If Mesos is allocating a container that is exactly the same size as the max heap,
then that leaves no buffer space for non-heap JVM memory, which
seems wrong to me.
The problem here is that cacheTable is more aggressive about grabbing large
ByteBuffers during caching (which it later releases
Hi there,
I was wondering if somebody could tell me how to create an object with a given
ClassTag so as to make the function below work. The only thing to do is just
to write one line to create an object of class T. I tried `new T` but it does
not work. Would it be possible to give me one Scala line to
I am using spark-1.1.0-SNAPSHOT right now and trying to get familiar with
the JDBC thrift server. I have everything compiled correctly, I can access
data in spark-shell on YARN from my Hive installation. Cached tables, etc.
all work.
When I execute ./sbin/start-thriftserver.sh
I get the error
Can you show us the code that you are using to write to RabbitMQ? I
fear that this is a relatively common problem where you are using
something like this:
dstream.foreachRDD(rdd => {
  // create connection / channel to source (this runs on the driver)
  rdd.foreach(element => ...) // write using channel (this runs on the executors)
})
This is not the
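For contrast, the usually recommended shape creates the connection on the workers, once per partition; createChannel(), write() and close() below are hypothetical stand-ins for the RabbitMQ client calls:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val channel = createChannel()                          // opened on the worker, once per partition
    partition.foreach(element => channel.write(element))   // hypothetical write
    channel.close()
  }
}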
Hi,
I was wondering if you can give me an example of how to register the content
of a DStream as a table and access it.
Thanks,
Ali
Hi,
I am trying to run the Big Data Benchmark
https://amplab.cs.berkeley.edu/benchmark/ , and I am stuck at Query 2 for
Spark SQL using Spark 1.0.1:
SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY
SUBSTR(sourceIP, 1, X)
When I look into the source code, it seems that
Can one of the Scala experts please explain this bit of pattern magic from
the Spark ML tutorial: _._2.user ?
As near as I can tell, this is applying the _2 function to the wildcard, and
then applying the 'user' function to that. In a similar way the 'product'
function is applied in the next
This one turned out to be another problem with my app configuration, not with
Spark. The compute task was dependent on the local filesystem, and config
errors on 8 of 10 of the nodes made them fail early. The Spark wrapper was
not checking the process exit value, so it appeared as if they were
There are other threads on the mailing list that have the solution.
For example,
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-streaming-td9668.html
It's recommended that you search through them before posting :)
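For reference, a minimal sketch along those lines (assuming the 1.0-era Spark SQL API; wordStream and the Record case class are hypothetical):

import org.apache.spark.sql.SQLContext

case class Record(word: String)
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

wordStream.foreachRDD { rdd =>
  rdd.map(Record(_)).registerAsTable("records")   // each batch re-registers the "records" table
  sqlContext.sql("SELECT word, COUNT(*) FROM records GROUP BY word").collect().foreach(println)
}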
On Mon, Aug 4, 2014 at 2:45 PM, salemi alireza.sal...@udo.edu wrote:
Hi,
ratings is an RDD of Rating objects. You can see them created as the
second element of the tuple. It's a simple case class:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L66
This is just accessing the user and product field of
Hi Steve,
The _ notation can be a bit confusing when starting with Scala; we can
rewrite the code to avoid using it here. So instead of
val numUsers = ratings.map(_._2.user) we can write val numUsers =
ratings.map(x => x._2.user)
ratings is a key-value RDD (which is an RDD composed of tuples), and so
Hello,
I downloaded and built Spark 1.0.1 using sbt/sbt assembly. Once built I
attempted to go through a couple examples. I could run Spark interactively
through the Scala Shell and the example sc.parallelize(1 to 1000).count()
returned correctly with 1000. Then I attempted to run the example
1. Does your cluster have access to the machines that run kafka?
2. Is there any error in logs? If so can you please post them?
TD
On Mon, Aug 4, 2014 at 1:12 PM, salemi alireza.sal...@udo.edu wrote:
Hi,
I have the following driver and it works when I run it in the local[*] mode
but if I
Hi all,
Any help would be much appreciated.
Thanks,
Al
On Mon, Aug 4, 2014 at 7:09 PM, Al Amin alamin.is...@gmail.com wrote:
Hi all,
I have set up 2 nodes (master and slave1) in standalone mode. Tried
running the SparkPi example and it's working fine. However, when I move on to
wordcount it's
Hi All,
I am trying to move away from spark-shell to spark-submit and have been making
some code changes. However, I am now having a problem with serialization. It
used to work fine before the code update. Not sure what I did wrong. However,
here is the code
JaccardScore.scala
Hello,
Try something like this:
scala> def newFoo[T]()(implicit ct: ClassTag[T]): T =
ct.runtimeClass.newInstance().asInstanceOf[T]
newFoo: [T]()(implicit ct: scala.reflect.ClassTag[T])T
scala> newFoo[String]()
res2: String =
scala> newFoo[java.util.ArrayList[String]]()
res5:
Using 3.0.3 (downloaded from http://mvnrepository.com/artifact/org.twitter4j
) changes the error to
Exception in thread Thread-55 java.lang.NoClassDefFoundError:
twitter4j/StatusListener
at
org.apache.spark.streaming.twitter.TwitterInputDStream.getReceiver(TwitterInputDStream.scala:55)
3.0.3 is being used
https://github.com/apache/spark/blob/master/external/twitter/pom.xml
Are you sure you are deploying twitter4j 3.0.3, and there is no
other version of twitter4j on the path?
TD
On Mon, Aug 4, 2014 at 4:48 PM, durin m...@simon-schaefer.net wrote:
Using 3.0.3 (downloaded
Aaah sorry, I should have been clearer. Can you give me INFO (DEBUG
even better) level logs since the start of the program? I need to see
how the cleanup code is managing to delete the block.
TD
On Fri, Aug 1, 2014 at 10:26 PM, Kanwaldeep kanwal...@gmail.com wrote:
Here is the log file.
Hi Yan,
That is a good suggestion. I believe non-Zookeeper offset management will
be a feature in the upcoming Kafka 0.8.2 release tentatively scheduled for
September.
https://cwiki.apache.org/confluence/display/KAFKA/Inbuilt+Consumer+Offset+Management
That should make this fairly easy to
In the WebUI Environment tab, the section Classpath Entries lists the
following ones as part of System Classpath:
/foo/hadoop-2.0.0-cdh4.5.0/etc/hadoop
/foo/spark-master-2014-07-28/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.0.0-cdh4.5.0.jar
/foo/spark-master-2014-07-28/conf
From the log, I noticed the substr support was added on July 15th; the 1.0.1
release should be earlier than that. The community is now working on releasing 1.1.0,
and some performance improvements were also added. You can probably try
that for your benchmark.
Cheng Hao
-Original Message-
Yeah, there will likely be a community preview build soon for the 1.1
release. Benchmarking that will both give you better performance and help
QA the release.
Bonus points if you turn on codegen for Spark SQL (experimental feature)
when benchmarking and report bugs: SET spark.sql.codegen=true
Hello Spark Users,
I have a Spark Streaming program that streams data from Kafka topics and
outputs Parquet files on HDFS.
Now I want to write a unit test for this program to make sure the output
data is correct (i.e. not missing any data from Kafka).
However, I have no idea about how to do
Appropriately timed question! Here is the PR that adds a real unit
test for Kafka stream in Spark Streaming. Maybe this will help!
https://github.com/apache/spark/pull/1751/files
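Besides that PR, one lightweight approach is to drive the streaming logic with queueStream so the input is fully controlled (a sketch only; the batch interval and test data are arbitrary):

import scala.collection.mutable
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
val input = mutable.Queue(sc.makeRDD(Seq("a", "b", "a")))
val results = mutable.ArrayBuffer[(String, Long)]()
ssc.queueStream(input).countByValue().foreachRDD(rdd => results ++= rdd.collect())
ssc.start()
// in a real test: wait for the batch to finish, stop the context, then assert that
// results contains ("a", 2) and ("b", 1), i.e. nothing from the input was lost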
On Mon, Aug 4, 2014 at 6:30 PM, JiajiaJing jj.jing0...@gmail.com wrote:
Hello Spark Users,
I have a spark
This helps a lot!!
Thank you very much!
Jiajia
Are you able to run it locally? If not, can you try creating an
all-inclusive jar with all transitive dependencies together (sbt
assembly) and then try running the app? Then this will be a self-contained
environment, which will help us debug better.
TD
On Mon, Aug 4, 2014 at 5:06 PM, durin
To get the ClassTag object inside your function with the original syntax you
used (T: ClassTag), you can do this:
import scala.reflect.{classTag, ClassTag}

def read[T: ClassTag](): T = {
  val ct = classTag[T]
  ct.runtimeClass.newInstance().asInstanceOf[T]
}
Passing the ClassTag with : ClassTag lets you have an implicit parameter that
Actually, if you don't use a method like persist or cache, it doesn't even store the
RDD to disk. Every time you use this RDD, it is just recomputed from the
original one.
In logistic regression from MLlib, they don't persist the changed input, so I
can't see the RDD from the web GUI.
I have
I have
Is there a way to visualize the task dependency graph of an application,
during or after its execution? The list of stages on port 4040 is useful,
but still quite limited. For example, I've found that if I don't cache() the
result of one expensive computation, it will get repeated 4 times, but it
Hello everybody!
I'm getting started with Spark and MLlib. I've succeeded in building a
small cluster and following the tutorial. However, I would like to ask about
how to use the model that is trained by MLlib. I understand that, with
data, we can train a model such as a classifier model, then
Some extra work is needed to close the loop. One related example is
streaming linear regression added by Jeremy very recently:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala
You can use a model trained offline
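For the offline case, a minimal sketch of training a classifier and then scoring new data (trainingData and testData are hypothetical RDD[LabeledPoint]s):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

val model = LogisticRegressionWithSGD.train(trainingData, 100)           // 100 iterations
val one = model.predict(Vectors.dense(0.5, 1.2, -0.3))                   // score a single example
val scored = testData.map(p => (model.predict(p.features), p.label))     // or score a whole RDD and compare to labels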
I agree that this is definitely useful.
One related project I know of is Sparkling [1] (also see talk at Spark
Summit 2014 [2]), but it'd be great (and I imagine somewhat
challenging) to visualize the *physical execution* graph of a Spark
job.
[1] http://pr01.uml.edu/
[2]