Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Patrick Wendell
It seems possible that you are running out of memory unrolling a single partition of the RDD. This is something that can cause your executor to OOM, especially if the cache is close to being full so the executor doesn't have much free memory left. How large are your executors? At the time of

Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Patrick Wendell
BTW - the reason why the workaround could help is because when persisting to DISK_ONLY, we explicitly avoid materializing the RDD partition in memory... we just pass it through to disk On Mon, Aug 4, 2014 at 1:10 AM, Patrick Wendell pwend...@gmail.com wrote: It seems possible that you are
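
For reference, the workaround being described looks roughly like this in user code (a minimal sketch; `rdd` stands for whatever RDD is being cached):

    import org.apache.spark.storage.StorageLevel

    // DISK_ONLY streams partitions straight to disk instead of unrolling them
    // in memory first, which is what avoids the OOM described above.
    val persisted = rdd.persist(StorageLevel.DISK_ONLY)
    persisted.count()   // force materialization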

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
For Hortonworks, I believe it should work to just link against the corresponding upstream version, i.e. just set the Hadoop version to 2.4.0. Does that work? - Patrick On Mon, Aug 4, 2014 at 12:13 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid wrote: Hi, Not sure whose issue this is, but if

Re: Cached RDD Block Size - Uneven Distribution

2014-08-04 Thread Patrick Wendell
Are you directly caching files from Hadoop or are you doing some transformation on them first? If you are doing a groupBy or some type of transformation, then you could be causing data skew that way. On Sun, Aug 3, 2014 at 1:19 PM, iramaraju iramar...@gmail.com wrote: I am running spark 1.0.0,

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Ron's Yahoo!
That failed since it defaulted the versions for yarn and hadoop. I’ll give it a try with just 2.4.0 for both yarn and hadoop… Thanks, Ron On Aug 4, 2014, at 9:44 AM, Patrick Wendell pwend...@gmail.com wrote: Can you try building without any of the special `hadoop.version` flags and just

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Ron's Yahoo!
I meant yarn and hadoop defaulted to 1.0.4 so the yarn build fails since 1.0.4 doesn’t exist for yarn... Thanks, Ron On Aug 4, 2014, at 10:01 AM, Ron's Yahoo! zlgonza...@yahoo.com wrote: That failed since it defaulted the versions for yarn and hadoop I’ll give it a try with just 2.4.0 for

Re: Bad Digest error while doing aws s3 put

2014-08-04 Thread lmk
Thanks Patrick. But why am I getting a Bad Digest error when I am saving a large amount of data to s3? Loss was due to org.apache.hadoop.fs.s3.S3Exception org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Steve Nunez
I don’t think there is an hwx profile, but there probably should be. - Steve From: Patrick Wendell pwend...@gmail.com Date: Monday, August 4, 2014 at 10:08 To: Ron's Yahoo! zlgonza...@yahoo.com Cc: Ron's Yahoo! zlgonza...@yahoo.com.invalid, Steve Nunez snu...@hortonworks.com,

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Sean Owen
The profile does set it automatically: https://github.com/apache/spark/blob/master/pom.xml#L1086 yarn.version should default to hadoop.version. It shouldn't hurt, and should work, to set it to any other specific version. If one HDP version works and another doesn't, are you sure the repo has the

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Sean Owen
What would such a profile do though? In general, building for a specific vendor version means setting hadoop.version and/or yarn.version. Any hard-coded value is unlikely to match what a particular user needs. Setting protobuf versions and so on is already done by the generic profiles. In a

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Steve Nunez
Hmm. Fair enough. I hadn't given that answer much thought and on reflection think you're right in that a profile would just be a bad hack. On 8/4/14, 10:35, Sean Owen so...@cloudera.com wrote: What would such a profile do though? In general building for a specific vendor version means setting

Spark app throwing java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-08-04 Thread buntu
I've got a 40-node CDH 5.1 cluster and am attempting to run a simple Spark app that processes about 10-15GB of raw data, but I keep running into this error: java.lang.OutOfMemoryError: GC overhead limit exceeded Each node has 8 cores and 2GB memory. I notice the heap size on the executors is set to
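
As a starting point, the executor heap and parallelism can be set explicitly when the app is created (a sketch with illustrative values; the right numbers depend on the cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative settings only: with 2 GB per node there is little headroom,
    // so the executor heap and parallelism usually need explicit tuning.
    val conf = new SparkConf()
      .setAppName("simple-app")
      .set("spark.executor.memory", "1g")
      .set("spark.default.parallelism", "320")   // ~8 cores x 40 nodes
    val sc = new SparkContext(conf)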

Re: Spark job tracker.

2014-08-04 Thread abhiguruvayya
I am trying to create an asynchronous thread using the Java executor service and launch the JavaSparkContext in this thread. But it is failing with exit code 0 (zero). I basically want to submit a Spark job in one thread and continue doing something else after submitting. Any help on this? Thanks.
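
One way to structure this (a sketch, assuming the driver process stays alive; the input path and job body are placeholders):

    import java.util.concurrent.Executors
    import org.apache.spark.{SparkConf, SparkContext}

    // Run the Spark work on a background thread and keep the Future so the
    // main thread can check for failures later instead of exiting silently.
    val pool = Executors.newSingleThreadExecutor()
    val job = pool.submit(new Runnable {
      override def run(): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("background-job"))
        try {
          sc.textFile("hdfs:///some/input").count()   // placeholder work
        } finally {
          sc.stop()
        }
      }
    })
    // ... continue with other work; job.get() later waits and surfaces exceptions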

Re: Computing mean and standard deviation by key

2014-08-04 Thread Ron Gonzalez
Cool thanks!  On Monday, August 4, 2014 8:58 AM, kriskalish k...@kalish.net wrote: Hey Ron, It was pretty much exactly as Sean had depicted. I just needed to provide count an anonymous function to tell it which elements to count. Since I wanted to count them all, the function is simply
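
For anyone else landing on this thread, one self-contained way to get per-key mean and standard deviation (a sketch, not the exact code discussed; `pairs` is an assumed RDD[(String, Double)]):

    // Accumulate (count, sum, sum of squares) per key, then derive the stats.
    val stats = pairs
      .aggregateByKey((0L, 0.0, 0.0))(
        { case ((n, s, q), x) => (n + 1, s + x, q + x * x) },
        { case ((n1, s1, q1), (n2, s2, q2)) => (n1 + n2, s1 + s2, q1 + q2) })
      .mapValues { case (n, s, q) =>
        val mean = s / n
        (mean, math.sqrt(q / n - mean * mean))   // (mean, population stddev)
      }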

Re: creating a distributed index

2014-08-04 Thread Philip Ogren
After playing around with mapPartitions I think this does exactly what I want. I can pass in a function to mapPartitions that looks like this: def f1(iter: Iterator[String]): Iterator[MyIndex] = { val idx: MyIndex = new MyIndex() while (iter.hasNext) {
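
Completed, the sketch above would look something like this (MyIndex and its add method are the poster's own code, so those names are assumptions):

    def f1(iter: Iterator[String]): Iterator[MyIndex] = {
      val idx: MyIndex = new MyIndex()
      while (iter.hasNext) {
        idx.add(iter.next())      // hypothetical method on MyIndex
      }
      Iterator(idx)               // one index object per partition
    }

    val indexes = rdd.mapPartitions(f1)   // rdd: RDD[String], assumed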

Spark Streaming fails - where is the problem?

2014-08-04 Thread durin
I am using the latest Spark master and additionally, I am loading these jars: - spark-streaming-twitter_2.10-1.1.0-SNAPSHOT.jar - twitter4j-core-4.0.2.jar - twitter4j-stream-4.0.2.jar My simple test program that I execute in the shell looks as follows: import org.apache.spark.streaming._
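
For context, such a shell test typically looks roughly like the following (a sketch; OAuth credentials are assumed to be supplied via the usual twitter4j system properties, and `sc` comes from the shell):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    val tweets = TwitterUtils.createStream(ssc, None)   // None = use system properties
    tweets.map(_.getText).print()
    ssc.start()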

Configuration setup and Connection refused

2014-08-04 Thread alamin.ishak
Hi all, I have setup 2 nodes (master and slave1) on stand alone mode. Tried running SparkPi example and its working fine. However when I move on to wordcount its giving me below error: 14/08/04 21:40:33 INFO storage.MemoryStore: ensureFreeSpace(32856) called with curMem=0, maxMem=311387750

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Ron Gonzalez
One key thing I forgot to mention is that I changed the avro version to 1.7.7 to get AVRO-1476. I took a closer look at the jars, and what I noticed is that the assembly jars that work do not have the org.apache.avro.mapreduce package packaged into the assembly. For spark-1.0.1,

Re: Spark app throwing java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-08-04 Thread Sean Owen
(- incubator list, + user list) (Answer copied from original posting at http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-app-throwing-java-lang-OutOfMemoryError-GC-overhead-limit/m-p/16396#U16396 -- let's follow up one place. If it's not specific to CDH, this is a good place

Re: [GraphX] How spark parameters relate to Pregel implementation

2014-08-04 Thread Ankur Dave
At 2014-08-04 20:52:26 +0800, Bin wubin_phi...@126.com wrote: I wonder how Spark parameters, e.g., the level of parallelism, affect Pregel performance? Specifically, sendMessage, mergeMessage, and vertexProgram? I have tried label propagation on a 300,000-edge graph, and I found that no

Re: Can't see any thing one the storage panel of application UI

2014-08-04 Thread anthonyjschu...@gmail.com
I am (not) seeing this also... No items in the storage UI page. using 1.0 with HDFS... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-see-any-thing-one-the-storage-panel-of-application-UI-tp10296p11361.html Sent from the Apache Spark User List

Re: Can't see any thing one the storage panel of application UI

2014-08-04 Thread Andrew Or
Hi all, Could you check with `sc.getExecutorStorageStatus` to see if the blocks are in fact present? This returns a list of StorageStatus objects, and you can check whether each status' `blocks` is non-empty. If the blocks do exist, then this is likely a bug in the UI. There have been a couple of
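
From the shell, that check is a one-liner (a sketch against the Spark 1.0-era API):

    // Print how many cached blocks each executor reports.
    sc.getExecutorStorageStatus.foreach { status =>
      println(s"${status.blockManagerId}: ${status.blocks.size} blocks")
    }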

Re: Can't see any thing one the storage panel of application UI

2014-08-04 Thread anthonyjschu...@gmail.com
Good idea Andrew... Using this feature allowed me to debug that my app wasn't caching properly -- the UI is working as designed for me in 1.0. It might be a good idea to say "no cached blocks" instead of showing an empty page... just a thought... On Mon, Aug 4, 2014 at 1:17 PM, Andrew Or-2 [via Apache

Re: Spark Training Course?

2014-08-04 Thread Matei Zaharia
This looks pretty comprehensive to me. A few quick suggestions: - On the VM part: we've actually been avoiding this in all the Databricks training efforts because the VM itself can be annoying to install and it makes it harder for people to really use Spark for development (they can learn it,

Re: [SparkStreaming]StackOverflow on restart

2014-08-04 Thread Tathagata Das
What is your checkpoint interval of the updateStateByKey's DStream? Did you modify it? Also, do you have a simple program and the step-by-step process by which I can reproduce the issue? If not, can you give me the full DEBUG level logs of the program before and after restart? TD On Mon, Aug 4,
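
For reference, the interval in question is the per-DStream one, set like this (a sketch; `ssc`, `pairs`, and `updateFunc` are assumed from the user's program):

    import org.apache.spark.streaming.Seconds

    ssc.checkpoint("hdfs:///checkpoints/myapp")      // checkpoint directory
    val state = pairs.updateStateByKey(updateFunc)   // stateful DStream
    state.checkpoint(Seconds(50))                    // its checkpoint interval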

Re: SQLCtx cacheTable

2014-08-04 Thread Michael Armbrust
If mesos is allocating a container that is exactly the same as the max heap size then that is leaving no buffer space for non-heap JVM memory, which seems wrong to me. The problem here is that cacheTable is more aggressive about grabbing large ByteBuffers during caching (which it later releases

Create a new object by given classtag

2014-08-04 Thread Parthus
Hi there, I was wondering if somebody could tell me how to create an object with a given ClassTag so as to make the function below work. The only thing to do is just to write one line to create an object of class T. I tried new T but it does not work. Would it be possible to give me one Scala line to

Spark SQL JDBC

2014-08-04 Thread John Omernik
I am using spark-1.1.0-SNAPSHOT right now and trying to get familiar with the JDBC thrift server. I have everything compiled correctly, I can access data in spark-shell on yarn from my hive installation. Cached tables, etc all work. When I execute ./sbin/start-thriftserver.sh I get the error

Re: Writing to RabbitMQ

2014-08-04 Thread Tathagata Das
Can you show us the code that you are using to write to RabbitMQ. I fear that this is a relatively common problem where you are using something like this. dstream.foreachRDD(rdd => { // create connection / channel to source rdd.foreach(element => // write using channel }) This is not the
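
The usual fix for that pattern is to open the connection per partition on the worker side rather than capturing it from the driver (a sketch; createChannel and publish are hypothetical stand-ins for the RabbitMQ client calls):

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { elements =>
        val channel = createChannel()               // hypothetical: built on the worker
        try {
          elements.foreach(e => channel.publish(e)) // hypothetical write call
        } finally {
          channel.close()
        }
      }
    }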

Streaming + SQL : How to register a DStream's content as a table and access it

2014-08-04 Thread salemi
Hi, I was wondering if you could give me an example of how to register a DStream's content as a table and access it. Thanks, Ali -- View this message in context:

Substring in Spark SQL

2014-08-04 Thread Tom
Hi, I am trying to run the Big Data Benchmark https://amplab.cs.berkeley.edu/benchmark/ , and I am stuck at Query 2 for Spark SQL using Spark 1.0.1: SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X) When I look into the source code, it seems that

MovieLensALS - Scala Pattern Magic

2014-08-04 Thread Steve Nunez
Can one of the Scala experts please explain this bit of pattern magic from the Spark ML tutorial: _._2.user ? As near as I can tell, this is applying the _2 function to the wildcard, and then applying the 'user' function to that. In a similar way the 'product' function is applied in the next

Re: Memory compute-intensive tasks

2014-08-04 Thread rpandya
This one turned out to be another problem with my app configuration, not with Spark. The compute task was dependent on the local filesystem, and config errors on 8 of 10 of the nodes made them fail early. The Spark wrapper was not checking the process exit value, so it appeared as if they were

Re: Streaming + SQL : How to register a DStream's content as a table and access it

2014-08-04 Thread Tathagata Das
There are other threads in the mailing list that have the solution. For example, http://apache-spark-user-list.1001560.n3.nabble.com/SQL-streaming-td9668.html It's recommended that you search through them before posting :) On Mon, Aug 4, 2014 at 2:45 PM, salemi alireza.sal...@udo.edu wrote: Hi,
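
For reference, the approach those threads describe boils down to roughly this (a sketch against the Spark 1.0-era SQL API; the Record schema, the `dstream` of (word, count) pairs, and the query are illustrative):

    import org.apache.spark.sql.SQLContext

    case class Record(word: String, num: Int)            // illustrative schema

    val sqlContext = new SQLContext(sc)                  // sc: the underlying SparkContext
    import sqlContext.createSchemaRDD                    // implicit RDD -> SchemaRDD

    dstream.foreachRDD { rdd =>
      rdd.map { case (w, c) => Record(w, c) }.registerAsTable("records")
      sqlContext.sql("SELECT word, num FROM records WHERE num > 10").collect()
    }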

Re: MovieLensALS - Scala Pattern Magic

2014-08-04 Thread Sean Owen
ratings is an RDD of Rating objects. You can see them created as the second element of the tuple. It's a simple case class: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L66 This is just accessing the user and product field of

Re: MovieLensALS - Scala Pattern Magic

2014-08-04 Thread Holden Karau
Hi Steve, The _ notation can be a bit confusing when starting with Scala, so we can rewrite it to avoid using it here. So instead of val numUsers = ratings.map(_._2.user) we can write val numUsers = ratings.map(x => x._2.user) ratings is a key-value RDD (which is an RDD comprised of tuples) and so
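
Spelled out, the equivalent forms are (assuming ratings is an RDD of (key, Rating) tuples, as in the tutorial):

    ratings.map(_._2.user)                              // underscore shorthand
    ratings.map(x => x._2.user)                         // explicit parameter
    ratings.map { case (_, rating) => rating.user }     // pattern match on the tuple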

Running Examples

2014-08-04 Thread cetaylor
Hello, I downloaded and built Spark 1.0.1 using sbt/sbt assembly. Once built I attempted to go through a couple of examples. I could run Spark interactively through the Scala shell, and the example sc.parallelize(1 to 1000).count() returned correctly with 1000. Then I attempted to run the example

Re: spark streaming kafka

2014-08-04 Thread Tathagata Das
1. Does your cluster have access to the machines that run kafka? 2. Is there any error in logs? If so can you please post them? TD On Mon, Aug 4, 2014 at 1:12 PM, salemi alireza.sal...@udo.edu wrote: Hi, I have the following driver and it works when I run it in the local[*] mode but if I
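
For comparison, the receiver-based Kafka input is normally created like this (a sketch with placeholder ZooKeeper/group/topic values; `ssc` is the StreamingContext):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // The receiver runs on an executor, so the executors (not just the driver)
    // must be able to reach Kafka and ZooKeeper.
    val messages = KafkaUtils.createStream(
      ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1))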

Re: Configuration setup and Connection refused

2014-08-04 Thread alamin.ishak
Hi all, Any help would be much appreciated. Thanks, Al On Mon, Aug 4, 2014 at 7:09 PM, Al Amin alamin.is...@gmail.com wrote: Hi all, I have setup 2 nodes (master and slave1) on stand alone mode. Tried running SparkPi example and its working fine. However when I move on to wordcount its

java.lang.IllegalArgumentException: Unable to create serializer com.esotericsoftware.kryo.serializers.FieldSerializer

2014-08-04 Thread Sameer Tilak
Hi All, I am trying to move away from spark-shell to spark-submit and have been making some code changes. However, I am now having a problem with serialization. It used to work fine before the code update. Not sure what I did wrong. However, here is the code JaccardScore.scala

Re: Create a new object by given classtag

2014-08-04 Thread Marcelo Vanzin
Hello, Try something like this: scala> def newFoo[T]()(implicit ct: ClassTag[T]): T = ct.runtimeClass.newInstance().asInstanceOf[T] newFoo: [T]()(implicit ct: scala.reflect.ClassTag[T])T scala> newFoo[String]() res2: String = scala> newFoo[java.util.ArrayList[String]]() res5:

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread durin
Using 3.0.3 (downloaded from http://mvnrepository.com/artifact/org.twitter4j ) changes the error to Exception in thread Thread-55 java.lang.NoClassDefFoundError: twitter4j/StatusListener at org.apache.spark.streaming.twitter.TwitterInputDStream.getReceiver(TwitterInputDStream.scala:55)

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread Tathagata Das
3.0.3 is being used: https://github.com/apache/spark/blob/master/external/twitter/pom.xml Are you sure you are deploying twitter4j 3.0.3, and that there is no other version of twitter4j on the path? TD On Mon, Aug 4, 2014 at 4:48 PM, durin m...@simon-schaefer.net wrote: Using 3.0.3 (downloaded

Re: Spark Streaming : Could not compute split, block not found

2014-08-04 Thread Tathagata Das
Aaah sorry, I should have been more clear. Can you give me INFO (DEBUG even better) level logs since the start of the program? I need to see how the cleaning up code is managing to delete the block. TD On Fri, Aug 1, 2014 at 10:26 PM, Kanwaldeep kanwal...@gmail.com wrote: Here is the log file.

Re: Low Level Kafka Consumer for Spark

2014-08-04 Thread Jonathan Hodges
Hi Yan, That is a good suggestion. I believe non-Zookeeper offset management will be a feature in the upcoming Kafka 0.8.2 release tentatively scheduled for September. https://cwiki.apache.org/confluence/display/KAFKA/Inbuilt+Consumer+Offset+Management That should make this fairly easy to

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread durin
In the WebUI Environment tab, the section Classpath Entries lists the following ones as part of System Classpath: /foo/hadoop-2.0.0-cdh4.5.0/etc/hadoop /foo/spark-master-2014-07-28/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.0.0-cdh4.5.0.jar /foo/spark-master-2014-07-28/conf

RE: Substring in Spark SQL

2014-08-04 Thread Cheng, Hao
From the log, I noticed that substr was added on July 15th; the 1.0.1 release should be earlier than that. The community is now working on releasing 1.1.0, and some performance improvements were also added. Probably you can try that for your benchmark. Cheng Hao -Original Message-

Re: Substring in Spark SQL

2014-08-04 Thread Michael Armbrust
Yeah, there will likely be a community preview build soon for the 1.1 release. Benchmarking that will both give you better performance and help QA the release. Bonus points if you turn on codegen for Spark SQL (experimental feature) when benchmarking and report bugs: SET spark.sql.codegen=true

Unit Test for Spark Streaming

2014-08-04 Thread JiajiaJing
Hello Spark Users, I have a Spark Streaming program that streams data from Kafka topics and outputs it as Parquet files on HDFS. Now I want to write a unit test for this program to make sure the output data is correct (i.e. not missing any data from Kafka). However, I have no idea about how to do

Re: Unit Test for Spark Streaming

2014-08-04 Thread Tathagata Das
Appropriately timed question! Here is the PR that adds a real unit test for Kafka stream in Spark Streaming. Maybe this will help! https://github.com/apache/spark/pull/1751/files On Mon, Aug 4, 2014 at 6:30 PM, JiajiaJing jj.jing0...@gmail.com wrote: Hello Spark Users, I have a spark

Re: Unit Test for Spark Streaming

2014-08-04 Thread JiajiaJing
This helps a lot!! Thank you very much! Jiajia -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unit-Test-for-Spark-Streaming-tp11394p11396.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming fails - where is the problem?

2014-08-04 Thread Tathagata Das
Are you able to run it locally? If not, can you try creating an all-inclusive jar with all transitive dependencies together (sbt assembly) and then try running the app? Then this will be a self contained environment, which will help us debug better. TD On Mon, Aug 4, 2014 at 5:06 PM, durin

Re: Create a new object by given classtag

2014-08-04 Thread Matei Zaharia
To get the ClassTag object inside your function with the original syntax you used (T: ClassTag), you can do this: def read[T: ClassTag](): T = {   val ct = classTag[T]   ct.runtimeClass.newInstance().asInstanceOf[T] } Passing the ClassTag with : ClassTag lets you have an implicit parameter that

Re: Can't see any thing one the storage panel of application UI

2014-08-04 Thread binbinbin915
Actually, if you don't use a method like persist or cache, the RDD isn't stored at all. Every time you use the RDD, it is just recomputed from the original one. In logistic regression from MLlib, they don't persist the transformed input, so I can't see the RDD in the web GUI. I have
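
In other words, something only appears on the Storage page after it is both marked persistent and actually computed (a minimal sketch; the path is a placeholder):

    val data = sc.textFile("hdfs:///some/input")
    data.cache()      // or data.persist(StorageLevel.MEMORY_AND_DISK)
    data.count()      // forces the computation so cached blocks actually exist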

Visualizing stage task dependency graph

2014-08-04 Thread rpandya
Is there a way to visualize the task dependency graph of an application, during or after its execution? The list of stages on port 4040 is useful, but still quite limited. For example, I've found that if I don't cache() the result of one expensive computation, it will get repeated 4 times, but it

about spark and using machine learning model

2014-08-04 Thread Hoai-Thu Vuong
Hello everybody! I'm getting started with Spark and MLlib. I've successfully built a small cluster and followed the tutorial. However, I would like to ask how to use a model that has been trained by MLlib. I understand that, with data, we can train a model such as a classifier, then

Re: about spark and using machine learning model

2014-08-04 Thread Xiangrui Meng
Some extra work is needed to close the loop. One related example is streaming linear regression added by Jeremy very recently: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala You can use a model trained offline
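
For the simpler "train offline, predict online" case, the shape is roughly as follows (a sketch; `model` is any offline-trained MLlib model with a predict(Vector) method, `dstream` is an assumed DStream[String], and the parsing is illustrative):

    import org.apache.spark.mllib.linalg.Vectors

    val predictions = dstream.map { line =>
      val features = Vectors.dense(line.split(',').map(_.toDouble))
      (features, model.predict(features))   // model trained offline, e.g. with MLlib
    }
    predictions.print()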

Re: Visualizing stage task dependency graph

2014-08-04 Thread Zongheng Yang
I agree that this is definitely useful. One related project I know of is Sparkling [1] (also see talk at Spark Summit 2014 [2]), but it'd be great (and I imagine somewhat challenging) to visualize the *physical execution* graph of a Spark job. [1] http://pr01.uml.edu/ [2]