Re: java.lang.OutOfMemoryError while running Shark on Mesos

2014-05-23 Thread Akhil Das
Hi Prabeesh, Do an export _JAVA_OPTIONS=-Xmx10g before starting Shark. You can also do a ps aux | grep shark to see how much memory it has been allocated; most likely it will be 512 MB, in which case increase the limit. Thanks Best Regards On Fri, May 23, 2014 at 10:22 AM, prabeesh k

Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Mayur Rustagi
We have an internal patched version of the Spark web UI which exports application-related data as JSON. We use monitoring systems as well as an alternate UI on that JSON data for our specific application. Found it much cleaner. Can provide a 0.9.1 version. Will submit it as a pull request soon. Mayur

Re: Unsubscribe

2014-05-23 Thread James Jones
Unsubscribe James Jones Acquisition Editor [ Packt Publishing ] Tel: 0121 265 6486 Web: www.packtpub.com Linkedin: uk.linkedin.com/pub/james-jones/52/3b9/596/ Twitter: @_James_Jones_ Packt Publishing Limited Registered Office: Livery Place, 35 Livery Street, Birmingham, West Midlands,

Re: Error while launching ec2 spark cluster with HVM (r3.large)

2014-05-23 Thread Mayur Rustagi
I am not sure if the EC2 script was updated for R3. R3 doesn't provide a formatted instance store and also requires a newer version of the AMI. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, May 23, 2014

Re: controlling the time in spark-streaming

2014-05-23 Thread Mayur Rustagi
Well, it's hard to use text data as the time of input. But if you are adamant, here's what you would do: have a DStream object which works on a folder using fileStream/textFileStream, then have another process (Spark Streaming or cron) read through the files you receive and push them into the folder in order
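A minimal sketch of the folder-watching side of this approach (the directory path and the 10-second batch interval are illustrative assumptions, not from the thread):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: a DStream over a staging folder; an external process (cron or
// another job) is assumed to drop files into it in timestamp order.
val sc = new SparkContext("local[2]", "replay")
val ssc = new StreamingContext(sc, Seconds(10)) // illustrative batch interval
val lines = ssc.textFileStream("hdfs:///data/staging") // hypothetical path
lines.print()
ssc.start()
ssc.awaitTermination()
```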

Re: how to set task number?

2014-05-23 Thread Mayur Rustagi
How many cores do you see on your Spark master (port 8080)? By default a Spark application should take all cores when you launch it, unless you have set the max cores configuration. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi
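For reference, a minimal sketch of the max cores setting Mayur mentions (the property name spark.cores.max is standard Spark configuration; the app name and value are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: cap the cores an application grabs on a standalone cluster.
// Omit this setting to let the app take all available cores (the default).
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.cores.max", "8") // illustrative value
val sc = new SparkContext(conf)
```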

Re: Spark on HBase vs. Spark on HDFS

2014-05-23 Thread Mayur Rustagi
Also, I am unsure if Spark on HBase leverages locality. When you cache and process data, do you see NODE_LOCAL tasks in the process list? Spark on HDFS leverages locality quite well and can really boost performance by 3-4x in my experience. If you are loading all your data from HBase to Spark then you are

Re: Shark resilience to unusable slaves

2014-05-23 Thread Praveen R
You might use bin/shark-withdebug to find the exact cause of the failure. That said, the easiest way to get the cluster running is to get rid of the dysfunctional machine from the Spark cluster (remove it from the slaves file). Hope that helps. On Thu, May 22, 2014 at 9:04 PM, Yana Kadiyska

Re: Spark Performace Comparison Spark on YARN vs Spark Standalone

2014-05-23 Thread Otávio Carvalho
I have been analyzing Storm performance and there's no significant overhead added to the processing nodes. I'm interested in those results over Spark as well. Thanks in advance, Otávio Carvalho. Undergrad. CompSci Student at UFRGS Porto Alegre, Brazil. 2014-05-20 18:46 GMT-03:00

Re: Spark Streaming using Flume body size limitation

2014-05-23 Thread lemieud
Hi, I think I found the problem. In SparkFlumeEvent the readExternal method uses in.read(bodyBuff), which reads the first 1020 bytes but no more. The code should make sure to read everything. The following change will fix the problem: in.read(bodyBuff) to: in.readFully(bodyBuff) I attached a
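To illustrate why this fix matters, a self-contained sketch; the chunked stream class is invented for the demo, with its 1020-byte limit mirroring the behaviour described above:

```scala
import java.io.{ByteArrayInputStream, DataInputStream}

// Demo: read() may return fewer bytes than the buffer holds,
// while readFully() blocks until the whole buffer is filled.
object ReadVsReadFully extends App {
  val data = Array.fill[Byte](2048)(1)
  // A stream that delivers at most 1020 bytes per read() call.
  val chunked = new ByteArrayInputStream(data) {
    override def read(b: Array[Byte], off: Int, len: Int): Int =
      super.read(b, off, math.min(len, 1020))
  }
  val in = new DataInputStream(chunked)
  val buf = new Array[Byte](2048)
  val n = in.read(buf) // stops at 1020 bytes -- the truncation bug
  println(s"read() returned $n of ${buf.length} bytes")
  in.readFully(buf, n, buf.length - n) // drains the remaining bytes
  println("readFully() filled the rest of the buffer")
}
```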

RE: How to turn off MetadataCleaner?

2014-05-23 Thread Adrian Mocanu
Hi TD, I use 0.9.1. Thanks for letting me know. This issue drove me up the wall. I even made a method to close all that I could think of: def stopSpark(ssc: StreamingContext) = { ssc.sparkContext.cleanup(500) ssc.sparkContext.clearFiles() ssc.sparkContext.clearJars()
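For context, a hedged sketch of the setting that drives the MetadataCleaner in the 0.9.x line, as far as I can tell (worth verifying against the 0.9.1 source; the value is illustrative):

```scala
import org.apache.spark.SparkConf

// Sketch: the MetadataCleaner's wake-up interval comes from spark.cleaner.ttl;
// leaving it unset should disable the periodic cleanup, while setting it
// enables cleanup with that TTL in seconds.
val conf = new SparkConf().set("spark.cleaner.ttl", "3600") // illustrative
```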

Re: Spark Streaming using Flume body size limitation

2014-05-23 Thread David Lemieux
For some reason the patch did not make it. Trying via email: /D On May 23, 2014, at 9:52 AM, lemieud david.lemi...@radialpoint.com wrote: Hi, I think I found the problem. In SparkFlumeEvent the readExternal method use in.read(bodyBuff) which read the first 1020 bytes, but no more. The

Shuffle file consolidation

2014-05-23 Thread Nathan Kronenfeld
In trying to sort some largish datasets, we came across the spark.shuffle.consolidateFiles property, and I found in the source code that it is set, by default, to false, with a note to default it to true when the feature is stable. Does anyone know what is unstable about this? If we set it true,
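For anyone wanting to try it, a minimal sketch of flipping the flag (the property name is from the thread; the app name is an assumption):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: enable shuffle file consolidation (default false in 0.9.x).
val conf = new SparkConf()
  .setAppName("shuffle-consolidation-test")
  .set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)
```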

Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Pierre Borckmans
That would be great, Mayur, thanks! Anyhow, to be more specific, my question really was the following: Is there any way to link events in the SparkListener to an action triggered in your code? Cheers Pierre Borckmans Software team RealImpact Analytics | Brussels Office

Re: Shuffle file consolidation

2014-05-23 Thread Han JU
Hi Nathan, There's some explanation in the spark configuration section: ``` If set to true, consolidates intermediate files created during a shuffle. Creating fewer files can improve filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to true

Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Chester At Yahoo
Sounds like just what we need. For Hadoop we have a progress bar to show the current status of the job. We'd like to do the same for Spark. The YARN client only shows the percentage progress; it doesn't show any text info. Does your PR work for YARN mode? Chester Sent from my iPhone On May 23,

Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Otávio Carvalho
Mayur, I'm interested in it as well. Can you send it to me? Cheers, Otávio Carvalho. Undergrad. Student at Federal University of Rio Grande do Sul Porto Alegre, Brazil. 2014-05-23 11:00 GMT-03:00 Pierre Borckmans pierre.borckm...@realimpactanalytics.com: That would be great, Mayur, thanks!

Unsubscribe

2014-05-23 Thread Prasanta Bose
Unsubscribe

Re: Comprehensive Port Configuration reference?

2014-05-23 Thread Andrew Ash
Hi everyone, I've also been interested in better understanding what ports are used where and the direction the network connections go. I've observed a running cluster and read through code, and came up with the below documentation addition. https://github.com/apache/spark/pull/856 Scott and

Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Pierre B
I’ve been looking at how this is implemented in the UI: https://github.com/apache/spark/blob/branch-0.9/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala 1/ it’s easy to get the RDD name at the Stage events level 2/ the tricky part is that at the task level, we cannot link
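As a rough illustration of the listener hooks under discussion, a minimal sketch against the 0.9-era SparkListener API; the counting logic and names are mine, and linking tasks back to a specific action is exactly the open problem in this thread:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

// Sketch: a listener that counts finished tasks as a crude progress signal.
class ProgressListener extends SparkListener {
  @volatile private var tasksDone = 0
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    tasksDone += 1
    println(s"tasks completed so far: $tasksDone")
  }
  override def onStageCompleted(stage: SparkListenerStageCompleted) {
    println("a stage finished")
  }
}

// Registration (per the 0.9 SparkContext API):
// sc.addSparkListener(new ProgressListener)
```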

Re: Computing cosine similiarity using pyspark

2014-05-23 Thread Andrew Ash
Hi Jamal, I don't believe there are pre-written algorithms for cosine similarity or Pearson correlation in PySpark that you can re-use. If you end up writing your own implementation of the algorithm though, the project would definitely appreciate it if you shared that code back with the project for
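If it helps anyone rolling their own, a minimal sketch of the cosine formula (in Scala here, though it translates directly to PySpark; the function is illustrative, not library code):

```scala
// Sketch: cosine similarity = dot(a, b) / (|a| * |b|)
def cosine(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot = (a zip b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}
```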

Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Philip Ogren
Hi Pierre, I asked a similar question on this list about 6 weeks ago. Here is one answer I got that is of particular note: http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccamjob8n3foaxd-dc5j57-n1oocwxefcg5chljwnut7qnreq...@mail.gmail.com%3E In the upcoming release of

Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Pierre B
Thanks Philip, I don’t want to go the JobLogger way (too hacky ;) ). In version 1.0, if I’m not mistaken, you can even do what I’m asking for, since they removed the “private” for TaskInfo and such and replaced it with the “@DeveloperApi” annotation. I was looking for a simple way to do this

credential in UserGroupInformation

2014-05-23 Thread Hollyen Edison
Hi, I have viewed the code about UGI in Spark. If Spark interacts with a Kerberos-secured HDFS, Spark will apply for a delegation token on the scheduler side and store it as a credential in the UGI; the credential will then be transferred to the Spark executors so that they can authenticate to HDFS. My question is
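For readers following along, a rough sketch of the standard Hadoop UGI calls involved (how Spark itself wires this up is exactly what is being asked, so this only shows the Hadoop-side API):

```scala
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// Sketch: on the scheduler side, delegation tokens live in the current
// UGI's Credentials; an executor can merge received tokens the same way.
val ugi: UserGroupInformation = UserGroupInformation.getCurrentUser
val creds: Credentials = ugi.getCredentials // holds the delegation tokens
// ugi.addCredentials(receivedCreds) // merge tokens shipped from the driver
```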

Re: Computing cosine similiarity using pyspark

2014-05-23 Thread Andrei
Do you need cosine distance and correlation between vectors or between variables (elements of vector)? It would be helpful if you could tell us details of your task. On Thu, May 22, 2014 at 5:49 PM, jamal sasha jamalsha...@gmail.com wrote: Hi, I have bunch of vectors like

Re: Spark Streaming using Flume body size limitation

2014-05-23 Thread David Lemieux
Created https://issues.apache.org/jira/browse/SPARK-1916 I'll submit a pull request soon. /D On May 23, 2014, at 9:56 AM, David Lemieux david.lemi...@radialpoint.com wrote: For some reason the patch did not make it. Trying via email: /D On May 23, 2014, at 9:52 AM, lemieud

Re: Unable to run a Standalone job

2014-05-23 Thread Jacek Laskowski
Hi Shrikar, How did you build Spark 1.0.0-SNAPSHOT on your machine? My understanding is that `sbt publishLocal` is not enough and you really need `sbt assembly` instead. Give it a try and report back. As to your build.sbt, upgrade Scala to 2.10.4 and org.apache.spark %% spark-streaming %
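A minimal sketch of the build.sbt changes being suggested (the snapshot version is inferred from the thread and may need adjusting):

```scala
// build.sbt sketch, per Jacek's suggestion:
scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0-SNAPSHOT"
```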

Setting spark.akka.frameSize

2014-05-23 Thread MattSills
Hi all, Configuration: Standalone 0.9.1-cdh4 cluster, 7 workers per node, 32 GB per worker. I'm running a job on a Spark cluster, and running into some strange behavior. After a while, the Akka frame sizes exceed 10 MB, and then the whole job seizes up. I set spark.akka.frameSize to 128 in the
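A minimal sketch of the setting in question (0.9.x interprets spark.akka.frameSize in MB, and it typically needs to be visible to both the driver and the workers; the app name is an assumption):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: raise the Akka frame size (value is in MB in 0.9.x).
val conf = new SparkConf()
  .setAppName("large-results")
  .set("spark.akka.frameSize", "128")
val sc = new SparkContext(conf)
```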

Invalid Class Exception

2014-05-23 Thread Suman Somasundar
Hi, I get the following exception when using Spark to run various programs. java.io.InvalidClassException: org.apache.spark.SerializableWritable; local class incompatible: stream classdesc serialVersionUID = 6301214776158303468, local class serialVersionUID = -7785455416944904980 at

Re: Using Spark to analyze complex JSON

2014-05-23 Thread Nicholas Chammas
Michael, What an excellent example! Thank you for posting such a detailed explanation and sample code. So I see what you’re doing and it looks like it works very well as long as your source data has a well-known and fixed structure. I’m looking for a pattern that can be used to expose JSON data

Re: Unable to run a Standalone job

2014-05-23 Thread Shrikar archak
Still the same error, no change. Thanks, Shrikar On Fri, May 23, 2014 at 2:38 PM, Jacek Laskowski ja...@japila.pl wrote: Hi Shrikar, How did you build Spark 1.0.0-SNAPSHOT on your machine? My understanding is that `sbt publishLocal` is not enough and you really need `sbt assembly` instead.

Trying to run Spark on Yarn

2014-05-23 Thread zsterone
I'm running into an authentication issue when running against YARN. I am using my own method to create the JAR assembly file and most likely I am missing something. This method used to work, but I recently ran into this problem. Here is the error from the YARN server: 14/05/23 19:03:02 INFO

Sources for kafka-0.7.2-spark

2014-05-23 Thread Stephen Boesch
We are using a back version of Spark (0.8.1) that depends on a customized version of Kafka, 0.7.2-spark. Where are the sources for it - either svn/github or simply the sources .jar? For reference here is the maven repo location for the binaries:

Seattle Spark Meetup: xPatterns Slides and @pacoid session next week!

2014-05-23 Thread Denny Lee
For those who were not able to attend the last Seattle Spark Meetup, we had a great session by Claudiu Barbura on xPatterns on Spark, Shark, Tachyon, and Mesos - you can find the slides at: http://www.slideshare.net/ClaudiuBarbura/seattle-spark-meetup-may-2014. As well, check out the next

Working with Avro Generic Records in the interactive scala shell

2014-05-23 Thread Jeremy Lewi
Hi Spark Users, I'm trying to read and process an Avro dataset using the interactive spark scala shell. When my pipeline executes I get the ClassNotFoundException pasted at the end of this email. I'm trying to use the Generic Avro API (not the Specific API). Here's a gist of the commands I'm
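For reference, a sketch of one common way to load GenericRecords in the shell (the Avro input-format classes come from avro-mapred; the path and field name are illustrative assumptions, and the serializer setup that often causes shell classpath errors is omitted):

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Sketch: read an Avro file as GenericRecords via the new Hadoop API
// (sc is the SparkContext provided by the shell).
val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
    AvroKeyInputFormat[GenericRecord]]("hdfs:///data/part-00000.avro")
// "someField" is a hypothetical field name in the Avro schema.
val fields = records.map { case (key, _) => key.datum().get("someField") }
```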

Re: Spark GCE Script

2014-05-23 Thread Aureliano Buendia
On Fri, May 16, 2014 at 11:19 AM, Akhil Das ak...@sigmoidanalytics.comwrote: Hi I have sent a pull request https://github.com/apache/spark/pull/681 you can verify it and add it :) Matei, Would you please verify this pull request for Jenkins? It has been a couple of weeks. Thanks Best