Re: HBase row count

2014-02-25 Thread Soumitra Kumar
Found the issue: the splits in HBase were not uniform, so one job was taking 90% of the time. BTW, is there a way to save the details available on port 4040 after a job is finished? On Tue, Feb 25, 2014 at 7:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: It's tricky really since you may not

Sharing SparkContext

2014-02-25 Thread abhinav chowdary
Hi, I am looking for ways to share the SparkContext, meaning I need to be able to perform multiple operations on the same Spark context. Below is the code of a simple app I am testing: def main(args: Array[String]) { println("Welcome to example application!") val sc = new

Re: Sharing SparkContext

2014-02-25 Thread Mayur Rustagi
The fair scheduler merely reorders tasks. I think he is looking to run multiple pieces of code on a single context on demand from customers... if the code order is decided, then the fair scheduler will ensure that all tasks get equal cluster time :) Mayur Rustagi Ph: +919632149971 h

Re: How to get well-distribute partition

2014-02-25 Thread Mayur Rustagi
Okay, you caught me on this.. I haven't used the Python API. Let's try http://www.cs.berkeley.edu/~pwendell/strataconf/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy on the RDD: supply a custom function as the partitioner instead of the hash. Please update the list if it works; it seems to be a

RE: Size of RDD larger than Size of data on disk

2014-02-25 Thread Suraj Satishkumar Sheth
Hi Mayur, Thanks for replying. Is it usually double the size of the data on disk? I have observed this many times. The Storage section of Spark is telling me that 100% of the RDD is cached, using 97 GB of RAM, while the data in HDFS is only 47 GB. Thanks and Regards, Suraj Sheth From: Mayur Rustagi

Re: Size of RDD larger than Size of data on disk

2014-02-25 Thread Matei Zaharia
The problem is that Java objects can take more space than the underlying data, but there are options in Spark to store data in serialized form to get around this. Take a look at https://spark.incubator.apache.org/docs/latest/tuning.html. Matei On Feb 25, 2014, at 12:01 PM, Suraj Satishkumar
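
For reference, a minimal Scala sketch of the serialized-storage option described above (the path and RDD are hypothetical; see the tuning guide for the full set of storage levels):

    import org.apache.spark.storage.StorageLevel

    // Store the RDD as serialized byte arrays in memory: slower to access,
    // but a much smaller footprint than plain deserialized Java objects.
    val rdd = sc.textFile("hdfs:///some/path")
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    rdd.count()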

Re: How to get well-distribute partition

2014-02-25 Thread Mayur Rustagi
It seems you are already using partitionBy; you can simply plug in your custom function instead of lambda x: x, and it should use that to partition. A range partitioner is available in Scala; I am not sure if it's exposed directly in Python. Regards Mayur Mayur Rustagi Ph: +919632149971 h

Re: Need some tutorials and examples about customized partitioner

2014-02-25 Thread Tao Xiao
Thank you Mayur, I think that will help me a lot. Best, Tao 2014-02-26 8:56 GMT+08:00 Mayur Rustagi mayur.rust...@gmail.com: The type of shuffling is best explained by Matei in Spark Internals: http://www.youtube.com/watch?v=49Hr5xZyTEA#t=2203 Why don't you look at that, then if you have follow

Help with building and running examples with GraphX from the REPL

2014-02-25 Thread Soumya Simanta
I'm not able to run the GraphX examples from the Scala REPL. Can anyone point me to the correct documentation that talks about the configuration and/or how to build GraphX for the REPL? Thanks

Re: [HELP] ask for some information about public data set

2014-02-25 Thread Evan R. Sparks
Hi hyqgod, This is probably a better question for the spark user's list than the dev list (cc'ing user and bcc'ing dev on this reply). To answer your question, though: Amazon's Public Datasets Page is a nice place to start: http://aws.amazon.com/datasets/ - these work well with spark because

Re: Implementing a custom Spark shell

2014-02-26 Thread Matei Zaharia
In Spark 0.9 and master, you can pass the -i argument to spark-shell to load a script containing commands before opening the prompt. This is also a feature of the Scala shell as a whole (try scala -help for details). Also, once you’re in the shell, you can use :load file.scala to execute the

Re: ReduceByKey or groupByKey to Count?

2014-02-26 Thread dmpour23
If I use groupByKey as so... JavaPairRDD<String, List<String>> twos = ones.groupByKey(3).cache(); How would I write the contents of the List of Strings to a file or to Hadoop? Do I need to transform the JavaPairRDD to a JavaRDD and call saveAsTextFile? -- View this message in context:

Build Spark in IntelliJ IDEA 13

2014-02-26 Thread Yanzhe Chen
Hi, all. I'm trying to build Spark in IntelliJ IDEA 13. I cloned the latest repo and ran sbt/sbt gen-idea in the root folder, then imported it into IntelliJ IDEA. The Scala plugin for IntelliJ IDEA has been installed. Everything seems OK until I run Build > Make Project: Information: Using javac

Re: Build Spark in IntelliJ IDEA 13

2014-02-26 Thread Sean Owen
I also use IntelliJ 13 on a Mac, with only Java 7, and have never seen this. If you look at the Spark build, you will see that it specifies Java 6, not 7. Even if you changed java.version in the build, you would not get this error, since it specifies source and target to be the same value. In

Re: Dealing with headers in csv file pyspark

2014-02-26 Thread Chengi Liu
I am not sure.. the suggestion is to open a TB-sized file and remove a line? That doesn't sound that good. I am hacking my way around it by using a filter.. Can I put a try/except clause in my lambda function? Maybe I should just try that out. But thanks for the suggestion. Also, can I run scripts against spark

Actors and sparkcontext actions

2014-02-26 Thread Ognen Duzlevski
Can someone point me to a simple, short code example of creating a basic Actor that gets a context and runs an operation such as .textFile.count? I am trying to figure out how to create just a basic actor that gets a message like this: case class Msg(filename:String, ctx: SparkContext) and
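
A minimal Akka sketch along these lines (names and the file path are hypothetical; Spark 0.9 ships with Akka 2.2.x, and the SparkContext is assumed to be created elsewhere and passed in with the message):

    import akka.actor.{Actor, ActorSystem, Props}
    import org.apache.spark.SparkContext

    case class Msg(filename: String, ctx: SparkContext)

    class CountActor extends Actor {
      def receive = {
        // Run a Spark action on the shared context when a message arrives.
        case Msg(filename, ctx) => println(ctx.textFile(filename).count())
      }
    }

    // Usage sketch:
    //   val system  = ActorSystem("demo")
    //   val counter = system.actorOf(Props[CountActor], "counter")
    //   counter ! Msg("hdfs:///some/file", sc)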

worker keeps getting disassociated upon a failed job spark version 0.90

2014-02-26 Thread Shirish
I am a newbie!! I am running Spark 0.9.0 in standalone mode on my Mac. The master and worker run on the same machine. Both of them start up fine (at least that is what I see in the log). Upon start-up the master log is: 14/02/26 15:38:08 INFO Slf4jLogger: Slf4jLogger started 14/02/26 15:38:08

Re: [incubating-0.9.0] Too Many Open Files on Workers

2014-02-26 Thread Rohit Rai
Hello Andy, This is a problem we have seen when using the CQL Java driver under heavy read loads, where it is using NIO and is waiting on many pending responses, which causes too many open sockets and hence too many open files. Are you by any chance using async queries? I am the maintainer of

Re: Rename filter() into keep(), remove() or take() ?

2014-02-27 Thread Nick Pentreath
Agree that filter is perhaps unintuitive. Though the Scala collections API has filter and filterNot which together provide context that makes it more intuitive. And yes the change could be via added methods that don't break existing API. Still overall I would be -1 on this unless a

Re: Spark app gets slower as it gets executed more times

2014-02-27 Thread Aureliano Buendia
On Fri, Feb 7, 2014 at 7:48 AM, Aaron Davidson ilike...@gmail.com wrote: Sorry for the delay; by long-running I just meant if you were running an iterative algorithm that was slowing down over time. We have observed this in the spark-perf benchmark; as file system state builds up, the job can

Re: Spark streaming on ec2

2014-02-27 Thread Tathagata Das
Yes! Spark Streaming programs are just like any other Spark program, so any EC2 cluster set up using the spark-ec2 scripts can be used to run Spark Streaming programs as well. On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia buendia...@gmail.com wrote: Hi, Does the ec2 support for spark 0.9

Re: Spark streaming on ec2

2014-02-27 Thread Aureliano Buendia
On Thu, Feb 27, 2014 at 6:17 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Yes! Spark Streaming programs are just like any other Spark program, so any EC2 cluster set up using the spark-ec2 scripts can be used to run Spark Streaming programs as well. Great. Does it come with any input

Re: ReduceByKey or groupByKey to Count?

2014-02-27 Thread Mayur Rustagi
sortByKey would be better, I think, as I am not sure groupByKey will sort the keyspace globally. I would say you should take input (K, V), groupByKey to (K, Seq(V..)), partitionBy the default partitioner (hash), then sortByKey on (K, Seq(V..)) and output this; the only thing is, if you need (K, V) pairs you will have to
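
If the goal is simply a count per key, a minimal sketch (the sample data is hypothetical) is reduceByKey, which combines on the map side and shuffles less than groupByKey:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // Count occurrences per key with map-side combining.
    val counts = pairs.reduceByKey(_ + _)                // ("a", 2), ("b", 1)

    // groupByKey gives the same result but ships every value across the network.
    val grouped = pairs.groupByKey().mapValues(_.size)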

Re: Build Spark Against CDH5

2014-02-27 Thread Brian Brunner
Just as a second note, I am able to build the source in the official 0.9.0 release (http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Build-Spark-Against-CDH5-tp2129p2130.html Sent

Running Spark with Python 2.7.5+

2014-02-27 Thread nicholas.chammas
The provided Spark EC2 scripts (https://spark.incubator.apache.org/docs/0.9.0/ec2-scripts.html) and default AMI ship with Python 2.6.8. I would like to use Python 2.7.5 or later. I believe that among the 2.x versions, 2.7 is the most popular. What's the easiest way to get my Spark cluster on Python

Re: Spark streaming on ec2

2014-02-27 Thread Tathagata Das
Yes, the default spark EC2 cluster runs the standalone deploy mode. Since Spark 0.9, the standalone deploy mode allows you to launch the driver app within the cluster itself and automatically restart it if it fails. You can read about launching your app inside the cluster

Re: Build Spark Against CDH5

2014-02-28 Thread Brian Brunner
After successfully building the official 0.9.0 release I attempted to build off of the github code again and was successfully able to do so. Not really sure what happened, but it works now. -- View this message in context:

Re: Spark streaming on ec2

2014-02-28 Thread Aureliano Buendia
Also, in this talk http://www.youtube.com/watch?v=OhpjgaBVUtU on using Spark Streaming in production, the author seems to have missed the topic of how to manage cloud instances. On Fri, Feb 28, 2014 at 6:48 PM, Aureliano Buendia buendia...@gmail.com wrote: What's the updated way of deploying

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Egor Pahomov
Spark 0.9 uses protobuf 2.5.0. Hadoop 2.2 uses protobuf 2.5.0. protobuf 2.5.0 can read messages serialized with protobuf 2.4.1. So there is no reason why you can't read messages from Hadoop 2.2 with protobuf 2.5.0; probably you somehow have 2.4.1 in your classpath. Of course it's very bad,

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Egor Pahomov
In that same pom:

    <profile>
      <id>yarn</id>
      <properties>
        <hadoop.major.version>2</hadoop.major.version>
        <hadoop.version>2.2.0</hadoop.version>
        <protobuf.version>2.5.0</protobuf.version>
      </properties>
      <modules>
        <module>yarn</module>
      </modules>
    </profile>

Spark stream example SimpleZeroMQPublisher high cpu usage

2014-02-28 Thread Aureliano Buendia
Hi, Running: ./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher tcp://127.0.1.1:1234 foo causes over 100% CPU usage on OS X. Given that it's just a simple ZMQ publisher, this shouldn't be expected. Is there something wrong with that example?

Re: Spark streaming on ec2

2014-02-28 Thread Nicholas Chammas
Yeah, the Spark on EMR bootstrap scripts referenced here (http://aws.amazon.com/articles/4926593393724923) need some polishing. I had a lot of trouble just getting through that tutorial. And yes, the version of Spark they're using is 0.8.1. On Fri, Feb 28, 2014 at 2:39 PM, Aureliano Buendia

Connection Refused When Running SparkPi Locally

2014-02-28 Thread Benny Thompson
I'm trying to run a simple execution of the SparkPi example. I started the master and one worker, then executed the job on my local cluster, but ended up getting a sequence of errors all ending with Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection

How to provide a custom Comparator to sortByKey?

2014-02-28 Thread Tao Xiao
I am using Spark 0.9. I have an array of tuples, and I want to sort these tuples using the sortByKey API, as follows, in the Spark shell: val A: Array[(String, String)] = Array(("1", "One"), ("9", "Nine"), ("3", "three"), ("5", "five"), ("4", "four")) val P = sc.parallelize(A) // MyComparator is an example, maybe I have
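
In Spark 0.9 sortByKey relies on the key type being ordered, so one simple workaround (a sketch, not the asker's MyComparator) is to sort on a derived key that already has the ordering you want and then drop it:

    val A: Array[(String, String)] = Array(("1", "One"), ("9", "Nine"), ("3", "three"), ("5", "five"), ("4", "four"))
    val P = sc.parallelize(A)

    // Sort by the numeric value of the string key, then discard the helper key.
    val sorted = P.map { case (k, v) => (k.toInt, (k, v)) }
                  .sortByKey()
                  .map(_._2)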

Re: Incrementally add/remove vertices in GraphX

2014-03-02 Thread psnively
Does this suggest value in an integration of GraphX and neo4j? Sent from my Verizon Wireless Phone - Reply message - From: Matei Zaharia matei.zaha...@gmail.com To: user@spark.apache.org Cc: u...@spark.incubator.apache.org Subject: Incrementally add/remove vertices in GraphX Date: Sun,

Re: flatten RDD[RDD[T]]

2014-03-02 Thread Josh Rosen
Nope, nested RDDs aren't supported: https://groups.google.com/d/msg/spark-users/_Efj40upvx4/DbHCixW7W7kJ https://groups.google.com/d/msg/spark-users/KC1UJEmUeg8/N_qkTJ3nnxMJ https://groups.google.com/d/msg/spark-users/rkVPXAiCiBk/CORV5jyeZpAJ On Sun, Mar 2, 2014 at 5:37 PM, Cosmin Radoi

Help with groupByKey

2014-03-02 Thread David Thomas
I have an RDD of (K, Array[V]) pairs. For example: ((key1, (1,2,3)), (key2, (3,2,4)), (key1, (4,3,2))) How can I do a groupByKey such that I get back an RDD of (K, Array[V]) pairs, e.g. ((key1, (1,2,3,4,3,2)), (key2, (3,2,4)))?
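
One way to get the merged form shown above is to concatenate the arrays per key with reduceByKey rather than groupByKey, which would otherwise return (K, Seq[Array[V]]) and need a flatten step. A minimal sketch, assuming Int values:

    val rdd = sc.parallelize(Seq(
      ("key1", Array(1, 2, 3)),
      ("key2", Array(3, 2, 4)),
      ("key1", Array(4, 3, 2))))

    // Concatenate the per-key arrays as they are combined.
    val merged = rdd.reduceByKey(_ ++ _)   // ("key1", Array(1,2,3,4,3,2)), ("key2", Array(3,2,4))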

Beginners Hadoop question

2014-03-03 Thread goi cto
Hi, I am sorry for the beginner's question but... I have Spark Java code which reads a file (c:\my-input.csv), processes it, and writes an output file (my-output.csv). Now I want to run it on Hadoop in a distributed environment. 1) Should my input file be one big file or separate smaller files? 2) if

Re: Beginners Hadoop question

2014-03-03 Thread Alonso Isidoro Roman
Hi, I am a beginner too, but as I have learned, Hadoop works better with big files, at least 64 MB, 128 MB or even more. I think you need to aggregate all the files into a new big one. Then you must copy it to HDFS using this command: hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE hadoop just

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
If you need quick responses, re-use your Spark context between queries and cache RDDs in memory. On Mar 3, 2014 12:42 AM, polkosity polkos...@gmail.com wrote: Thanks for the advice Mayur. I thought I'd report back on the performance difference... Spark standalone mode has executors processing
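
A minimal sketch of that pattern (the path and names are hypothetical): keep one long-lived SparkContext and cache the RDD that repeated queries hit.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("reuse-context").setMaster("local[2]") // local master just for illustration
    val sc = new SparkContext(conf)

    // Cached once; later queries against the same context skip the HDFS read.
    val events = sc.textFile("hdfs:///events").cache()

    val errors   = events.filter(_.contains("ERROR")).count()
    val warnings = events.filter(_.contains("WARN")).count()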

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Sandy Ryza
Are you running in yarn-standalone mode or yarn-client mode? Also, what YARN scheduler and what NodeManager heartbeat? On Sun, Mar 2, 2014 at 9:41 PM, polkosity polkos...@gmail.com wrote: Thanks for the advice Mayur. I thought I'd report back on the performance difference... Spark

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Andrew Ash
polkosity, have you seen the job server that Ooyala open sourced? I think it's very similar to what you're proposing with a REST API and re-using a SparkContext. https://github.com/apache/incubator-spark/pull/222 http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server On Mon, Mar

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Mayur Rustagi
+1 Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Mar 3, 2014 at 4:10 PM, polkosity polkos...@gmail.com wrote: Thats exciting! Will be looking into that, thanks Andrew. Related topic, has anyone had any

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
Yes, Tachyon is in-memory serialized, which is not as fast as cached in memory in Spark (not serialized). The difference really depends on your job type. On Mon, Mar 3, 2014 at 7:10 PM, polkosity polkos...@gmail.com wrote: Thats exciting! Will be looking into that, thanks Andrew. Related

Re: o.a.s.u.Vector instances for equality

2014-03-03 Thread Shixiong Zhu
Vector is an enhanced Array[Double]. You can compare it like Array[Double]. E.g.,

    scala> val v1 = Vector(1.0, 2.0)
    v1: org.apache.spark.util.Vector = (1.0, 2.0)
    scala> val v2 = Vector(1.0, 2.0)
    v2: org.apache.spark.util.Vector = (1.0, 2.0)
    scala> val exactResult =
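
The original transcript is truncated above; one way to do the element-wise comparison it is leading up to (a sketch, not the exact code from the reply) is:

    // Vector exposes its backing array, so compare element by element.
    val exactResult = v1.elements.sameElements(v2.elements)   // true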

Shuffle Files

2014-03-03 Thread Usman Ghani
Where on the filesystem does spark write the shuffle files?

Re: Shuffle Files

2014-03-04 Thread Aniket Mokashi
From the BlockManager code + ShuffleMapTask code, it writes under spark.local.dir or java.io.tmpdir: val diskBlockManager = new DiskBlockManager(shuffleBlockManager, conf.get("spark.local.dir", System.getProperty("java.io.tmpdir"))) On Mon, Mar 3, 2014 at 10:45 PM, Usman Ghani us...@platfora.com
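
So to control where shuffle files land, point spark.local.dir at your preferred disks when building the context. A minimal sketch (the paths are hypothetical; multiple directories are comma-separated):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-dir-example")
      .setMaster("local[2]")                                    // local master just for illustration
      .set("spark.local.dir", "/mnt/spark-local,/mnt2/spark-local")
    val sc = new SparkContext(conf)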

RE: Connection Refused When Running SparkPi Locally

2014-03-04 Thread Li, Rui
I've encountered similar problems. Maybe you can try using the hostname or FQDN (rather than the IP address) of your node for the master URI. In my case, Akka picks the FQDN for the master URI and the worker has to use exactly the same string for the connection. From: Benny Thompson [mailto:ben.d.tho...@gmail.com]

RE: Actors and sparkcontext actions

2014-03-04 Thread Suraj Satishkumar Sheth
Hi Ognen, See if this helps. I was working on this:

    class MyClass[T](sc: SparkContext, flag1: Boolean, rdd: RDD[T], hdfsPath: String) extends Actor {
      def act() {
        if (flag1) this.process() else this.count
      }
      private def process() {
        println(sc.textFile(hdfsPath).count)

Re: Problem with delete spark temp dir on spark 0.8.1

2014-03-04 Thread Akhil Das
Hi, Try to clean your temp dir, System.getProperty("java.io.tmpdir"). Also, can you paste a longer stacktrace? Thanks Best Regards On Tue, Mar 4, 2014 at 2:55 PM, goi cto goi@gmail.com wrote: Hi, I am running a spark java program on a local machine. When I try to write the output to

Re: Problem with delete spark temp dir on spark 0.8.1

2014-03-04 Thread goi cto
Exception in thread "delete Spark temp dir C:\Users\..." java.io.IOException: failed to delete: C:\Users\...\simple-project-1.0.jar
    at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:495)
    at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:491)
I deleted my

RDD Manipulation in Scala.

2014-03-04 Thread trottdw
Hello, I am using Spark with Scala and I am attempting to understand the different filtering and mapping capabilities available. I haven't found an example of the specific task I would like to do. I am trying to read in a tab-separated text file and filter specific entries. I would like this

Re: RDD Manipulation in Scala.

2014-03-04 Thread trottdw
Thanks Sean, I think that is doing what I needed. It was much simpler than what I had been attempting. Is it possible to do an OR-statement filter, so that, for example, column 2 can be filtered by A2 appearances and column 3 by A4? -- View this message in context:

Re: Actors and sparkcontext actions

2014-03-04 Thread Debasish Das
Hi Ognen, Any particular reason for choosing Scalatra over options like Play or Spray? Is Scalatra much better at serving APIs, or is it due to its similarity with Ruby's Sinatra? Did you try the other options and then pick Scalatra? Thanks. Deb On Tue, Mar 4, 2014 at 4:50 AM, Ognen Duzlevski

Re: o.a.s.u.Vector instances for equality

2014-03-04 Thread Oleksandr Olgashko
Thanks. Does it make sense to add an ==/equals method for Vector with this (or the same) behavior? 2014-03-04 6:00 GMT+02:00 Shixiong Zhu zsxw...@gmail.com: Vector is an enhanced Array[Double]. You can compare it like Array[Double]. E.g., scala> val v1 = Vector(1.0, 2.0) v1:

Re: Actors and sparkcontext actions

2014-03-04 Thread Ognen Duzlevski
Deb, On 3/4/14, 9:02 AM, Debasish Das wrote: Hi Ognen, Any particular reason for choosing Scalatra over options like Play or Spray? Is Scalatra much better at serving APIs, or is it due to its similarity with Ruby's Sinatra? Did you try the other options and then pick Scalatra? Not really.

Re: Missing Spark URL after staring the master

2014-03-04 Thread Bin Wang
Hi Mayur, I am using CDH4.6.0p0.26. And the latest Cloudera Spark parcel is Spark 0.9.0 CDH4.6.0p0.50. As I mentioned, somehow, the Cloudera Spark version doesn't contain the run-example shell scripts.. However, it is automatically configured and it is pretty easy to set up across the cluster...

Spark Streaming Maven Build

2014-03-04 Thread Bin Wang
Hi there, I tried the Kafka WordCount example; it works perfectly and the code is pretty straightforward to understand. Can anyone show me how to start my own Maven project with the KafkaWordCount example with minimum effort? 1. What should the pom file look like (including the jar-plugin?

Word Count on Mesos Cluster

2014-03-05 Thread juanpedromoreno
Hi there, I tried the SimpleApp WordCount example and it works perfectly in a local environment. My code:

    object SimpleApp {
      def main(args: Array[String]) {
        val logFile = "README.md"
        val conf = new SparkConf()
          .setMaster("zk://172.31.0.11:2181/mesos")
          .setAppName("Simple App")

Problem with HBase external table on freshly created EMR cluster

2014-03-05 Thread Philip Limbeck
Hi! I created an EMR cluster with Spark and HBase according to http://aws.amazon.com/articles/4926593393724923 with --hbase flag to include HBase. Although spark and shark both work nicely with the provided S3 examples, there is a problem with external tables pointing to the HBase instance. We

Re: Explain About Logs NetworkWordcount.scala

2014-03-05 Thread eduardocalfaia
Hi TD, I have seen in the web UI a stage whose result has been zero, and there is nothing in the GC Time field. http://apache-spark-user-list.1001560.n3.nabble.com/file/n2306/CaptureStage.png -- View this message in context:

Re: Unable to redirect Spark logs to slf4j

2014-03-05 Thread Sergey Parhomenko
Hi Sean, We're not using log4j actually; we're trying to redirect all logging to slf4j, which then uses logback as the logging implementation. The fix you mentioned - am I right to assume it is not part of the latest released Spark version (0.9.0)? If so, are there any workarounds or advice on

Re: Spark Worker crashing and Master not seeing recovered worker

2014-03-05 Thread Ognen Duzlevski
Rob, I have seen this too. I have 16 nodes in my spark cluster and for some reason (after app failures) one of the workers will go offline. I will ssh to the machine in question and find that the java process is running but for some reason the master is not noticing this. I have not had the

Re: pyspark and Python virtual enviroments

2014-03-05 Thread Bryn Keller
Hi Christian, The PYSPARK_PYTHON environment variable specifies the python executable to use for pyspark. You can put the path to a virtualenv's python executable and it will work fine. Remember you have to have the same installation at the same path on each of your cluster nodes for pyspark to

Re: disconnected from cluster; reconnecting gives java.net.BindException

2014-03-05 Thread Nicholas Chammas
Whoopdeedoo, after just waiting for like an hour (well, I was doing other stuff) the process holding that address seems to have died automatically and now I can start up pyspark without any warnings. Would there be a faster way to go through this than just waiting around for the orphaned process to

Re: pyspark and Python virtual enviroments

2014-03-05 Thread Christian
Thanks Bryn. On Wed, Mar 5, 2014 at 9:00 PM, Bryn Keller xol...@xoltar.org wrote: Hi Christian, The PYSPARK_PYTHON environment variable specifies the python executable to use for pyspark. You can put the path to a virtualenv's python executable and it will work fine. Remember you have to

Re: Unable to redirect Spark logs to slf4j

2014-03-05 Thread Sergey Parhomenko
Hi Patrick, Thanks for the patch. I tried building a patched version of spark-core_2.10-0.9.0-incubating.jar but the Maven build fails:

    [ERROR] /home/das/Work/thx/incubator-spark/core/src/main/scala/org/apache/spark/Logging.scala:22: object impl is not a member of package org.slf4j
    [ERROR]

Re: PIG to SPARK

2014-03-05 Thread Mayur Rustagi
The real question is why you want to run a Pig script using Spark. Are you planning to use Spark as the underlying processing engine for Pig? That's not simple. Are you planning to feed Pig data to Spark for further processing? Then you can write it to HDFS and trigger your Spark script. rdd.pipe is

Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
I also noticed that jobs (with a new JobGroupId) which I run after this, and which use the same RDDs, get very confused. I see lots of cancelled stages and retries that go on forever. On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers ko...@tresata.com wrote: I have a running job that I cancel while

Re: trying to understand job cancellation

2014-03-05 Thread Mayur Rustagi
How do you cancel the job? Which API do you use? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers ko...@tresata.com wrote: I also noticed that jobs (with a new JobGroupId) which

Running spark 0.9 on mesos 0.15

2014-03-05 Thread elyast
Hi, Quick question: do I need to compile Spark against exactly the same version of the Mesos library? Currently Spark depends on 0.13. The problem I am facing is the following: I am running the MLlib example with SVM and it works nicely when I use coarse-grained mode, however when running fine-grained mode on

Re: trying to understand job cancellation

2014-03-05 Thread Mayur Rustagi
One issue is that job cancellation is posted on the event loop, so it's possible that subsequent jobs submitted to the job queue may beat the job cancellation event, and hence the job cancellation event may end up closing them too. So there's definitely a race condition you are risking, even if you're not running into it.

Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
Got it. Seems like I'd better stay away from this feature for now.. On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: One issue is that job cancellation is posted on the event loop, so it's possible that subsequent jobs submitted to the job queue may beat the job cancellation

Re: Unable to redirect Spark logs to slf4j

2014-03-05 Thread Patrick Wendell
Hey, Maybe I don't understand the slf4j model completely, but I think you need to add a concrete implementation of a logger. So in your case you'd use the logback-classic binding in place of the log4j binding at compile time: http://mvnrepository.com/artifact/ch.qos.logback/logback-classic/1.1.1 -
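
For an sbt build, a minimal sketch of that wiring (the versions and the exclusion are assumptions; adjust them to whatever slf4j bindings your dependency tree actually pulls in):

    // Drop Spark's log4j binding and provide logback-classic as the concrete slf4j backend.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating" exclude("org.slf4j", "slf4j-log4j12")
    libraryDependencies += "ch.qos.logback" % "logback-classic" % "1.1.1"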

Re: Implementing a custom Spark shell

2014-03-06 Thread Sampo Niskanen
Hi, I've tried to enable debug logging, but can't figure out what might be going wrong. Can anyone assist in deciphering the log? The log of the startup and run attempts is at http://pastebin.com/XyeY92VF This uses SparkILoop, DEBUG level logging and the settings.debug.value = true option. Line 323:

Re: Kryo serialization does not compress

2014-03-06 Thread pradeeps8
We are trying to use Kryo serialization, but with Kryo serialization ON the memory consumption does not change. We have tried this on multiple sets of data. We have also checked the logs of Kryo serialization and have confirmed that Kryo is being used. Can somebody please help us with this? The

Re: disconnected from cluster; reconnecting gives java.net.BindException

2014-03-06 Thread Nicholas Chammas
So this happened again today. As I noted before, the Spark shell starts up fine after I reconnect to the cluster, but this time around I tried opening a file and doing some processing. I get this message over and over (and can't do anything): 14/03/06 15:43:09 WARN scheduler.TaskSchedulerImpl:

Building spark with native library support

2014-03-06 Thread Alan Burlison
Hi, I've successfully built 0.9.0-incubating on Solaris using sbt, following the instructions at http://spark.incubator.apache.org/docs/latest/ and it seems to work OK. However, when I start it up I get an error about missing Hadoop native libraries. I can't find any mention of how to build

Re: PIG to SPARK

2014-03-06 Thread suman bharadwaj
Thanks Mayur. I don't have a clear idea of how pipe works and wanted to understand more about it. But when do we use pipe() and how does it work? Can you please share some sample code if you have any (even pseudo-code is fine)? It will really help. Regards, Suman Bharadwaj S On Thu, Mar 6, 2014 at 3:46 AM,
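
For reference, a minimal sketch of rdd.pipe (the awk command is just an illustration; any executable that reads stdin and writes stdout works): each partition's elements are fed to the external command one per line, and the command's output lines come back as a new RDD[String].

    val nums = sc.parallelize(1 to 10, 2)

    // Each element is written to the command's stdin; its stdout becomes the result RDD.
    val doubled = nums.pipe(Seq("awk", "{ print $1 * 2 }"))
    doubled.collect().foreach(println)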

Re: Building spark with native library support

2014-03-06 Thread Matei Zaharia
Is it an error, or just a warning? In any case, you need to get those libraries from a build of Hadoop for your platform. Then add them to the SPARK_LIBRARY_PATH environment variable in conf/spark-env.sh, or to your -Djava.library.path if launching an application separately. These libraries

RE: Building spark with native library support

2014-03-06 Thread Jeyaraj, Arockia R (Arockia)
Hi, I am trying to set up Spark in Windows for a development environment. I get the following error when I run sbt. Please help me resolve this issue. I work for Verizon and am on my company network, and can't access the internet without a proxy. C:\Users> sbt Getting org.fusesource.jansi jansi 1.11 ...

Access SBT with proxy

2014-03-06 Thread Mayur Rustagi
export JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=yourserver -Dhttp.proxyPort=8080 -Dhttp.proxyUser=username -Dhttp.proxyPassword=password" Also, please use a separate thread for different questions. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

RE: Access SBT with proxy

2014-03-06 Thread Jeyaraj, Arockia R (Arockia)
Thanks Alan. I am very new to Spark. I am trying to set up a Spark development environment in Windows. I added the below-mentioned export as a set command in the sbt.bat file and tried it; it was not working. Where will I see .gitconfig? set JAVA_OPTS=%JAVA_OPTS% -Dhttp.proxyHost=myservername -Dhttp.proxyPort=8080

Re: major Spark performance problem

2014-03-06 Thread Christopher Nguyen
Dana, When you run multiple applications under Spark, and each application takes up the entire cluster resources, it is expected that one will block the other completely; thus you're seeing the wall times add together sequentially. In addition there is some overhead associated with

Pig on Spark

2014-03-06 Thread Sameer Tilak
Hi everyone, We are using Pig to build our data pipeline. I came across Spork -- Pig on Spark -- at https://github.com/dvryaboy/pig and am not sure if it is still active. Can someone please let me know the status of Spork or any other effort that will let us run Pig on Spark? We can

Re: Pig on Spark

2014-03-06 Thread Tom Graves
I had asked a similar question on the dev mailing list a while back (Jan 22nd).  See the archives:  http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser - look for spork. Basically Matei said: Yup, that was it, though I believe people at Twitter picked it up again recently.

Re: Pig on Spark

2014-03-06 Thread Aniket Mokashi
There is some work to make this work on yarn at https://github.com/aniket486/pig. (So, compile pig with ant -Dhadoopversion=23) You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to find out what sort of env variables you need (sorry, I haven't been able to clean this up-

Re: Building spark with native library support

2014-03-06 Thread Alan Burlison
On 06/03/2014 18:55, Matei Zaharia wrote: For the native libraries, you can use an existing Hadoop build and just put them on the path. For linking to Hadoop, Spark grabs it through Maven, but you can do mvn install locally on your version of Hadoop to install it to your local Maven cache, and

RE: Pig on Spark

2014-03-06 Thread Sameer Tilak
Hi Aniket, Many thanks! I will check this out. Date: Thu, 6 Mar 2014 13:46:50 -0800 Subject: Re: Pig on Spark From: aniket...@gmail.com To: user@spark.apache.org; tgraves...@yahoo.com There is some work to make this work on yarn at https://github.com/aniket486/pig. (So, compile pig with ant

Re: Job aborted: Spark cluster looks down

2014-03-06 Thread Mayur Rustagi
Can you see the Spark web UI? Is it running? (It would run on masterurl:8080.) If so, what is the master URL shown there? MASTER=spark://URL:PORT ./bin/spark-shell should work. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On

Re: NoSuchMethodError - Akka - Props

2014-03-06 Thread Deepak Nulu
I see the same error. I am trying a standalone example integrated into a Play Framework v2.2.2 application. The error occurs when I try to create a Spark Streaming Context. Compilation succeeds, so I am guessing it has to do with the version of Akka getting picked up at runtime. -- View this

Re: NoSuchMethodError - Akka - Props

2014-03-06 Thread Tathagata Das
Are you launching your application using the scala or the java command? The scala command brings in a version of Akka that we have found to cause conflicts with Spark's version of Akka, so it's best to launch using java. TD On Thu, Mar 6, 2014 at 3:45 PM, Deepak Nulu deepakn...@gmail.com wrote: I see the

Re: NoSuchMethodError - Akka - Props

2014-03-06 Thread Deepak Nulu
I was just able to fix this in my environment. By looking at the repository/cache in my Play Framework installation, I was able to determine that spark-0.9.0-incubating uses Akka version 2.2.3. Similarly, looking at repository/local revealed that Play Framework 2.2.2 ships with Akka version

Re: Python 2.7 + numpy break sortByKey()

2014-03-06 Thread Patrick Wendell
The difference between your two jobs is that take() is optimized and only runs on the machine where you are using the shell, whereas sortByKey requires using many machines. It seems like maybe python didn't get upgraded correctly on one of the slaves. I would look in the /root/spark/work/ folder

Re: NoSuchMethodError in KafkaReciever

2014-03-06 Thread Tathagata Das
I don't have an Eclipse setup, so I am not sure what is going on here. I would try to use Maven on the command line with a pom to see if this compiles. Also, try to clean up your system Maven cache. Who knows if it had pulled in a wrong version of Kafka 0.8 and was using it all the time. Blowing away the

Re: need someone to help clear some questions.

2014-03-06 Thread qingyang li
Many thanks for the guidance. 2014-03-06 23:39 GMT+08:00 Yana Kadiyska yana.kadiy...@gmail.com: Hi qingyang, 1. You do not need to install Shark on every node. 2. Not really sure.. it's just a warning, so I'd see if it works despite it. 3. You need to provide the actual hdfs path, e.g.

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-06 Thread polkosity
We're not using Ooyala's job server. We are holding the Spark context for reuse within our own REST server (with a service to run each job). Our low-latency job now reads all its data from a memory-cached RDD, instead of from an HDFS seq file (upstream jobs cache resultant RDDs for downstream jobs

Running actions in loops

2014-03-06 Thread Ognen Duzlevski
Hello, What is the general approach people take when trying to do analysis across multiple large files where the data to be extracted from a successive file depends on the data extracted from a previous file or set of files? For example: I have the following: a group of HDFS files each

Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-06 Thread Mayur Rustagi
Would you be the best person in the world to share some code? It's a pretty common problem. On Mar 6, 2014 6:36 PM, polkosity polkos...@gmail.com wrote: We're not using Ooyala's job server. We are holding the spark context for reuse within our own REST server (with a service to run each job).
