Re: Spark task hangs infinitely when accessing S3 from AWS

2015-11-12 Thread Michael Cutler
Reading files directly from Amazon S3 can be frustrating, especially if you're dealing with a large number of input files. Could you please elaborate on your use case? Does the S3 bucket in question already contain a large number of files? The implementation of the * wildcard operator in S3 i
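A common workaround when the bucket already holds many objects is to enumerate the keys yourself (via the AWS SDK or a manifest file) and pass explicit paths to textFile() instead of a large glob. A minimal sketch; the bucket name, prefix and key list below are placeholders, not from the original thread:

    import org.apache.spark.{SparkConf, SparkContext}

    object S3ExplicitPaths {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("S3ExplicitPaths"))

        // Keys enumerated up-front rather than discovered through a "*" glob against S3
        val keys = Seq(
          "logs/2015/11/01/part-00000.gz",
          "logs/2015/11/01/part-00001.gz"
        )

        // textFile() accepts a comma-separated list of paths
        val lines = sc.textFile(keys.map(k => s"s3n://my-bucket/$k").mkString(","))
        println(lines.count())

        sc.stop()
      }
    }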

Re: How do you run your spark app?

2014-06-19 Thread Michael Cutler
st release TAR.GZ direct from HDFS, unpack it and launch the appropriate script. It makes for a much cleaner development / testing / deployment cycle to package everything required in one go, instead of relying on cluster-specific classpath additions or any add-jars functionality. On 19 June 2014 22:53, Mich

Re: How do you run your spark app?

2014-06-19 Thread Michael Cutler
When you start seriously using Spark in production, there are basically two things everyone eventually needs: 1. Scheduled Jobs - recurring hourly/daily/weekly jobs. 2. Always-On Jobs - jobs that require monitoring, restarting, etc. There are lots of ways to implement these requirements, everythin

Re: Spark streaming and rate limit

2014-06-19 Thread Michael Cutler
/TwitterInputDStream.scala>..) > is how to limit the external service call rate and manage the incoming > buffer size (enqueuing). > Could you give me some tips for that? > > Thanks again, > Flavio > > > On Thu, Jun 19, 2014 at 10:19 AM, Michael Cutler > wrote:

Re: Spark streaming and rate limit

2014-06-19 Thread Michael Cutler
ell as failover/retry logic etc. Best of luck! MC
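The archived preview cuts off the details, but the usual shape of a rate-limited ingest in Spark Streaming is a custom receiver that throttles its own calls to the external service and hands records to Spark's buffering via store(). A rough sketch; ThrottledReceiver and callExternalService are illustrative names rather than anything from the original reply:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class ThrottledReceiver(requestsPerSecond: Int)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

      def onStart(): Unit = {
        new Thread("throttled-poller") {
          override def run(): Unit = {
            while (!isStopped()) {
              store(callExternalService())            // Spark buffers whatever store() receives
              Thread.sleep(1000L / requestsPerSecond) // crude client-side rate limit
            }
          }
        }.start()
      }

      def onStop(): Unit = {} // the polling thread exits once isStopped() returns true

      // Placeholder for the real HTTP/API call; retry/failover logic would live here
      private def callExternalService(): String = "{}"
    }

It would be wired in with something like ssc.receiverStream(new ThrottledReceiver(5)).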

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Michael Cutler
gy.last case x => old(x) } } You can see the "exclude()" has to go around the spark-streaming-kafka dependency, and I've used a MergeStrategy to solve the "deduplicate: different file contents found in the following" errors. Build the JAR with sbt ass
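Pieced together from the fragment above, the build definition being described is roughly the following sbt-assembly (0.11-era) syntax; the library versions and merge rules here are assumptions for illustration, not the poster's exact file:

    import AssemblyKeys._

    assemblySettings

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided",
      ("org.apache.spark" %% "spark-streaming-kafka" % "1.0.0")
        .exclude("org.apache.spark", "spark-core_2.10") // keep the provided Spark classes out of the fat JAR
    )

    mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
      {
        case PathList("META-INF", xs @ _*) => MergeStrategy.discard
        case "plugin.properties"           => MergeStrategy.last // resolves the "different file contents" duplicates
        case x                             => old(x)
      }
    }

Build the fat JAR with sbt assembly and submit that single artifact.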

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Michael Cutler
Hello Wei, I speak from the experience of writing many HPC distributed applications using Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel Virtual Machine (PVM) long before that, back in the '90s. I can say with absolute certainty: *Any gains you believe there are because "C++ is

Re: Using Spark to crack passwords

2014-06-12 Thread Michael Cutler
rage the precomputed files stored in HDFS. Done right, you should be able to achieve interactive (few-second) lookups. Have fun! MC
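The "precomputed files in HDFS" idea would look roughly like this in practice; the file name, tab-separated format and sample MD5 are illustrative, and the real table would be produced by an earlier Spark job:

    import org.apache.spark.SparkContext._ // pair-RDD functions such as lookup()

    // Precomputed "hash<TAB>plaintext" pairs generated earlier and stored in HDFS
    val table = sc.textFile("hdfs:///tables/md5-dictionary.tsv")
      .map(_.split("\t"))
      .map(parts => (parts(0), parts(1)))
      .cache() // keep in memory so repeated lookups stay interactive

    // md5("password") = 5f4dcc3b5aa765d61d8327deb882cf99
    val matches = table.lookup("5f4dcc3b5aa765d61d8327deb882cf99")
    println(matches.headOption.getOrElse("not found"))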

Re: json parsing with json4s

2014-06-11 Thread Michael Cutler
Hello, You're absolutely right: the syntax you're using returns the json4s value objects, not native types like Int, Long etc. Fix that problem and then everything else (the filters) will work as you expect. This is a short snippet of a larger example: [1] val lines = sc.textFile("likes.jso
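Condensed, the fix amounts to calling extract[...] so you're filtering on native Scala types rather than JValue wrappers. A minimal sketch; the field names "name" and "likes" are assumptions about the data rather than details from the thread:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    val lines = sc.textFile("likes.json")

    val likes = lines.map { line =>
      implicit val formats = DefaultFormats // needed by extract[...]
      val json = parse(line)
      // extract[...] converts json4s values into native types
      ((json \ "name").extract[String], (json \ "likes").extract[Int])
    }

    // The filter now behaves as expected because the count is a plain Int
    val popular = likes.filter { case (_, count) => count > 100 }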

Re: Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Michael Cutler
Hey Nilesh, Great to hear you're using Spark Streaming. In my opinion the crux of your question comes down to what you want to do with the data in the future and/or whether there is utility in using it from more than one Spark/Streaming job. 1). *One-time-use fire and forget *- as you rightly point out,
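If the data does have value beyond a single job, the straightforward pattern is to persist each micro-batch to HDFS so any later Spark job can re-read it. A small sketch; the host, port and output prefix are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))

    // Hypothetical TCP source feeding the stream
    val events = ssc.socketTextStream("collector-host", 9999)

    // Writes one output directory per batch interval, re-readable by any other job
    events.saveAsTextFiles("hdfs:///events/batch")

    ssc.start()
    ssc.awaitTermination()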

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Michael Cutler
eExample.scala <https://gist.github.com/cotdp/b3512dd1328f10ee9257>
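The linked gist isn't reproduced in the archive, but a common way to get overwrite behaviour in Spark 1.0 is to remove the existing output directory with the Hadoop FileSystem API before calling saveAsTextFile. A hedged sketch; the output path and RDD name are placeholders:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val outputPath = new Path("hdfs:///output/results")
    val fs = FileSystem.get(sc.hadoopConfiguration)

    // Recursively delete a previous run's output so saveAsTextFile can succeed
    if (fs.exists(outputPath)) fs.delete(outputPath, true)

    myRdd.saveAsTextFile(outputPath.toString)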

Re: Spark Job Server first steps

2014-05-22 Thread Michael Cutler
cted when moving classes around in an IDE like Eclipse. Best, Michael

Re: Using Spark to analyze complex JSON

2014-05-22 Thread Michael Cutler
KE", "RLIKE" and "REGEXP" so clearly some of the basics are in there. As the saying goes ... *"Use the source, Luke! <http://blog.codinghorror.com/learn-to-read-the-source-luke/>"* :o) ᐧ *Michael Cutler* Founder, CTO *Mobile: +44 789 990 784

Re: Using Spark to analyze complex JSON

2014-05-21 Thread Michael Cutler
,1,114.17202697445208]
["mostly_male",2824,590,1,97.08852691218131]
["mostly_female",1934,590,1,99.0517063081696]
["unisex",2674,590,1,113.42071802543006]
[,11023,590,1,93.45677220357435]
*/ Full working example: CandyCrushSQL.s

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Michael Cutler
https://issues.apache.org/jira/browse/HADOOP-8900 and it affects all Hadoop releases prior to 1.2.X MC

Re: facebook data mining with Spark

2014-05-20 Thread Michael Cutler
gender") // Return a Tuple of RDD[gender: String, (level: Int, count: Int)] ( gender, (level, 1) ) }).filter(a => { // Filter out entries with a level of zero a._2._1 > 0 }).reduceByKey( (a, b) => { // Sum the levels and counts so we can ave