Reading files directly from Amazon S3 can be frustrating, especially if
you're dealing with a large number of input files. Could you elaborate a
little more on your use case? Does the S3 bucket in question already
contain a large number of files?
The implementation of the * wildcard operator in S3 i
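Purely for illustration (the bucket name and path are made up), the kind of
wildcard read being discussed looks like this:

// Every object matching the glob becomes part of the input. With many
// thousands of small objects, just listing the matching keys against the
// S3 API can take longer than the processing itself.
val events = sc.textFile("s3n://my-bucket/events/2014/06/*/*.log.gz")
println(events.count())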
st release TAR.GZ
direct from HDFS, unpack it and launch the appropriate script.
Makes for a much cleaner development / testing / deployment workflow to
package everything required in one go, instead of relying on
cluster-specific classpath additions or any add-jars functionality.
On 19 June 2014 22:53, Mich
When you start seriously using Spark in production there are basically two
things everyone eventually needs:
1. Scheduled Jobs - recurring hourly/daily/weekly jobs.
2. Always-On Jobs - jobs that require monitoring, restarting, etc.
There are lots of ways to implement these requirements, everythin
/TwitterInputDStream.scala>..)
> is how to limit the external service call rate and manage the incoming
> buffer size (enqueuing).
> Could you give me some tips for that?
>
> Thanks again,
> Flavio
>
>
> On Thu, Jun 19, 2014 at 10:19 AM, Michael Cutler
> wrote:
>
ell as failover/retry logic etc.
Best of luck!
MC
*Michael Cutler*
Founder, CTO
*Mobile: +44 789 990 7847 | Email: mich...@tumra.com | Web: tumra.com
<http://tumra.com/?utm_source=signature&utm_medium=email>*
*Visit us at our offices in Chiswick Park <http://goo.gl/maps/abBxq>*
gy.last
case x => old(x)
}
}
You can see the "exclude()" has to go around the spark-streaming-kafka
dependency,
and I've used a MergeStrategy to solve the "deduplicate: different file
contents found in the following" errors.
Build the JAR with sbt assembly.
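In case it helps, a minimal build.sbt sketch along those lines; the versions,
the excluded artifact and the exact merge cases are illustrative guesses
rather than the original file:

import AssemblyKeys._

assemblySettings

libraryDependencies ++= Seq(
  // "provided" keeps Spark itself out of the assembly, the cluster supplies it
  "org.apache.spark" %% "spark-core"      % "1.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided",
  // exclude() stops a second copy of spark-streaming being bundled via the Kafka module
  "org.apache.spark" %% "spark-streaming-kafka" % "1.0.0"
    exclude("org.apache.spark", "spark-streaming_2.10")
)

// resolves the "deduplicate: different file contents found ..." errors
mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case PathList("META-INF", xs @ _*)  => MergeStrategy.discard
    case x if x.endsWith(".properties") => MergeStrategy.last
    case x                              => old(x)
  }
}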
Hello Wei,
I speak from experience, having written many HPC distributed applications
using Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel
Virtual Machine (PVM) way before that, back in the '90s. I can say with
absolute certainty:
*Any gains you believe there are because "C++ is
leverage the precomputed files stored in HDFS. Done right, you should
be able to achieve interactive (few-second) lookups.
Have fun!
MC
Hello,
You're absolutely right: the syntax you're using returns json4s
value objects, not native types like Int, Long, etc. Fix that problem and
then everything else (the filters) will work as you expect. This is a short
snippet of a larger example: [1]
val lines = sc.textFile("likes.json")
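To make that concrete, a rough continuation of the snippet (the field names
and the filter threshold are assumptions, not from the original example):

import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats = DefaultFormats

val likes = lines.map(line => {
  val json = parse(line)
  // (json \ "field") returns a JValue; extract[...] turns it into a native type
  val userId   = (json \ "user_id").extract[Long]
  val numLikes = (json \ "num_likes").extract[Int]
  (userId, numLikes)
})

// With native Ints the filters behave as expected
val active = likes.filter { case (_, numLikes) => numLikes > 10 }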
Hey Nilesh,
Great to hear you're using Spark Streaming. In my opinion the crux of your
question comes down to what you want to do with the data in the future,
and/or whether there is utility in using it from more than one
Spark/Streaming job.
1). *One-time-use, fire and forget* - as you rightly point out,
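As a rough sketch of the two options (the socket source, batch interval and
output prefix below are placeholders, not from the original thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val conf = new SparkConf().setAppName("StreamReuseSketch")
val ssc  = new StreamingContext(conf, Seconds(10))

// Placeholder source; substitute your real input (Kafka, Twitter, Flume, ...)
val events = ssc.socketTextStream("localhost", 9999)

// Option 1: "fire and forget" - aggregate each micro-batch and keep only the result
events.map(line => (line.split(",")(0), 1))
      .reduceByKey(_ + _)
      .print()

// Option 2: persist the raw events so more than one Spark/Streaming job can reuse them
events.saveAsTextFiles("hdfs:///events/raw/batch")

ssc.start()
ssc.awaitTermination()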
eExample.scala
<https://gist.github.com/cotdp/b3512dd1328f10ee9257>
cted when moving
classes around in an IDE like Eclipse.
Best,
Michael
"LIKE", "RLIKE" and "REGEXP" so clearly some of the basics are in there.
As the saying goes ... *"Use the source, Luke!
<http://blog.codinghorror.com/learn-to-read-the-source-luke/>"* :o)
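For instance, a quick hand-rolled sketch of RLIKE in use (the table, columns
and regex are invented for illustration, and the exact table-registration
call depends on your Spark SQL version):

import org.apache.spark.sql.SQLContext

case class Profile(name: String, email: String)

val sqlContext = new SQLContext(sc)
import sqlContext._

val profiles = sc.parallelize(Seq(
  Profile("Ann", "ann@example.com"),
  Profile("Bob", "bob@test.org")
))
profiles.registerTempTable("profiles")

// RLIKE / REGEXP match a Java regular expression, unlike LIKE's % and _ wildcards
val matched = sqlContext.sql(
  "SELECT name FROM profiles WHERE email RLIKE '@example.com$'")
matched.collect().foreach(println)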
,1,114.17202697445208]
["mostly_male",2824,590,1,97.08852691218131]
["mostly_female",1934,590,1,99.0517063081696]
["unisex",2674,590,1,113.42071802543006]
[,11023,590,1,93.45677220357435]
*/
Full working example:
CandyCrushSQL.s
https://issues.apache.org/jira/browse/HADOOP-8900 and it affects all Hadoop
releases prior to 1.2.X
MC
gender")
  // Return a tuple of (gender: String, (level: Int, count: Int))
  ( gender, (level, 1) )
}).filter(a => {
  // Filter out entries with a level of zero
  a._2._1 > 0
}).reduceByKey( (a, b) => {
  // Sum the levels and counts so we can average them later
  (a._1 + b._1, a._2 + b._2)
})