Hi,

You need good monitoring tools to send you alarms about disk, network, or application errors, but I think that is general DevOps work, not very specific to Spark or Hadoop.
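For a small shop, even a cron-driven check that alerts when disk usage crosses a threshold goes a long way. A minimal sketch in Python (the threshold value and the "alert" being just a printed message are my placeholders, not anything Spark-specific):

```python
import shutil

def usage_percent(used_bytes, capacity_bytes):
    """Return disk usage as a percentage of capacity."""
    return 100.0 * used_bytes / capacity_bytes

def check_disk(used_bytes, capacity_bytes, threshold=85.0):
    """Return an alert message when usage crosses the threshold, else None."""
    pct = usage_percent(used_bytes, capacity_bytes)
    if pct >= threshold:
        return "ALERT: disk %.1f%% full (threshold %.1f%%)" % (pct, threshold)
    return None

if __name__ == "__main__":
    # Check the root filesystem; run this from cron on each node.
    total, used, _free = shutil.disk_usage("/")
    msg = check_disk(used, total)
    if msg:
        print(msg)  # in practice: send an email or page someone
```

The same pattern works for HDFS capacity if you parse the output of `hdfs dfsadmin -report` instead of calling `shutil.disk_usage`.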
BR,

Arkadiusz Bicz
https://www.linkedin.com/in/arkadiuszbicz

On Thu, Feb 11, 2016 at 7:09 PM, Andy Davidson <a...@santacruzintegration.com> wrote:
> We recently started a Spark/Spark Streaming POC. We wrote a simple streaming
> app in Java to collect tweets. We chose Twitter because we knew we would get
> a lot of data and probably lots of bursts. Good for stress testing.
>
> We spun up a couple of small clusters using the spark-ec2 script. In one
> cluster we wrote all the tweets to HDFS; in a second cluster we wrote all
> the tweets to S3.
>
> We were surprised that our HDFS file system reached 100% of capacity in a
> few days. This resulted in “all data nodes dead”. We were surprised because
> the streaming app actually continued to run. We had no idea we had a problem
> until a day or two after the disk became full, when we noticed we were
> missing a lot of data.
>
> We ran into a similar problem with our S3 cluster. We had a permission
> problem and were unable to write any data, yet our streaming app continued
> to run.
>
> Spark generated mountains of logs. We are using the standalone cluster
> manager. All the log levels wind up in the “error” log, making it hard to
> find real errors and warnings using the web UI. Our app is written in Java,
> so my guess is the write errors must be unchecked exceptions, i.e. we did
> not know in advance that they could occur. They are basically undocumented.
>
> We are a small shop. Running something like Splunk would add a lot of
> expense and complexity for us at this stage of our growth.
>
> What are best practices?
>
> Kind Regards
>
> Andy

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org