We recently started a Spark/Spark Streaming POC. We wrote a simple streaming
app in Java to collect tweets. We chose Twitter because we knew we would get
a lot of data and probably lots of bursts. Good for stress testing

We spun up a couple of small clusters using the spark-ec2 script. In one
cluster we wrote all the tweets to HDFS; in a second cluster we wrote all the
tweets to S3

We were surprised that our HDFS file system reached 100% of capacity in a
few days. This resulted in "all data nodes dead". We were surprised
because the streaming app itself continued to run. We had no idea we had a
problem until a day or two after the disk became full, when we noticed we
were missing a lot of data.

We ran into a similar problem with our S3 cluster. We had a permission
problem and were unable to write any data, yet our streaming app continued
to run
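
In case it helps frame the question: one way to make these silent write
failures visible would be to wrap the output action in a try/catch inside
foreachRDD and stop the context on error. A minimal sketch only, assuming a
JavaStreamingContext named jssc and a tweet DStream named tweets (both
hypothetical names, and the HDFS path is a placeholder):

```java
// Sketch: surface write failures instead of letting the app run on silently.
// Assumes: JavaStreamingContext jssc; JavaDStream<String> tweets.
tweets.foreachRDD(rdd -> {
    try {
        // Placeholder output path; a failed write here throws an exception
        rdd.saveAsTextFile("hdfs://namenode:8020/tweets/" + System.currentTimeMillis());
    } catch (Exception e) {
        // Without a handler like this, the receiver can keep pulling tweets
        // while every batch write fails (full disk, bad S3 permissions, etc.)
        System.err.println("Tweet write failed: " + e);
        jssc.stop(true, false); // stop streaming so the failure is obvious
        throw e;
    }
});
```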


Spark generated mountains of logs. We are using the standalone cluster
manager, and all the log levels wind up in the "error" log, making it hard to
find real errors and warnings using the web UI. Our app is written in Java,
so my guess is the write errors must be unchecked exceptions. I.e., we did
not know in advance that they could occur. They are basically undocumented.



We are a small shop. Running something like Splunk would add a lot of
expense and complexity for us at this stage of our growth.

What are the best practices?

Kind Regards

Andy
