I need to save the Twitter statuses I receive so that I can do additional batch-based processing on them in the future. Is it safe to assume HDFS is the best way to go?
Any idea what the best way is to save Twitter statuses to HDFS?

    JavaStreamingContext ssc = new JavaStreamingContext(jsc, new Duration(1000));
    Authorization twitterAuth = setupTwitterAuthorization();
    JavaDStream<Status> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query);

The programming guide lists output operations on DStreams:

http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

    saveAsHadoopFiles(prefix, [suffix])
    Save this DStream's contents as Hadoop files. The file name at each batch
    interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
    Python API: This is not available in the Python API.

However, JavaDStream<> does not support any saveAs* functions, so I have to drop down to the underlying Scala DStream:

    DStream<Status> dStream = tweets.dstream();

http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html

DStream<Status> only supports saveAsObjectFiles
<http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html#saveAsObjectFiles(java.lang.String,%20java.lang.String)>
and saveAsTextFiles
<http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html#saveAsTextFiles(java.lang.String,%20java.lang.String)>:

    saveAsTextFiles
    public void saveAsTextFiles(java.lang.String prefix, java.lang.String suffix)
    Save each RDD in this DStream as a text file, using the string representation
    of elements. The file name at each batch interval is generated based on prefix
    and suffix: "prefix-TIME_IN_MS.suffix".

Any idea where I would find these files? I assume they will be spread out all over my cluster?

Also, I wonder whether using the saveAs*() functions is going to cause other problems. My batch duration is set to 1 second. Am I going to overwhelm the system with a bunch of tiny files? Many of them will be empty.

Kind regards

Andy
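P.S. To put my tiny-files worry in numbers, here is a rough back-of-the-envelope sketch. As I understand it, saveAsTextFiles writes one directory per batch interval ("prefix-TIME_IN_MS.suffix") with one part file per partition inside it. The 1-second interval is from my code above; the two partitions per batch are just an assumed example, not something I have measured:

```java
// Rough estimate of how many HDFS files saveAsTextFiles would create.
// Spark writes one directory per batch, containing one part-NNNNN file
// per partition of that batch's RDD, so:
//   files per hour = (3600 / batchIntervalSeconds) * partitionsPerBatch
public class TinyFileEstimate {
    static long filesPerHour(long batchIntervalSeconds, int partitionsPerBatch) {
        return (3600L / batchIntervalSeconds) * partitionsPerBatch;
    }

    public static void main(String[] args) {
        // My 1 s batch interval, with an assumed 2 partitions per RDD:
        System.out.println(filesPerHour(1, 2)); // 7200 files per hour
    }
}
```

So even with only two partitions per batch, a 1-second interval produces thousands of files per hour, many of them empty, which is exactly the kind of small-file load HDFS handles poorly.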