You can use .saveAsObjectFiles("hdfs://sigmoid/twitter/status/") since you want to store the Status objects. For every batch it will create a directory under /status (the name will mostly be the timestamp), and since the data is small (hardly a couple of MBs for a 1 second interval) it will not overwhelm the cluster.
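A rough sketch of what that could look like with the stream from your snippet below (JavaDStream does not expose the saveAs* methods, so this drops down to the underlying Scala DStream via tweets.dstream(); the empty suffix is passed explicitly because Scala's default argument is not visible from Java):

import org.apache.spark.streaming.api.java.JavaDStream;
import twitter4j.Status;

// tweets is the JavaDStream<Status> from your TwitterFilterQueryUtils helper
JavaDStream<Status> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query);

// JavaDStream has no saveAs* methods, but the wrapped Scala DStream does.
// Each batch is written as one object-file directory, named from the
// prefix plus the batch time in ms ("prefix-TIME_IN_MS").
tweets.dstream().saveAsObjectFiles("hdfs://sigmoid/twitter/status/", "");

ssc.start();
ssc.awaitTermination();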
Thanks
Best Regards

On Sat, Oct 24, 2015 at 7:05 AM, Andy Davidson <a...@santacruzintegration.com> wrote:

> I need to save the twitter status I receive so that I can do additional
> batch based processing on them in the future. Is it safe to assume HDFS is
> the best way to go?
>
> Any idea what is the best way to save twitter status to HDFS?
>
> JavaStreamingContext ssc = new JavaStreamingContext(jsc, new Duration(1000));
>
> Authorization twitterAuth = setupTwitterAuthorization();
>
> JavaDStream<Status> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query);
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
>
> saveAsHadoopFiles(prefix, [suffix]): Save this DStream's contents as
> Hadoop files. The file name at each batch interval is generated based on
> prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
> Python API: This is not available in the Python API.
>
> However, JavaDStream<> does not support any saveAs* functions.
>
> DStream<Status> dStream = tweets.dstream();
>
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html
>
> DStream<Status> only supports saveAsObjectFiles
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html#saveAsObjectFiles(java.lang.String,%20java.lang.String)>
> and saveAsTextFiles
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html#saveAsTextFiles(java.lang.String,%20java.lang.String)>.
>
> saveAsTextFiles
>
> public void saveAsTextFiles(java.lang.String prefix,
>                             java.lang.String suffix)
>
> Save each RDD in this DStream as a text file, using the string representation
> of elements. The file name at each batch interval is generated based on
> prefix and suffix: "prefix-TIME_IN_MS.suffix".
>
> Any idea where I would find these files? I assume they will be spread out
> all over my cluster?
>
> Also I wonder if using the saveAs*() functions is going to cause other
> problems. My duration is set to 1 sec. Am I going to overwhelm the system
> with a bunch of tiny files? Many of them will be empty.
>
> Kind regards
>
> Andy
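For the later batch processing, a sketch of reading the saved Status objects back (assuming the prefix above, so one per-batch directory per interval, and that jsc is the JavaSparkContext from your snippet):

import org.apache.spark.api.java.JavaRDD;

// Read all per-batch object-file directories back into a single RDD.
// The glob matches the "prefix-TIME_IN_MS" directories that saveAsObjectFiles writes.
JavaRDD<Status> statuses = jsc.objectFile("hdfs://sigmoid/twitter/status/*");
System.out.println("saved tweets: " + statuses.count());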