I need to save the Twitter statuses I receive so that I can do additional batch-based processing on them in the future. Is it safe to assume HDFS is the best way to go?
Any idea what the best way is to save Twitter statuses to HDFS?

    JavaStreamingContext ssc = new JavaStreamingContext(jsc, new Duration(1000));
    Authorization twitterAuth = setupTwitterAuthorization();
    JavaDStream<Status> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query);

The programming guide lists output operations on DStreams:

http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

    saveAsHadoopFiles(prefix, [suffix])
    Save this DStream's contents as Hadoop files. The file name at each batch
    interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
    Python API: This is not available in the Python API.

However, JavaDStream<> does not support any saveAs* functions, so I have to drop down to the underlying Scala DStream:

    DStream<Status> dStream = tweets.dstream();

http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html

DStream<Status> only supports saveAsObjectFiles
<http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html#saveAsObjectFiles(java.lang.String,%20java.lang.String)>
and saveAsTextFiles
<http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html#saveAsTextFiles(java.lang.String,%20java.lang.String)>:

    saveAsTextFiles
    public void saveAsTextFiles(java.lang.String prefix, java.lang.String suffix)
    Save each RDD in this DStream as a text file, using the string representation
    of elements. The file name at each batch interval is generated based on prefix
    and suffix: "prefix-TIME_IN_MS.suffix".

Any idea where I would find these files? I assume they will be spread out all over my cluster?

Also, I wonder whether using the saveAs*() functions is going to cause other problems. My batch duration is set to 1 second. Am I going to overwhelm the system with a bunch of tiny files? Many of them will be empty.

Kind regards

Andy
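P.S. To put my tiny-files worry in numbers, here is a rough back-of-the-envelope sketch. As I understand it, saveAsTextFiles writes one directory per batch interval ("prefix-TIME_IN_MS.suffix") with one part file per partition inside it. The 1-second interval is from my code above; the two partitions per batch are just an assumed example, not something I have measured:

```java
// Rough estimate of how many HDFS files saveAsTextFiles would create.
// Spark writes one directory per batch, containing one part-NNNNN file
// per partition of that batch's RDD, so:
//   files per hour = (3600 / batchIntervalSeconds) * partitionsPerBatch
public class TinyFileEstimate {
    static long filesPerHour(long batchIntervalSeconds, int partitionsPerBatch) {
        return (3600L / batchIntervalSeconds) * partitionsPerBatch;
    }

    public static void main(String[] args) {
        // My 1 s batch interval, with an assumed 2 partitions per RDD:
        System.out.println(filesPerHour(1, 2)); // 7200 files per hour
    }
}
```

So even with only two partitions per batch, a 1-second interval produces thousands of files per hour, many of them empty, which is exactly the kind of small-file load HDFS handles poorly.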