Re: streaming.twitter.TwitterUtils what is the best way to save twitter status to HDFS?

2015-11-01 Thread Akhil Das
You can use .saveAsObjectFiles("hdfs://sigmoid/twitter/status/") since you
want to store the Status objects. For every batch it will create a directory
under /status (the name is usually the batch timestamp), and since the data
is small (hardly a couple of MBs for a 1 sec interval) it will not overwhelm
the cluster.
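
Something along these lines should work. This is just a rough, untested
sketch that reuses jsc, twitterAuth, query and your TwitterFilterQueryUtils
helper, and assumes the helper returns a JavaDStream of twitter4j Status
objects; the hdfs:// path is only an example:

import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import twitter4j.Status;

JavaStreamingContext ssc = new JavaStreamingContext(jsc, new Duration(1000));
JavaDStream<Status> tweets =
    TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query);

// JavaDStream has no saveAs* methods, but the underlying DStream does.
// Each batch becomes a new directory under /status named with the batch
// time in ms, containing one part file per partition of that batch's RDD.
tweets.dstream().saveAsObjectFiles("hdfs://sigmoid/twitter/status/", "");

ssc.start();
ssc.awaitTermination();

Since the prefix points at HDFS, the output ends up in HDFS (one directory
per batch), not spread over the local disks of your worker nodes.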

Thanks
Best Regards

On Sat, Oct 24, 2015 at 7:05 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> I need to save the twitter statuses I receive so that I can do additional
> batch-based processing on them in the future. Is it safe to assume HDFS is
> the best way to go?
>
> Any idea what is the best way to save twitter status to HDFS?
>
> JavaStreamingContext ssc = new JavaStreamingContext(jsc, new
> Duration(1000));
>
> Authorization twitterAuth = setupTwitterAuthorization();
>
> JavaDStream tweets = TwitterFilterQueryUtils.createStream(
> ssc, twitterAuth, query);
>
>
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
>
>
>
> saveAsHadoopFiles(prefix, [suffix]): Save this DStream's contents as
> Hadoop files. The file name at each batch interval is generated based on
> prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
> Python API: This is not available in the Python API.
>
> However, JavaDStream<> does not support any saveAs* functions
>
>
> DStream dStream = tweets.dstream();
>
>
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html
>
> DStream only supports saveAsObjectFiles() and saveAsTextFiles().
>
>
> saveAsTextFiles
>
> public void saveAsTextFiles(java.lang.String prefix,
>                             java.lang.String suffix)
>
> Save each RDD in this DStream as a text file, using the string representation
> of elements. The file name at each batch interval is generated based on
> prefix and suffix: "prefix-TIME_IN_MS.suffix".
>
>
> Any idea where I would find these files? I assume they will be spread out
> all over my cluster?
>
>
> Also I wonder if using the saveAs*() functions is going to cause other
> problems. My duration is set to 1 sec. Am I going to overwhelm the system
> with a bunch of tiny files? Many of them will be empty.
>
>
> Kind regards
>
>
> Andy
>


streaming.twitter.TwitterUtils what is the best way to save twitter status to HDFS?

2015-10-23 Thread Andy Davidson
I need to save the twitter statuses I receive so that I can do additional
batch-based processing on them in the future. Is it safe to assume HDFS is
the best way to go?

Any idea what is the best way to save twitter status to HDFS?

JavaStreamingContext ssc = new JavaStreamingContext(jsc, new
Duration(1000));

Authorization twitterAuth = setupTwitterAuthorization();

JavaDStream tweets =
TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query);



http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams



saveAsHadoopFiles(prefix, [suffix]): Save this DStream's contents as Hadoop
files. The file name at each batch interval is generated based on prefix and
suffix: "prefix-TIME_IN_MS[.suffix]".
Python API: This is not available in the Python API.


However, JavaDStream<> does not support any saveAs* functions



DStream dStream = tweets.dstream();


http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/dstream/DStream.html
DStream only supports saveAsObjectFiles() and saveAsTextFiles().


saveAsTextFiles

public void saveAsTextFiles(java.lang.String prefix,
                            java.lang.String suffix)

Save each RDD in this DStream as a text file, using the string representation
of elements. The file name at each batch interval is generated based on
prefix and suffix: "prefix-TIME_IN_MS.suffix".
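
For what it is worth, this is roughly what I am experimenting with. It is
untested, the hdfs:// prefix is just a placeholder, and it assumes the
stream carries twitter4j Status objects:

// Reuse the underlying DStream obtained above via tweets.dstream().
// Each 1 sec batch should be written to a directory named
// "<prefix>-<TIME_IN_MS>.txt", using the toString() of each Status.
dStream.saveAsTextFiles("hdfs:///user/andy/twitter/status", "txt");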


Any idea where I would find these files? I assume they will be spread out
all over my cluster?


Also I wonder if using the saveAs*() functions is going to cause other
problems. My duration is set to 1 sec. Am I going to overwhelm the system
with a bunch of tiny files? Many of them will be empty.



Kind regards



Andy