Hi

I just started a new Spark Streaming project. In this phase of the system
all we want to do is save the data we receive to HDFS. After running for a
couple of days it looks like I am missing a lot of data. I wonder if
saveAsTextFile("hdfs:///rawSteamingData"); is overwriting the data I
captured in previous windows? I noticed that after running for a couple of
days my HDFS file system has 25 files, with names like "part-00006". I used
'hadoop fs -dus' to check the total data captured. While the system was
running I would periodically call 'dus', and I was surprised that sometimes
the total number of bytes actually dropped.


Is there a better way to write my data to disk?
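
For reference, here is the kind of change I was considering: writing each
batch to its own time-stamped directory so that no batch can clobber an
earlier one. This is only an untested sketch against the Spark 1.x Java API,
using the foreachRDD overload that also passes in the batch Time (the
"batch-" directory prefix is just a name I made up):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.streaming.Time;

    // Untested sketch: save each batch under a unique path derived from the
    // batch time, e.g. hdfs:///rawSteamingData/batch-1417405800000, so a
    // later batch never overwrites an earlier one.
    data.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
        @Override
        public Void call(JavaRDD<String> jsonStr, Time time) throws Exception {
            jsonStr.saveAsTextFile("hdfs:///rawSteamingData/batch-"
                    + time.milliseconds());
            return null;
        }
    });

I also see that the Scala DStream API has saveAsTextFiles(prefix, suffix),
which looks like it writes one directory per batch, but I have not tried
calling it from Java.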

Any suggestions would be appreciated

Andy


   public static void main(String[] args) {
       SparkConf conf = new SparkConf().setAppName(appName);
       JavaSparkContext jsc = new JavaSparkContext(conf);
       // 5-second batch interval
       JavaStreamingContext ssc =
               new JavaStreamingContext(jsc, new Duration(5 * 1000));

       [ deleted code …]

       data.foreachRDD(new Function<JavaRDD<String>, Void>() {
           private static final long serialVersionUID = -7957854392903581284L;

           @Override
           public Void call(JavaRDD<String> jsonStr) throws Exception {
               // /rawSteamingData is a directory; the same path is reused
               // for every batch
               jsonStr.saveAsTextFile("hdfs:///rawSteamingData");
               return null;
           }
       });

       ssc.checkpoint(checkPointUri);

       ssc.start();
       ssc.awaitTermination();
   }
