I have an HDFS directory that contains many files:

/user/root/1.txt
/user/root/2.txt
/user/root/3.txt
/user/root/4.txt
A daemon process adds one file per minute to this directory (e.g., 5.txt, 6.txt, 7.txt, ...). I want to start a Spark Streaming job that first loads 3.txt and 4.txt and then detects every new file that arrives after 4.txt.

Please note that these files are large, so processing them takes a long time. If I process 3.txt and 4.txt before launching the streaming job, 5.txt and 6.txt may be written to the directory while 3.txt and 4.txt are still being processed. When the streaming job then starts, 5.txt and 6.txt will be missed, because the job only processes files that appear after it starts (i.e., from 7.txt onward).

I'm not sure whether I have described the problem clearly; if you have any questions, please ask me.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-load-some-of-the-files-in-a-dir-and-monitor-new-file-in-that-dir-in-spark-streaming-without-m-tp22841.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
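For reference, one direction I have been looking at (a minimal sketch, not a confirmed solution) is to let the streaming job itself pick up the pre-existing files instead of processing 3.txt and 4.txt separately, which removes the window in which 5.txt and 6.txt could be missed. `StreamingContext.fileStream` takes a `newFilesOnly` flag and a path filter; with `newFilesOnly = false` the stream also considers files already present in the directory, subject to a "remember" window. The filter names (`skip`), the directory path, and the exact config key for the remember window are my assumptions and may differ across Spark versions:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirMonitor {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("dir-monitor")
      // How far back fileStream "remembers" pre-existing files when
      // newFilesOnly = false; the config key name varies by Spark version,
      // so treat this as an assumption to verify against your release.
      .set("spark.streaming.fileStream.minRememberDuration", "600s")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Files that must NOT be reprocessed (already handled earlier).
    // Everything else, including the pre-existing 3.txt and 4.txt,
    // remains eligible for the stream.
    val skip = Set("1.txt", "2.txt")
    val pathFilter = (p: Path) => !skip.contains(p.getName)

    // newFilesOnly = false: the first batches include files already in the
    // directory (within the remember window), so 3.txt/4.txt are loaded by
    // the same job that then picks up 5.txt, 6.txt, ... as they arrive.
    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat](
        "hdfs:///user/root", pathFilter, newFilesOnly = false)
      .map(_._2.toString)

    lines.foreachRDD { rdd =>
      // process each batch of lines from newly detected files
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The key idea is that a single process owns both the backlog and the new arrivals, so there is no hand-off gap; whether the remember window is long enough to cover 3.txt and 4.txt depends on how long they have been sitting in the directory.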