I have an HDFS dir that contains many files:

/user/root/1.txt
/user/root/2.txt
/user/root/3.txt
/user/root/4.txt


and there is a daemon process that adds one file per minute to this dir
(e.g., 5.txt, 6.txt, 7.txt, ...).

I want to start a Spark Streaming job that first loads 3.txt and 4.txt and
then detects all new files added after 4.txt.
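
The obvious starting point seems to be fileStream with newFilesOnly =
false. Here is a rough, untested sketch of what I have in mind (the
numeric-name filter and the app name are just illustrations of my setup):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LoadOldAndWatchNew {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("LoadOldAndWatchNew"), Seconds(60))

    // Only files numbered >= 3 interest me (3.txt, 4.txt, 5.txt, ...).
    def wanted(p: Path): Boolean = {
      val stem = p.getName.stripSuffix(".txt")
      stem.nonEmpty && stem.forall(_.isDigit) && stem.toInt >= 3
    }

    // newFilesOnly = false should also pick up files already sitting in
    // the dir, not only files that appear after the job starts -- but as
    // far as I can tell it only reaches back over the stream's remember
    // window, so old files like 3.txt may still be skipped.
    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat](
        "hdfs:///user/root", wanted _, newFilesOnly = false)
      .map(_._2.toString)

    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}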

Please note that these files are large, so processing them takes a long
time. If I process 3.txt and 4.txt in a separate batch step before
launching the streaming job, 5.txt and 6.txt may be produced into this dir
while 3.txt and 4.txt are still being processed. When the streaming job
then starts, it will only pick up files that arrive after it starts (i.e.,
from 7.txt onward), so 5.txt and 6.txt will be missed.
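
One workaround I'm considering: snapshot the dir listing inside the same
application, start the stream with a filter that accepts anything not in
that snapshot, and only then backfill 3.txt and 4.txt as plain batch reads
on the shared SparkContext. Another rough, untested sketch (names are made
up; I'm also assuming the remember window,
spark.streaming.fileStream.minRememberDuration, covers files that land
between the snapshot and ssc.start()):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SnapshotThenStream {
  def main(args: Array[String]): Unit = {
    val dir = "hdfs:///user/root"
    val ssc = new StreamingContext(
      new SparkConf().setAppName("SnapshotThenStream"), Seconds(60))
    val sc = ssc.sparkContext

    // 1) Snapshot the listing *before* any processing starts.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val preExisting =
      fs.listStatus(new Path(dir)).map(_.getPath.getName).toSet

    // 2) Wire up the stream first: accept any file NOT in the snapshot.
    //    5.txt/6.txt created during the backfill below are not in the
    //    snapshot, so they should still be caught here.
    val newLines = ssc
      .fileStream[LongWritable, Text, TextInputFormat](
        dir, (p: Path) => !preExisting(p.getName), newFilesOnly = false)
      .map(_._2.toString)
    newLines.count().print()
    ssc.start()

    // 3) Backfill the old files I care about as ordinary batch reads,
    //    running concurrently with the stream on the same SparkContext.
    Seq("3.txt", "4.txt").foreach { f =>
      println(s"$f: " + sc.textFile(s"$dir/$f").count())
    }

    ssc.awaitTermination()
  }
}

Does something like this make sense, or is there a more standard way to do
it?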

I'm not sure whether I've described the problem clearly; if you have any
questions, please ask.


