I have an HDFS dir that contains many files:

/user/root/1.txt
/user/root/2.txt
/user/root/3.txt
/user/root/4.txt


and there is a daemon process that adds one file per minute to this dir
(e.g., 5.txt, 6.txt, 7.txt, ...).

I want to start a Spark Streaming job that first loads 3.txt and 4.txt and
then detects all new files added after 4.txt.
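
The obvious starting point seems to be fileStream with newFilesOnly =
false. Here is a rough, untested sketch of what I have in mind (the
numeric-name filter and the app name are just illustrations of my setup):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LoadOldAndWatchNew {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("LoadOldAndWatchNew"), Seconds(60))

    // Only files numbered >= 3 interest me (3.txt, 4.txt, 5.txt, ...).
    def wanted(p: Path): Boolean = {
      val stem = p.getName.stripSuffix(".txt")
      stem.nonEmpty && stem.forall(_.isDigit) && stem.toInt >= 3
    }

    // newFilesOnly = false should also pick up files already sitting in
    // the dir, not only files that appear after the job starts -- but as
    // far as I can tell it only reaches back over the stream's remember
    // window, so old files like 3.txt may still be skipped.
    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat](
        "hdfs:///user/root", wanted _, newFilesOnly = false)
      .map(_._2.toString)

    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}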

Please note that these files are large, so processing them takes a long
time. If I process 3.txt and 4.txt in a separate batch step before
launching the streaming job, 5.txt and 6.txt may be produced into this dir
while 3.txt and 4.txt are still being processed. When the streaming job
then starts, it will only pick up files that arrive after it starts (i.e.,
from 7.txt onward), so 5.txt and 6.txt will be missed.
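
One workaround I'm considering: snapshot the dir listing inside the same
application, start the stream with a filter that accepts anything not in
that snapshot, and only then backfill 3.txt and 4.txt as plain batch reads
on the shared SparkContext. Another rough, untested sketch (names are made
up; I'm also assuming the remember window,
spark.streaming.fileStream.minRememberDuration, covers files that land
between the snapshot and ssc.start()):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SnapshotThenStream {
  def main(args: Array[String]): Unit = {
    val dir = "hdfs:///user/root"
    val ssc = new StreamingContext(
      new SparkConf().setAppName("SnapshotThenStream"), Seconds(60))
    val sc = ssc.sparkContext

    // 1) Snapshot the listing *before* any processing starts.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val preExisting =
      fs.listStatus(new Path(dir)).map(_.getPath.getName).toSet

    // 2) Wire up the stream first: accept any file NOT in the snapshot.
    //    5.txt/6.txt created during the backfill below are not in the
    //    snapshot, so they should still be caught here.
    val newLines = ssc
      .fileStream[LongWritable, Text, TextInputFormat](
        dir, (p: Path) => !preExisting(p.getName), newFilesOnly = false)
      .map(_._2.toString)
    newLines.count().print()
    ssc.start()

    // 3) Backfill the old files I care about as ordinary batch reads,
    //    running concurrently with the stream on the same SparkContext.
    Seq("3.txt", "4.txt").foreach { f =>
      println(s"$f: " + sc.textFile(s"$dir/$f").count())
    }

    ssc.awaitTermination()
  }
}

Does something like this make sense, or is there a more standard way to do
it?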

I'm not sure whether I've described the problem clearly; if you have any
questions, please ask.


