On 16 Sep 2016, at 01:03, Peyman Mohajerian <mohaj...@gmail.com> wrote:
> You can listen to files in a specific directory using streamingContext.fileStream. Take a look at:
> http://spark.apache.org/docs/latest/streaming-programming-guide.html

Yes, this works. Here's an example I'm using to test object stores like S3 and Azure as sources of data:

https://github.com/steveloughran/spark/blob/c2b7d885f91bb447ace8fbac427b2fdf9c84b4ef/cloud/src/main/scala/org/apache/spark/cloud/examples/CloudStreaming.scala#L83

StreamingContext.textFileStream(streamGlobPath.toUri.toString) takes a directory ("/incoming/") or a glob path to directories ("incoming/2016/09/*") and will scan for data:

- It scans every window, looking for files with a modified time within that window.
- You can then hook a map up to the output, start the streaming context, and evaluate:

    val lines = ssc.textFileStream(streamGlobPath.toUri.toString)
    val matches = lines.filter(_.endsWith("3")).map(line => {
      sightings.add(1)
      line
    })
    matches.print()
    ssc.start()

(A self-contained version of this setup is sketched at the end of this mail.)

Once a file has been processed, it will not be scanned again, even if its modtime is updated (ignoring executor failure/restart, and the bits in the code about remember durations). That means updates to a file within a window can be missed. If you are writing to files from separate code, it is safest to write elsewhere and then copy/rename the file into the scanned directory once complete; see the rename sketch below.

(Things are slightly complicated by the fact that HDFS doesn't update modtimes until (a) the file is closed, or (b) enough data has been written that the write spans a block boundary. That means incremental writes to HDFS may appear to work, but once you write > 64 MB, or work with a different filesystem, changes may get lost.)

But: it does work, and it lets you glue streaming code onto any workflow which generates output in files.

On Thu, Sep 15, 2016 at 10:31 AM, Jörn Franke <jornfra...@gmail.com> wrote:

Hi,

I recommend that the third-party application put an empty file with the same filename as the original file, but with the extension ".uploaded". This is an indicator that the file has been fully (!) written to the filesystem; otherwise you risk reading only part of the file. Then you can have a file-system listener for this ".uploaded" file. Spark Streaming or Kafka is not needed/suitable if the server is a file server. You can use Oozie (maybe with a simple custom action) to poll for ".uploaded" files and transmit them.
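To make the textFileStream snippet above self-contained, here is a minimal standalone sketch. The 30-second batch interval, the /incoming/ directory, and the "sightings" accumulator name are all illustrative choices, not from the original code:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TextFileStreamExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("TextFileStreamExample")
        // each 30s batch scans for files with a modtime inside that window
        val ssc = new StreamingContext(conf, Seconds(30))
        // counts matching lines across the whole run
        val sightings = ssc.sparkContext.longAccumulator("sightings")

        val lines = ssc.textFileStream("/incoming/")
        val matches = lines.filter(_.endsWith("3")).map { line =>
          sightings.add(1)
          line
        }
        matches.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }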
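And a sketch of the "write elsewhere, then rename" pattern using the Hadoop FileSystem API. The paths are hypothetical; the point is that the in-flight file lives outside the scanned directory until it is complete:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val tmp = new Path("/tmp/inflight/data-0001.txt")   // not scanned
    val dst = new Path("/incoming/data-0001.txt")       // scanned by the stream

    val out = fs.create(tmp)
    try {
      out.write("record one\nrecord two\n".getBytes("UTF-8"))
    } finally {
      out.close() // HDFS settles the modtime when the file is closed
    }
    // rename is atomic on HDFS: the scanner sees a complete file or nothing
    fs.rename(tmp, dst)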
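For Jörn's ".uploaded" marker suggestion, a minimal sketch of both sides, again with hypothetical paths:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val data = new Path("/landing/report.csv")
    val marker = new Path(data.toString + ".uploaded")

    // uploader side: finish writing the data file, then drop the marker
    // (create().close() leaves a zero-byte file)
    fs.create(marker).close()

    // poller side (e.g. an Oozie action): only touch data files whose
    // marker exists, so partially written files are never picked up
    if (fs.exists(marker)) {
      // data file is fully written; safe to transmit/process it
    }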