Hello folks,

Our intended use case is:

- Spark Streaming app #1 reads from RabbitMQ and outputs to HDFS

- Spark Streaming app #2 reads #1's output and stores the data in
  Elasticsearch

The idea behind this architecture is that if Elasticsearch is down due to an 
upgrade or system error, we don't have to stop reading messages from the queue. 
We could also scale each process separately as needed.

After a few hours of research, my understanding is that Spark Streaming writes 
each batch's output as a *directory* whose name is built from the prefix and 
suffix you provide. This is despite the ScalaDoc for DStream's 
saveAsObjectFiles suggesting otherwise:

  /**
   * Save each RDD in this DStream as a Sequence file of serialized objects.
   * The file name at each batch interval is generated based on `prefix` and
   * `suffix`: "prefix-TIME_IN_MS.suffix".
   */
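
To make the discrepancy concrete, here is a small sketch of the naming that 
the ScalaDoc describes. `batchOutputName` is a hypothetical helper written 
for illustration, not part of Spark's API, and the HDFS path is made up. The 
comment at the end shows the layout I see in practice: the generated name is 
a directory holding part files rather than a single file.

```scala
// Hypothetical helper mirroring the "prefix-TIME_IN_MS.suffix" naming from
// the ScalaDoc above. Illustration only; this is not Spark's own code.
object BatchNaming {
  def batchOutputName(prefix: String, suffix: String, timeMs: Long): String =
    s"$prefix-$timeMs.$suffix"

  def main(args: Array[String]): Unit = {
    // Made-up prefix, suffix, and batch time.
    val name = batchOutputName("hdfs:///data/events", "obj", 1420070400000L)
    println(name)
    // On HDFS, that name is a directory, along the lines of:
    //   <name>/_SUCCESS
    //   <name>/part-00000
    //   <name>/part-00001
  }
}
```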

Spark Streaming can monitor an HDFS directory for new files, but subdirectories 
are not supported. So as far as I can tell, it is not possible to use Spark 
Streaming output as input for a different Spark Streaming app without somehow 
performing a separate operation in the middle.
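
A minimal sketch of what such a middle step might look like, folded into 
app #1 itself: each batch is written to a staging directory, and its part 
files are then renamed into one flat directory that app #2 monitors. This is 
an untested sketch under several assumptions: both paths are hypothetical, 
`socketTextStream` stands in for the RabbitMQ receiver, text files are used 
instead of object files to keep it short, and `rdd.isEmpty()` needs a Spark 
version that provides it.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlattenedOutput {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("app1-flattened-output")
    val ssc = new StreamingContext(conf, Seconds(30))

    // Stand-in source; the real app would read from RabbitMQ here.
    val messages = ssc.socketTextStream("localhost", 9999)

    // Hypothetical locations: a staging area, and the flat directory
    // that app #2 would monitor.
    val stagingRoot = "hdfs:///staging/app1"
    val watchedDir = new Path("hdfs:///incoming/app2")

    messages.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        // Write the batch the usual way: one directory of part files.
        val batchDir = new Path(s"$stagingRoot/batch-${time.milliseconds}")
        rdd.saveAsTextFile(batchDir.toString)

        val fs = batchDir.getFileSystem(rdd.sparkContext.hadoopConfiguration)
        fs.mkdirs(watchedDir)

        // Rename each part file into the flat directory, prefixing the
        // batch time so names from different batches do not collide.
        for (status <- fs.globStatus(new Path(batchDir, "part-*"))) {
          val target =
            new Path(watchedDir, s"${time.milliseconds}-${status.getPath.getName}")
          fs.rename(status.getPath, target)
        }

        // Remove the now-empty staging directory (and its _SUCCESS marker).
        fs.delete(batchDir, true)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

App #2 could then call `ssc.textFileStream("hdfs:///incoming/app2")` on the 
same hypothetical directory. The rename matters because the file stream 
expects files to appear in the monitored directory by being moved or renamed 
into it, not written in place.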

Am I missing something obvious? I've read some suggestions, such as using 
Hadoop to merge the directories (though I don't see how you would know their 
names) and reducing the partitions to 1 (which wouldn't help).

Any other suggestions? What is the expected pattern a developer would follow 
that would make Spark Streaming's output format usable?
