Hello folks,

Our intended use case is:

- Spark Streaming app #1 reads from RabbitMQ and outputs to HDFS
- Spark Streaming app #2 reads #1's output and stores the data in Elasticsearch

The idea behind this architecture is that if Elasticsearch is down due to an upgrade or a system error, we don't have to stop reading messages from the queue. We could also scale each process separately as needed.

After a few hours of research, my understanding is that Spark Streaming writes each batch as a *directory* of files, and the prefix and suffix you provide only determine the name of that directory. This is despite the ScalaDoc for DStream saveAsObjectFiles suggesting otherwise:

    /**
     * Save each RDD in this DStream as a Sequence file of serialized objects.
     * The file name at each batch interval is generated based on `prefix` and
     * `suffix`: "prefix-TIME_IN_MS.suffix".
     */

Spark Streaming can monitor an HDFS directory for new files, but subfolders are not supported. So as far as I can tell, it is not possible to use Spark Streaming output as input for a different Spark Streaming app without performing a separate operation in the middle.

Am I missing something obvious? I've read some suggestions, such as using Hadoop to merge the directories (whose names I don't see how you would know in advance) and reducing the number of partitions to 1 (which wouldn't help, because the single part file would still be written inside a per-batch directory).

Any other suggestions? What is the expected pattern a developer would follow to make Spark Streaming's output format usable?
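
To make the layout I am describing concrete, here is a minimal sketch in plain Scala with no Spark dependency. It only mimics what I see on disk: the names `OutputLayoutSketch`, `batchName`, `writeFakeBatch` and `topLevelFiles` are made up for illustration, the `events`/`obj` prefix and suffix are placeholders, and the `part-NNNNN` and `_SUCCESS` file names reflect what I observe rather than anything the ScalaDoc promises.

```scala
import java.io.File
import java.nio.file.Files

object OutputLayoutSketch {
  // Naming rule quoted in the ScalaDoc: "prefix-TIME_IN_MS.suffix".
  def batchName(prefix: String, suffix: String, timeMs: Long): String =
    s"$prefix-$timeMs.$suffix"

  // Mimic what one batch looks like on disk (assumption based on observation):
  // a directory named after the batch, holding one part file per partition
  // plus a _SUCCESS marker.
  def writeFakeBatch(root: File, prefix: String, suffix: String,
                     timeMs: Long, partitions: Int): File = {
    val batchDir = new File(root, batchName(prefix, suffix, timeMs))
    batchDir.mkdirs()
    (0 until partitions).foreach { i =>
      new File(batchDir, f"part-$i%05d").createNewFile()
    }
    new File(batchDir, "_SUCCESS").createNewFile()
    batchDir
  }

  // What a non-recursive directory monitor would consider: plain files
  // sitting directly in the watched directory, not inside subfolders.
  def topLevelFiles(dir: File): List[String] =
    dir.listFiles().filter(_.isFile).map(_.getName).sorted.toList

  def main(args: Array[String]): Unit = {
    val root = Files.createTempDirectory("stream-out").toFile
    val batch = writeFakeBatch(root, "events", "obj", 1420070400000L, 2)
    println(s"batch directory: ${batch.getName}")
    println(s"files visible at top level: ${topLevelFiles(root)}")
    println(s"files inside the batch directory: ${topLevelFiles(batch)}")
  }
}
```

The point of the sketch is that the watched directory itself never contains a data file. Everything app #2 would need sits one level down, inside a directory whose name includes a timestamp that app #2 cannot know ahead of time.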
</pre><font face="arial" size="2" color="#736F6E"> <a href="http://www.sdl.com/?utm_source=Email&utm_medium=Email%2BSignature&utm_campaign=SDL%2BStandard%2BEmail%2BSignature"> <img src="http://www.sdl.com/Content/images/SDLlogo2014.png" border=0><br><br>www.sdl.com </a><br><br> <font face="arial" size="1" color="#736F6E"> <b>SDL PLC confidential, all rights reserved.</b> If you are not the intended recipient of this mail SDL requests and requires that you delete it without acting upon or copying any of its contents, and we further request that you advise us.<BR><BR> SDL PLC is a public limited company registered in England and Wales. Registered number: 02675207. <br> Registered address: Globe House, Clivemont Road, Maidenhead, Berkshire SL6 7DY, UK.</font> This message has been scanned for malware by Websense. www.websense.com