Ya. Also I think I need to enable checkpointing and, rather than rebuilding the lineage DAG, store the RDD data into HDFS.
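The idea behind checkpointing is to persist the current state to reliable storage and recover from it on restart, instead of replaying the whole input history (the lineage). A toy sketch of that pattern in pure Python (no Spark; the class and file names are hypothetical, this only illustrates the persist-then-recover shape):

```python
import json
import os
import tempfile

class CheckpointedCounter:
    """Toy stand-in for checkpointed streaming state: instead of
    recomputing from the full input history, we persist the state
    and recover from the last checkpoint on restart."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        # Recover from an existing checkpoint if present.
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                self.state = json.load(f)
        else:
            self.state = {"count": 0}

    def process(self, events):
        self.state["count"] += len(events)
        # Write the checkpoint atomically: temp file + rename.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.checkpoint_path))
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.checkpoint_path)

# First "run" of the app.
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
app = CheckpointedCounter(path)
app.process(["e1", "e2", "e3"])

# Simulated restart (e.g. after a redeploy): state is recovered, not recomputed.
app2 = CheckpointedCounter(path)
print(app2.state["count"])  # 3
```

In Spark itself this corresponds to setting a checkpoint directory on HDFS and recreating the context from it on restart; the sketch above only shows the recovery shape, not the Spark API.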
On 23 September 2015 at 01:04, Adrian Tanase <atan...@adobe.com> wrote:

> Btw, I re-read the docs and I want to clarify that a reliable receiver + WAL
> gives you at-least-once, not exactly-once semantics.
>
> Sent from my iPhone
>
> On 21 Sep 2015, at 21:50, Adrian Tanase <atan...@adobe.com> wrote:
>
> I'm wondering, isn't this the canonical use case for WAL + reliable
> receiver?
>
> As far as I know, you can tune the MQTT server to wait for an ack on
> messages (QoS level 2?). With some support from the client library you
> could achieve exactly-once semantics on the read side, if you ack a
> message only after writing it to the WAL, correct?
>
> -adrian
>
> Sent from my iPhone
>
> On 21 Sep 2015, at 12:35, Petr Novak <oss.mli...@gmail.com> wrote:
>
> In short, there is no direct support for it in Spark AFAIK. You will
> either manage it in MQTT or have to add another layer of indirection -
> either in-memory based (observable streams, an in-memory db) or disk
> based (Kafka, HDFS files, a db) - which will keep your unprocessed events.
>
> Now realizing, there is support for backpressure in v1.5.0, but I don't
> know if it could be exploited, aka I don't know if it is possible to
> decouple event reading into memory from the actual processing code in
> Spark so that the latter could be swapped on the fly. Probably not
> without some custom-built facility for it.
>
> Petr
>
> On Mon, Sep 21, 2015 at 11:26 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>
>> I should read my posts at least once to avoid so many typos. Hopefully
>> you are brave enough to read through.
>>
>> Petr
>>
>> On Mon, Sep 21, 2015 at 11:23 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>>
>>> I think you would have to persist events somehow if you don't want to
>>> miss them. I don't see any other option there. Either in MQTT, if it is
>>> supported there, or by routing them through Kafka.
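Since a reliable receiver + WAL gives at-least-once delivery, a WAL replay after a failure can re-deliver messages, and the duplicates have to be handled downstream. A common way is to make the write side idempotent, e.g. by deduplicating on a message id. A minimal pure-Python sketch (no Spark or MQTT involved; the class and ids are hypothetical):

```python
class IdempotentSink:
    """Turns at-least-once delivery into effectively-once output by
    skipping message ids that were already written."""

    def __init__(self):
        self.seen = set()
        self.output = []

    def write(self, msg_id, payload):
        if msg_id in self.seen:
            return False  # duplicate from a WAL replay; skip it
        self.seen.add(msg_id)
        self.output.append(payload)
        return True

sink = IdempotentSink()
# Normal delivery, then a replay of message 1 after a receiver restart.
for msg_id, payload in [(1, "a"), (2, "b"), (1, "a")]:
    sink.write(msg_id, payload)

print(sink.output)  # ['a', 'b']
```

In a real deployment the seen-set would have to live in durable storage (or the sink itself must be naturally idempotent, e.g. keyed upserts), otherwise it is lost on restart along with the guarantee.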
>>> There is a WriteAheadLog in Spark, but you would have to decouple the
>>> MQTT stream reading and the processing into 2 separate jobs so that you
>>> could upgrade the processing one, assuming the reading one stays stable
>>> (without changes) across versions. But it is problematic because there
>>> is no easy way to share DStreams between jobs - you would have to
>>> develop your own facility for it.
>>>
>>> Alternatively, the reading job could save MQTT events in their most raw
>>> form into files - to limit the need to change code - and then the
>>> processing job would work on top of them using file-based Spark
>>> streaming. I think this is inefficient and can get quite complex if you
>>> would like to make it reliable.
>>>
>>> Basically, either MQTT supports persistence (which I don't know) or
>>> there is Kafka for these use cases.
>>>
>>> Another option would be, I think, to place observable streams between
>>> MQTT and Spark streaming, with backpressure, so that you could perform
>>> the upgrade before the buffers fill up.
>>>
>>> I'm sorry that this is not thought out well from my side, it is just a
>>> brainstorm, but it might lead you somewhere.
>>>
>>> Regards,
>>> Petr
>>>
>>> On Mon, Sep 21, 2015 at 10:09 AM, Jeetendra Gangele <
>>> gangele...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have a Spark streaming application with a batch interval (10 ms)
>>>> which is reading an MQTT channel and dumping the data from MQTT to HDFS.
>>>>
>>>> So suppose I have to deploy a new application jar (with changes in the
>>>> Spark streaming application) - what is the best way to deploy?
>>>> Currently I am doing it as below:
>>>>
>>>> 1. killing the running streaming app using yarn application -kill ID
>>>> 2. then starting the application again
>>>>
>>>> The problem with the above approach is that since we are not persisting
>>>> the events in MQTT, we will miss the events for the duration of the
>>>> deploy.
>>>>
>>>> How to handle this case?
>>>>
>>>> regards
>>>> Jeetendra
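Petr's "layer of indirection" idea - a reading job that keeps draining MQTT while the processing job is redeployed - can be sketched as a durable file buffer: the reader appends raw events and would ack to MQTT only after a flushed write; the processor keeps its own offset and tails the files independently, so killing and restarting it loses nothing. A toy single-file sketch in pure Python (no MQTT client; all names are hypothetical):

```python
import os
import tempfile

class FileBuffer:
    """Durable buffer decoupling a reader job from a processor job.
    The reader appends (and would ack upstream only after fsync); the
    processor reads from its own committed offset, so it can be stopped
    for a deploy and resume without losing events."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        with open(self.path, "a") as f:
            f.write(event + "\n")
            f.flush()
            os.fsync(f.fileno())  # only after this would we ack to MQTT

    def read_from(self, offset):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            events = f.read().splitlines()
        return events[offset:]

path = os.path.join(tempfile.mkdtemp(), "events.log")
buf = FileBuffer(path)

# The reader job keeps running while the processor is down for a deploy.
for e in ["e1", "e2", "e3"]:
    buf.append(e)

# The new processor version starts and resumes from its last committed offset.
offset = 1  # say e1 was already processed before the deploy
print(buf.read_from(offset))  # ['e2', 'e3']
```

This is essentially what Kafka provides out of the box (durable log + consumer offsets), which is why the thread keeps pointing there; the file sketch only shows why the indirection solves the redeploy gap.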