Spark standalone is not Yarn… or secure for that matter… ;-)
> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
> Spark streaming helps with aggregation because
>
> A. raw Kafka consumers have no built-in framework for shuffling
> amongst nodes, short of writing into an intermediate topic (I'm not
> touching Kafka Streams here, I don't have experience with it), and
>
> B. it deals with batches, so you can transactionally decide to commit
> or roll back your aggregate data and your offsets together. Otherwise your
> offsets and data store can get out of sync, leading to lost /
> duplicate data.
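[A minimal sketch of the commit/rollback idea in point B, with the store simulated in memory. All names here (`TransactionalSink`, `commitBatch`, etc.) are illustrative, not from the thread; in practice the same effect comes from, e.g., one RDBMS transaction that upserts both the aggregates and the Kafka offsets.]

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for a transactional store: aggregates and the
// last committed offset per partition change in one atomic step,
// so they can never get out of sync.
class TransactionalSink {
    private final Map<String, Long> aggregates = new HashMap<>();
    private final Map<Integer, Long> committedOffsets = new HashMap<>();

    // Commit a micro-batch: apply the aggregate updates and advance
    // the offset as one unit. A replayed batch (offset already
    // committed) is skipped, so redelivery can't double-count.
    public synchronized boolean commitBatch(int partition, long fromOffset,
                                            long toOffset, List<String> keys) {
        long committed = committedOffsets.getOrDefault(partition, -1L);
        if (fromOffset <= committed) {
            return false; // batch already applied -> replay is a no-op
        }
        for (String key : keys) {
            aggregates.merge(key, 1L, Long::sum);
        }
        committedOffsets.put(partition, toOffset);
        return true;
    }

    public synchronized long count(String key) {
        return aggregates.getOrDefault(key, 0L);
    }
}

public class Main {
    public static void main(String[] args) {
        TransactionalSink sink = new TransactionalSink();
        List<String> batch = List.of("clicks", "clicks", "views");
        sink.commitBatch(0, 0, 2, batch); // first delivery
        sink.commitBatch(0, 0, 2, batch); // redelivery after a crash
        System.out.println(sink.count("clicks")); // counted once: 2
        System.out.println(sink.count("views"));  // 1
    }
}
```

With separate "commit data, then commit offsets" steps and no such guard, a crash between the two is exactly what produces the lost/duplicate data described above.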
>
> Regarding long running spark jobs, I have streaming jobs in the
> standalone manager that have been running for 6 months or more.
>
> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
> <msegel_had...@hotmail.com> wrote:
>> Ok… so what’s the tricky part?
>> Spark Streaming isn’t real time so if you don’t mind a slight delay in
>> processing… it would work.
>>
>> The drawback is that you now have a long-running Spark job (assuming under
>> YARN), and that could become a problem in terms of security and resources.
>> (How well does YARN handle long-running jobs these days in a secured
>> cluster? Steve L. may have some insight… )
>>
>> Raw HDFS would become a problem because Apache HDFS is still WORM (write
>> once, read many). (Do you want to write your own compaction code? Or use
>> Hive 1.x+?)
>>
>> HBase? Depending on your admin… stability could be a problem.
>> Cassandra? That would be a separate cluster and that in itself could be a
>> problem…
>>
>> YMMV so you need to address the pros/cons of each tool specific to your
>> environment and skill level.
>>
>> HTH
>>
>> -Mike
>>
>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>
>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>
>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw
>>> data into Kafka.
>>>
>>> I need to:
>>>
>>> - Do ETL on the data, and standardize it.
>>>
>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
>>> ElasticSearch / Postgres)
>>>
>>> - Query this data to generate reports / analytics (There will be a web UI
>>> which will be the front-end to the data, and will show the reports)
>>>
>>> Java is being used as the backend language for everything (backend of the
>>> web UI, as well as the ETL layer)
>>>
>>> I'm considering:
>>>
>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive
>>> raw data from Kafka, standardize & store it)
>>>
>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data,
>>> and to allow queries
>>>
>>> - In the backend of the web UI, I could either use Spark to run queries
>>> across the data (mostly filters), or directly run queries against Cassandra
>>> / HBase
>>>
>>> I'd appreciate some thoughts / suggestions on which of these alternatives I
>>> should go with (e.g., using raw Kafka consumers vs Spark for ETL, which
>>> persistent data store to use, and how to query that data store in the
>>> backend of the web UI, for displaying the reports).
>>>
>>>
>>> Thanks.
>>
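
[Whichever consumer layer is chosen (raw Kafka consumers or Spark Streaming), the standardization step of the ETL described above reduces to the same transform. A minimal Java sketch, with all type and field names (`StandardRecord`, `event`, the `key=value` raw format) being illustrative assumptions, not from the thread — real producers would more likely emit JSON:]

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Illustrative ETL standardization step: normalize a raw payload
// from one of the 5-6 API readers into a common schema before it
// is written to the chosen data store.
public class Standardizer {
    // Hypothetical common schema; real fields depend on the APIs.
    public record StandardRecord(String source, String event, Instant seenAt) {}

    // Raw payloads are assumed here to be "key=value;key=value" strings.
    public static StandardRecord standardize(String source, String raw) {
        Map<String, String> fields = new HashMap<>();
        for (String pair : raw.split(";")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) {
                fields.put(kv[0].trim().toLowerCase(), kv[1].trim());
            }
        }
        // Normalize casing/whitespace; unknown events get a default.
        String event = fields.getOrDefault("event", "unknown").toLowerCase();
        return new StandardRecord(source, event, Instant.now());
    }

    public static void main(String[] args) {
        StandardRecord r = standardize("api-3", " Event = CLICK ; user = 42 ");
        System.out.println(r.source() + " " + r.event()); // api-3 click
    }
}
```

The same function body runs unchanged inside a raw consumer's poll loop or inside a Spark Streaming `map`, which is one reason the consumer-layer choice can be deferred behind this transform.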