Spark standalone is not YARN… or secure for that matter… ;-)

> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <c...@koeninger.org> wrote:
> 
> Spark Streaming helps with aggregation because
> 
> A. raw Kafka consumers have no built-in framework for shuffling
> amongst nodes, short of writing into an intermediate topic (I'm not
> touching Kafka Streams here; I don't have experience with it), and
> 
> B. it deals with batches, so you can transactionally decide to commit
> or roll back your aggregate data and your offsets together.  Otherwise
> your offsets and data store can get out of sync, leading to lost /
> duplicate data.
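> 
> To make B concrete, here's a rough sketch of the idea in Java / JDBC
> (assuming a relational store like Postgres; the table and column names
> are made up):
> 
>   import java.sql.Connection;
>   import java.sql.PreparedStatement;
> 
>   public class OffsetCommit {
>     // Commit a batch's aggregates and its Kafka offsets in ONE
>     // transaction, so they can never get out of sync.
>     static void commitBatch(Connection conn, String topic, int part,
>                             long untilOffset, String aggKey, long aggDelta)
>         throws Exception {
>       conn.setAutoCommit(false);
>       try (PreparedStatement agg = conn.prepareStatement(
>                "UPDATE aggregates SET total = total + ? WHERE key = ?");
>            PreparedStatement off = conn.prepareStatement(
>                "UPDATE kafka_offsets SET until_offset = ? " +
>                "WHERE topic = ? AND part = ?")) {
>         agg.setLong(1, aggDelta);
>         agg.setString(2, aggKey);
>         agg.executeUpdate();
>         off.setLong(1, untilOffset);
>         off.setString(2, topic);
>         off.setInt(3, part);
>         off.executeUpdate();
>         conn.commit();   // aggregates and offsets land together...
>       } catch (Exception e) {
>         conn.rollback(); // ...or not at all
>         throw e;
>       }
>     }
>   }
> 
> On restart you read the stored offsets back and seek to them, instead
> of relying on Kafka's auto commit.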
> 
> Regarding long-running Spark jobs, I have streaming jobs in the
> standalone manager that have been running for 6 months or more.
> 
> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
> <msegel_had...@hotmail.com> wrote:
>> Ok… so what’s the tricky part?
>> Spark Streaming isn’t real time, so if you don’t mind a slight delay in 
>> processing… it would work.
>> 
>> The drawback is that you now have a long-running Spark job (assuming it 
>> runs under YARN), and that could become a problem in terms of security and 
>> resources. (How well does YARN handle long-running jobs these days in a 
>> secured cluster? Steve L. may have some insight…)
>> 
>> Raw HDFS would become a problem because Apache HDFS is still WORM (write 
>> once, read many). (Do you want to write your own compaction code? Or use 
>> Hive 1.x+?)
>> 
>> HBase? Depending on your admin… stability could be a problem.
>> Cassandra? That would be a separate cluster and that in itself could be a 
>> problem…
>> 
>> YMMV, so you need to weigh the pros/cons of each tool against your 
>> environment and skill level.
>> 
>> HTH
>> 
>> -Mike
>> 
>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>> 
>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>> 
>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw 
>>> data into Kafka.
>>> 
>>> I need to:
>>> 
>>> - Do ETL on the data, and standardize it.
>>> 
>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
>>> ElasticSearch / Postgres)
>>> 
>>> - Query this data to generate reports / analytics (There will be a web UI 
>>> which will be the front-end to the data, and will show the reports)
>>> 
>>> Java is being used as the backend language for everything (backend of the 
>>> web UI, as well as the ETL layer)
>>> 
>>> I'm considering:
>>> 
>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive 
>>> raw data from Kafka, standardize & store it; a rough Spark Streaming sketch 
>>> follows this list)
>>> 
>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, 
>>> and to allow queries
>>> 
>>> - In the backend of the web UI, I could either use Spark to run queries 
>>> across the data (mostly filters), or directly run queries against Cassandra 
>>> / HBase (a direct-query sketch is at the end of this mail)
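>>> 
>>> For the Spark Streaming option, this is roughly what I have in mind
>>> (untested sketch against the spark-streaming-kafka-0-10 API; the broker
>>> address, topic name, group id and standardize() step are placeholders):
>>> 
>>>   import java.util.*;
>>>   import org.apache.kafka.clients.consumer.ConsumerRecord;
>>>   import org.apache.kafka.common.serialization.StringDeserializer;
>>>   import org.apache.spark.SparkConf;
>>>   import org.apache.spark.streaming.Durations;
>>>   import org.apache.spark.streaming.api.java.*;
>>>   import org.apache.spark.streaming.kafka010.*;
>>> 
>>>   public class EtlJob {
>>>     public static void main(String[] args) throws Exception {
>>>       SparkConf conf = new SparkConf().setAppName("kafka-etl");
>>>       JavaStreamingContext jssc =
>>>           new JavaStreamingContext(conf, Durations.seconds(10));
>>> 
>>>       Map<String, Object> kafkaParams = new HashMap<>();
>>>       kafkaParams.put("bootstrap.servers", "broker1:9092");
>>>       kafkaParams.put("key.deserializer", StringDeserializer.class);
>>>       kafkaParams.put("value.deserializer", StringDeserializer.class);
>>>       kafkaParams.put("group.id", "etl");
>>>       kafkaParams.put("enable.auto.commit", false);
>>> 
>>>       // one stream partition per Kafka partition, offsets managed by us
>>>       JavaInputDStream<ConsumerRecord<String, String>> stream =
>>>           KafkaUtils.createDirectStream(
>>>               jssc,
>>>               LocationStrategies.PreferConsistent(),
>>>               ConsumerStrategies.<String, String>Subscribe(
>>>                   Collections.singletonList("raw-events"), kafkaParams));
>>> 
>>>       stream.foreachRDD(rdd ->
>>>           rdd.map(r -> standardize(r.value()))   // the ETL step
>>>              .foreachPartition(records -> {
>>>                // write each standardized record to the chosen store
>>>              }));
>>> 
>>>       jssc.start();
>>>       jssc.awaitTermination();
>>>     }
>>> 
>>>     // placeholder for whatever standardization is needed
>>>     static String standardize(String raw) { return raw.trim(); }
>>>   }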
>>> 
>>> I'd appreciate some thoughts / suggestions on which of these alternatives I 
>>> should go with (e.g., using raw Kafka consumers vs Spark for ETL, which 
>>> persistent data store to use, and how to query that data store in the 
>>> backend of the web UI, for displaying the reports).
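>>> 
>>> For the direct-query alternative, the backend of the web UI could query
>>> Cassandra along these lines (assuming the DataStax 3.x Java driver;
>>> keyspace, table and columns are invented for illustration):
>>> 
>>>   import com.datastax.driver.core.Cluster;
>>>   import com.datastax.driver.core.ResultSet;
>>>   import com.datastax.driver.core.Row;
>>>   import com.datastax.driver.core.Session;
>>> 
>>>   public class ReportQuery {
>>>     public static void main(String[] args) {
>>>       try (Cluster cluster = Cluster.builder()
>>>                .addContactPoint("127.0.0.1").build();
>>>            Session session = cluster.connect("reports")) {
>>>         // mostly-filter queries map well to CQL, as long as the
>>>         // filtered columns are part of the primary key
>>>         ResultSet rs = session.execute(
>>>             "SELECT day, total FROM daily_totals WHERE source = ?",
>>>             "api-1");
>>>         for (Row row : rs) {
>>>           System.out.println(row.getDate("day") + " " + row.getLong("total"));
>>>         }
>>>       }
>>>     }
>>>   }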
>>> 
>>> 
>>> Thanks.
>> 
