OP mentioned HBase or HDFS as persistent storage. Therefore they have to be 
running YARN if they are considering Spark. 
(Assuming that you’re not trying to split storage and compute and run 
standalone Spark outside your cluster. You can, but then you have more moving 
parts…) 

I never said anything about putting anything on a public network; I mentioned 
running a secured cluster.
You don’t deal with PII or other regulated data, do you? 


If you read my original post, you are correct that we don’t have much, if any, 
real information. 
Based on what the OP said, there are design considerations, since every tool he 
mentioned has pluses and minuses, but the problem isn’t really that challenging 
unless you have something extraordinary, like high velocity or some other 
constraint. 

BTW, depending on scale and velocity… your relational engines may become 
problematic. 
HTH

-Mike


> On Sep 29, 2016, at 1:51 PM, Cody Koeninger <c...@koeninger.org> wrote:
> 
> The OP didn't say anything about Yarn, and why are you contemplating
> putting Kafka or Spark on public networks to begin with?
> 
> Gwen's right, absent any actual requirements this is kind of pointless.
> 
> On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel
> <msegel_had...@hotmail.com> wrote:
>> Spark standalone is not Yarn… or secure for that matter… ;-)
>> 
>>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>> 
>>> Spark streaming helps with aggregation because
>>> 
>>> A. raw kafka consumers have no built in framework for shuffling
>>> amongst nodes, short of writing into an intermediate topic (I'm not
>>> touching Kafka Streams here, I don't have experience), and
>>> 
>>> B. it deals with batches, so you can transactionally decide to commit
>>> or rollback your aggregate data and your offsets.  Otherwise your
>>> offsets and data store can get out of sync, leading to lost /
>>> duplicate data.
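>>> 
>>> (A minimal sketch of that pattern in Java, assuming the
>>> spark-streaming-kafka-0-10 integration; the broker address, topic name,
>>> and the ResultStore helper are hypothetical placeholders for whatever
>>> store you write aggregates and offsets to in one transaction.)
>>> 
>>> import java.util.*;
>>> import org.apache.kafka.clients.consumer.ConsumerRecord;
>>> import org.apache.spark.SparkConf;
>>> import org.apache.spark.streaming.Durations;
>>> import org.apache.spark.streaming.api.java.*;
>>> import org.apache.spark.streaming.kafka010.*;
>>> 
>>> public class TransactionalAggregation {
>>>   public static void main(String[] args) throws Exception {
>>>     JavaStreamingContext jssc = new JavaStreamingContext(
>>>         new SparkConf().setAppName("etl"), Durations.seconds(30));
>>> 
>>>     Map<String, Object> kafkaParams = new HashMap<>();
>>>     kafkaParams.put("bootstrap.servers", "broker:9092");   // placeholder
>>>     kafkaParams.put("key.deserializer",
>>>         "org.apache.kafka.common.serialization.StringDeserializer");
>>>     kafkaParams.put("value.deserializer",
>>>         "org.apache.kafka.common.serialization.StringDeserializer");
>>>     kafkaParams.put("group.id", "etl-group");
>>>     kafkaParams.put("enable.auto.commit", false);          // we manage offsets ourselves
>>> 
>>>     JavaInputDStream<ConsumerRecord<String, String>> stream =
>>>         KafkaUtils.createDirectStream(
>>>             jssc,
>>>             LocationStrategies.PreferConsistent(),
>>>             ConsumerStrategies.<String, String>Subscribe(
>>>                 Arrays.asList("raw-events"), kafkaParams)); // hypothetical topic
>>> 
>>>     stream.foreachRDD(rdd -> {
>>>       // The direct stream exposes the Kafka offset ranges for this batch.
>>>       OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
>>> 
>>>       // Standardize / aggregate the batch (details elided).
>>>       List<String> standardized = rdd.map(ConsumerRecord::value).collect();
>>> 
>>>       // Write aggregates AND offsets in the same transaction, so a
>>>       // failure rolls back both together and they never drift apart.
>>>       ResultStore.commitAtomically(standardized, offsets);
>>>     });
>>> 
>>>     jssc.start();
>>>     jssc.awaitTermination();
>>>   }
>>> 
>>>   // Hypothetical stand-in: in practice this would open a transaction on
>>>   // your store and write both the rows and the offset ranges.
>>>   static class ResultStore {
>>>     static void commitAtomically(List<String> rows, OffsetRange[] offsets) {
>>>       /* elided */
>>>     }
>>>   }
>>> }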
>>> 
>>> Regarding long running spark jobs, I have streaming jobs in the
>>> standalone manager that have been running for 6 months or more.
>>> 
>>> On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel
>>> <msegel_had...@hotmail.com> wrote:
>>>> Ok… so what’s the tricky part?
>>>> Spark Streaming isn’t real time so if you don’t mind a slight delay in 
>>>> processing… it would work.
>>>> 
>>>> The drawback is that you now have a long running Spark Job (assuming under 
>>>> YARN) and that could become a problem in terms of security and resources.
>>>> (How well does Yarn handle long running jobs these days in a secured 
>>>> Cluster? Steve L. may have some insight… )
>>>> 
>>>> Raw HDFS would become a problem because Apache HDFS is still WORM (write 
>>>> once, read many). (Do you want to write your own compaction code? Or use 
>>>> Hive 1.x+?)
>>>> 
>>>> HBase? Depending on your admin… stability could be a problem.
>>>> Cassandra? That would be a separate cluster and that in itself could be a 
>>>> problem…
>>>> 
>>>> YMMV so you need to address the pros/cons of each tool specific to your 
>>>> environment and skill level.
>>>> 
>>>> HTH
>>>> 
>>>> -Mike
>>>> 
>>>>> On Sep 29, 2016, at 8:54 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>> 
>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>> 
>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw 
>>>>> data into Kafka.
>>>>> 
>>>>> I need to:
>>>>> 
>>>>> - Do ETL on the data, and standardize it.
>>>>> 
>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / 
>>>>> ElasticSearch / Postgres)
>>>>> 
>>>>> - Query this data to generate reports / analytics (There will be a web UI 
>>>>> which will be the front-end to the data, and will show the reports)
>>>>> 
>>>>> Java is being used as the backend language for everything (backend of the 
>>>>> web UI, as well as the ETL layer)
>>>>> 
>>>>> I'm considering:
>>>>> 
>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer 
>>>>> (receive raw data from Kafka, standardize & store it)
>>>>> 
>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data, 
>>>>> and to allow queries
>>>>> 
>>>>> - In the backend of the web UI, I could either use Spark to run queries 
>>>>> across the data (mostly filters), or directly run queries against 
>>>>> Cassandra / HBase
>>>>> 
>>>>> I'd appreciate some thoughts / suggestions on which of these alternatives 
>>>>> I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which 
>>>>> persistent data store to use, and how to query that data store in the 
>>>>> backend of the web UI, for displaying the reports).
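>>>>> 
>>>>> (For the raw-consumer option, a minimal sketch in Java of what the ETL
>>>>> loop might look like; the broker address, topic name, and the
>>>>> standardize/writeToStore helpers are hypothetical placeholders.)
>>>>> 
>>>>> import java.util.*;
>>>>> import org.apache.kafka.clients.consumer.*;
>>>>> 
>>>>> public class RawConsumerEtl {
>>>>>   public static void main(String[] args) {
>>>>>     Properties props = new Properties();
>>>>>     props.put("bootstrap.servers", "broker:9092");   // placeholder
>>>>>     props.put("group.id", "etl-group");
>>>>>     props.put("enable.auto.commit", "false");        // commit only after the write
>>>>>     props.put("key.deserializer",
>>>>>         "org.apache.kafka.common.serialization.StringDeserializer");
>>>>>     props.put("value.deserializer",
>>>>>         "org.apache.kafka.common.serialization.StringDeserializer");
>>>>> 
>>>>>     try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
>>>>>       consumer.subscribe(Arrays.asList("raw-events")); // hypothetical topic
>>>>>       while (true) {
>>>>>         ConsumerRecords<String, String> records = consumer.poll(1000);
>>>>>         List<String> standardized = new ArrayList<>();
>>>>>         for (ConsumerRecord<String, String> record : records) {
>>>>>           standardized.add(standardize(record.value())); // ETL step
>>>>>         }
>>>>>         writeToStore(standardized);  // e.g. Cassandra / HBase / Postgres
>>>>>         consumer.commitSync();       // offsets advance only after the write succeeds
>>>>>       }
>>>>>     }
>>>>>   }
>>>>> 
>>>>>   // Hypothetical stand-ins for the standardization and storage steps.
>>>>>   static String standardize(String raw) { return raw.trim(); }
>>>>>   static void writeToStore(List<String> rows) { /* elided */ }
>>>>> }
>>>>> 
>>>>> The trade-off with this approach is that any shuffling or cross-consumer
>>>>> aggregation is left entirely to your own code.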
>>>>> 
>>>>> 
>>>>> Thanks.
>>>> 
>> 

