Is there an advantage to that over consuming directly from Kafka? Nothing is
being done to the data except some light ETL before storing it in
Cassandra.
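
To make "light ETL" concrete, here is a minimal sketch of the standardize
step in plain Java. Everything in it is an assumption for illustration: the
class name RecordStandardizer, the raw field names (id/ID/uuid,
ts/timestamp/created_at), and the three-field target schema are hypothetical
and not taken from any real producer. It uses no Kafka, Spark, or Cassandra
APIs.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps one raw record into a common schema.
final class RecordStandardizer {
    private RecordStandardizer() {}

    // Returns a map with the keys "source", "id" and "timestamp".
    // The timestamp is assumed to arrive as epoch milliseconds and is
    // rendered as an ISO-8601 UTC string; missing fields become null.
    static Map<String, String> standardize(String source, Map<String, String> raw) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put("source", source.trim().toLowerCase(Locale.ROOT));
        out.put("id", firstPresent(raw, "id", "ID", "uuid"));
        String epochMillis = firstPresent(raw, "ts", "timestamp", "created_at");
        out.put("timestamp", epochMillis == null
                ? null
                : Instant.ofEpochMilli(Long.parseLong(epochMillis)).toString());
        return out;
    }

    // Returns the trimmed value of the first key that has a non-blank value,
    // or null if none of the keys do.
    private static String firstPresent(Map<String, String> raw, String... keys) {
        for (String key : keys) {
            String value = raw.get(key);
            if (value != null && !value.trim().isEmpty()) {
                return value.trim();
            }
        }
        return null;
    }
}
```

Because the method is a pure function, it could be called from a plain Kafka
consumer poll loop or from a Spark Streaming map step, so the ingestion
choice can change later without touching the ETL logic.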

On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <deepakmc...@gmail.com>
wrote:

> It's better to use Spark's direct stream to ingest from Kafka.
>
> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>
>> I don't think I need a different speed storage and batch storage. Just
>> taking in raw data from Kafka, standardizing, and storing it somewhere
>> where the web UI can query it, seems like it will be enough.
>>
>> I'm thinking about:
>>
>> - Reading data from Kafka via Spark Streaming
>> - Standardizing, then storing it in Cassandra
>> - Querying Cassandra from the web UI
>>
>> That seems like it will work. My question now is whether to use Spark
>> Streaming to read Kafka, or use Kafka consumers directly.
>>
>>
>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> - Spark Streaming to read data from Kafka
>>> - Storing the data on HDFS using Flume
>>>
>>> You don't need Spark Streaming to read data from Kafka and store it on
>>> HDFS. It is a waste of resources.
>>>
>>> Set up Flume with Kafka as the source and HDFS as the sink directly:
>>>
>>> KafkaAgent.sources = kafka-sources
>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>>
>>> That will be for your batch layer. To analyse, you can read directly from
>>> the HDFS files with Spark, or simply store the data in a database of your
>>> choice via cron or something. Do not mix your batch layer with your speed
>>> layer.
>>>
>>> Your speed layer will ingest the same data directly from Kafka into
>>> Spark Streaming, and that will be online or near real time (defined by
>>> your window).
>>>
>>> Then you have a serving layer to present data from both the speed layer
>>> (the one from Spark Streaming) and the batch layer.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn:
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>
>>>> The web UI is actually the speed layer; it needs to be able to query
>>>> the data online and show the results in real time.
>>>>
>>>> It also needs a custom front-end, so a system like Tableau can't be
>>>> used; it must have a custom backend + front-end.
>>>>
>>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>>
>>>> - Spark Streaming to read data from Kafka
>>>> - Storing the data on HDFS using Flume
>>>> - Using Spark to query the data in the backend of the web UI?
>>>>
>>>>
>>>>
>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>>>>> stored on HDFS using Flume.
>>>>>
>>>>> -  Query this data to generate reports / analytics (There will be a
>>>>> web UI which will be the front-end to the data, and will show the reports)
>>>>>
>>>>> This is basically the batch layer, and you need something like Tableau
>>>>> or Zeppelin to query the data.
>>>>>
>>>>> You will also need Spark Streaming to query data online for the speed
>>>>> layer. That data could be stored in some transient fabric like Ignite or
>>>>> even Druid.
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>>>
>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmc...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> What is the message inflow?
>>>>>>> If it's really high, Spark will definitely be of great use.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Deepak
>>>>>>>
>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>>>
>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>>>> raw data into Kafka.
>>>>>>>>
>>>>>>>> I need to:
>>>>>>>>
>>>>>>>> - Do ETL on the data, and standardize it.
>>>>>>>>
>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw
>>>>>>>> HDFS / ElasticSearch / Postgres)
>>>>>>>>
>>>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>>>> web UI which will be the front-end to the data, and will show the
>>>>>>>> reports)
>>>>>>>>
>>>>>>>> Java is being used as the backend language for everything (backend
>>>>>>>> of the web UI, as well as the ETL layer)
>>>>>>>>
>>>>>>>> I'm considering:
>>>>>>>>
>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>>>
>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>>>> data, and to allow queries
>>>>>>>>
>>>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>>>> queries across the data (mostly filters), or directly run queries
>>>>>>>> against Cassandra / HBase
>>>>>>>>
>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>>>> alternatives I should go with (e.g., using raw Kafka consumers vs
>>>>>>>> Spark for ETL, which persistent data store to use, and how to query
>>>>>>>> that data store in the backend of the web UI, for displaying the
>>>>>>>> reports).
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
