If your core requirement is ad-hoc real-time queries over the data,
then the standard Hadoop-centric answer would be:

Ingest via Kafka,
use Flume or possibly Spark Streaming to read and land the data in
Parquet on HDFS (or possibly Kudu), and
Impala to query it.
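
If you go the Spark Streaming route, a minimal sketch of landing the Kafka
feed as Parquet on HDFS could look like the following (Structured Streaming
API; the topic name, broker address, paths and the JSON price schema are all
placeholders, and you need the spark-sql-kafka connector on the classpath):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object PricesToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PricesToParquet")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical schema for the price messages; adjust to the real feed.
    val priceSchema = new StructType()
      .add("ticker", StringType)
      .add("price", DoubleType)
      .add("ts", TimestampType)

    // Read the Kafka topic (broker address and topic name are placeholders).
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "prices")
      .load()

    // Kafka delivers the value as binary; assume it is a JSON price record.
    val prices = raw
      .select(from_json($"value".cast("string"), priceSchema).as("p"))
      .select("p.*")

    // Land the data as Parquet on HDFS, partitioned by date for housekeeping.
    val query = prices
      .withColumn("dt", to_date($"ts"))
      .writeStream
      .format("parquet")
      .option("path", "hdfs:///data/prices/parquet")
      .option("checkpointLocation", "hdfs:///data/prices/_checkpoints")
      .partitionBy("dt")
      .start()

    query.awaitTermination()
  }
}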

>> On 15 September 2016 at 09:35, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> This is for fishing for some ideas.
>>>
>>> In the design we get prices directly through Kafka into Flume and store
>>> them on HDFS as text files.
>>> We can then use Spark with Zeppelin to present data to the users.
>>>
>>> This works. However, I am aware that once the volume of flat files rises
>>> one needs to do housekeeping. You don't want to read all files every time.
>>>
>>> A more viable alternative would be to read the data into some form of table
>>> (Hive etc.) periodically through an hourly cron job, so the batch process
>>> will have accurate, up-to-date data up to the last hour.
>>>
>>> That would certainly be an easier option for the users as well.
>>>
>>> I was wondering what would be the best strategy here. Druid, Hive others?
>>>
>>> The business case here is that users may want to access older data, so a
>>> database of some sort would be a better solution? In all likelihood they want
>>> a week's data.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's technical content is explicitly disclaimed. The
>>> author will in no case be liable for any monetary damages arising from such
>>> loss, damage or destruction.
>>>
>>>
>>
>>
>
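
On the hourly Hive-table alternative raised in the quoted mail, a rough sketch
of the batch job an hourly cron could kick off is below (the landing path, the
three-column CSV layout, the column names and the database/table name are all
assumptions):

import org.apache.spark.sql.SparkSession

object HourlyPricesLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HourlyPricesLoad")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical layout: Flume lands CSV files under /data/prices/raw/<yyyy-MM-dd-HH>/
    val hour = args(0)  // e.g. "2016-09-15-09", passed in by the cron job

    // Assumes three comma-separated fields per line: ticker, price, timestamp.
    val prices = spark.read
      .option("header", "false")
      .option("inferSchema", "true")
      .csv(s"hdfs:///data/prices/raw/$hour/*")
      .toDF("ticker", "price", "ts")

    // Append the hour's data to a Parquet-backed Hive table so Zeppelin/Spark SQL
    // users query the table instead of re-reading every flat file.
    prices.write
      .mode("append")
      .format("parquet")
      .saveAsTable("marketdata.prices")
  }
}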
