If your core requirement is ad-hoc real-time queries over the data, then the standard Hadoop-centric answer would be:
Ingest via Kafka (perhaps with Flume, or possibly Spark Streaming) to read and land the data in Parquet on HDFS, or possibly Kudu, and use Impala to query it. Sketches of that pipeline, and of the hourly-batch alternative you mention, follow the quoted thread below.

>> On 15 September 2016 at 09:35, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> This is fishing for some ideas.
>>>
>>> In the design we get prices directly through Kafka into Flume and store
>>> them on HDFS as text files.
>>> We can then use Spark with Zeppelin to present the data to the users.
>>>
>>> This works. However, I am aware that once the volume of flat files rises,
>>> one needs to do housekeeping. You don't want to read all the files every time.
>>>
>>> A more viable alternative would be to read the data into some form of tables
>>> (Hive etc.) periodically through an hourly cron job, so the batch process will
>>> have accurate data up to the last hour.
>>>
>>> That would certainly be an easier option for the users as well.
>>>
>>> I was wondering what would be the best strategy here. Druid, Hive, others?
>>>
>>> The business case here is that users may want to access older data, so a
>>> database of some sort would be a better solution? In all likelihood they want
>>> a week's data.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
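A minimal sketch of the first leg, Kafka into Spark Structured Streaming landing Parquet on HDFS for Impala to query. It assumes Spark 2.x with the spark-sql-kafka connector on the classpath; the broker address, topic name and HDFS paths are illustrative, not taken from this thread.

    // Land Kafka price messages as Parquet on HDFS with Structured Streaming.
    // Broker, topic and paths below are illustrative assumptions.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object PricesToParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("prices-to-parquet")
          .getOrCreate()

        // Read raw price messages from the Kafka topic as a streaming DataFrame.
        val prices = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka1:9092")
          .option("subscribe", "prices")
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

        // Append each micro-batch as Parquet files; Impala or Hive can then
        // query the directory through an external table.
        prices.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/prices")
          .option("checkpointLocation", "hdfs:///checkpoints/prices")
          .trigger(Trigger.ProcessingTime("1 minute"))
          .start()
          .awaitTermination()
      }
    }

The one-minute trigger is just a placeholder; shorter triggers mean smaller, more numerous files, which is exactly the housekeeping trade-off raised in the original question.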
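For the hourly-batch alternative in the quoted message, a sketch of a job that an hourly cron could launch with spark-submit, loading the landed text files into a Hive table. The schema, paths and table name are guesses for illustration only.

    // Hourly batch: load landed text files and append them to a Hive table,
    // so users query the table rather than an ever-growing pile of flat files.
    // Schema, paths and table name are illustrative assumptions.
    import org.apache.spark.sql.SparkSession

    object HourlyPricesBatch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hourly-prices-batch")
          .enableHiveSupport()
          .getOrCreate()

        // Assume comma-separated lines of ticker,price,timestamp in the landing area.
        val prices = spark.read
          .option("inferSchema", "true")
          .csv("hdfs:///landing/prices/")
          .toDF("ticker", "price", "ts")

        // Append into a Hive-managed table (created on the first run).
        prices.write
          .mode("append")
          .saveAsTable("marketdata.prices")
      }
    }

Partitioning the table by date would keep the typical "last week of data" queries from scanning everything that has ever landed.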