Re: Hadoop Realtime Queries

Alex Kamil Thu, 31 Jul 2014 11:28:48 -0700

NP,

We use HBase + Phoenix for "real time" SQL queries in prod:
 http://phoenix.apache.org/


By real time I mean milliseconds for small queries, or seconds for
hundreds of millions of rows. The speed mostly depends on how many
nodes/HBase regionservers are in the cluster. HBase is great for
parallel scanning of TBs of data, and Phoenix adds standard SQL and the
capability to run JOINs across multiple tables. It uses HBase coprocessors
to optimize aggregate queries. It's a breeze to install (just a standard
JDBC driver) and so far it has been very stable.
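
For illustration, here is a minimal sketch of what querying Phoenix over
JDBC can look like. It is not taken from our production code: the table
web_events, its columns, the ZooKeeper host names and the /hbase znode are
all hypothetical placeholders, and it assumes the Phoenix client jar is on
the classpath. The GROUP BY is just an example of an aggregate query.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class PhoenixQuerySketch {

    // Hypothetical table and column names, used only for illustration.
    static final String SQL =
        "SELECT host, COUNT(*) AS events FROM web_events "
      + "WHERE event_day = ? GROUP BY host";

    // Phoenix JDBC URLs have the form
    // jdbc:phoenix:<zookeeper quorum>[:<port>[:<root znode>]].
    static String jdbcUrl(String zkQuorum, int zkPort, String znodeParent) {
        return "jdbc:phoenix:" + zkQuorum + ":" + zkPort + ":" + znodeParent;
    }

    // Runs the aggregate query for one day and prints one line per host.
    static void runQuery(String url, String day) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement stmt = conn.prepareStatement(SQL)) {
            stmt.setString(1, day);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // Column 1 is host, column 2 is the event count.
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        // Assumed ZooKeeper quorum and znode; replace with your cluster's values.
        String url = jdbcUrl("zk1.example.com,zk2.example.com", 2181, "/hbase");
        if (args.length == 0) {
            // Without a day argument, only show what would be run (no connection).
            System.out.println(url);
            System.out.println(SQL);
            return;
        }
        runQuery(url, args[0]);
    }
}
```

Run it with a day value (for example 2014-07-31) as the only argument to
execute the query; with no arguments it only prints the URL and the SQL.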

Language reference: http://phoenix.apache.org/language/index.html
Performance and comparison with Hive and Impala:
http://phoenix.apache.org/performance.html

Alex


On Thu, Jul 31, 2014 at 12:22 PM, Nitin Pawar <nitinpawar...@gmail.com>
wrote:

> Before you read the entire answer, I would advise you to wait for the Hive
> experts to answer.
>
> You are looking at the wrong system, then.
>
> Hive is more batch oriented; it gets closer to near real time with the
> ORC/Parquet file formats along with Tez and Stinger.
>
> You may want to design your system so that it combines batch processing
> with real stream processing and makes the resulting data available for
> reporting.
>
> I am not sure if anyone has done tests at sizes of 50TB.
> What's the size of your cluster? What is the cluster's capacity for
> running maps or reducers in parallel?
>
> I remember processing more than 150TB of data when RCFile was just
> released and Hive was at 0.7 or something like that. My cluster size was
> more than 800 nodes and I could run around 1600 maps. But we still needed
> many hours to consume the data because the queries were complex and
> were doing pattern matching and pattern recognition.
>
>
> On Thu, Jul 31, 2014 at 9:37 PM, Natarajan, Prabakaran 1. (NSN -
> IN/Bangalore) <prabakaran.1.natara...@nsn.com> wrote:
>
>>  Hi Nitin,
>>
>>
>>
>> I want queries to return within a second
>>
>>
>>
>> Hive table data size is 50TB – Snappy-compressed RCFile
>>
>>
>>
>>
>> *Thanks and Regards*
>>
>> Prabakaran.N  aka NP
>>
>> nsn, Bangalore
>>
>> *When "I" is replaced by "We" - even Illness becomes "Wellness"*
>>
>>
>>
>>
>>
>> *From:* ext Nitin Pawar [mailto:nitinpawar...@gmail.com]
>> *Sent:* Thursday, July 31, 2014 6:25 PM
>>
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Hadoop Realtime Queries
>>
>>
>>
>> I want quick response for SQL queries.
>>
>>
>>
>> How quick is quick for you?
>>
>> What's the data size?
>>
>> What kind of queries do you want to run?
>>
>> How often do you run the same query on the same dataset?
>>
>>
>>
>>
>>
>> On Thu, Jul 31, 2014 at 6:20 PM, Natarajan, Prabakaran 1. (NSN -
>> IN/Bangalore) <prabakaran.1.natara...@nsn.com> wrote:
>>
>> Hi,
>>
>>
>>
>> Thank you all for the reply.
>>
>>
>>
>> I want quick response for SQL queries.
>>
>>
>>
>> *Thanks and Regards*
>>
>> Prabakaran.N
>>
>>
>>
>> *From:* ext Bertrand Dechoux [mailto:decho...@gmail.com]
>> *Sent:* Thursday, July 31, 2014 1:28 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Hadoop Realtime Queries
>>
>>
>>
>> It all depends on the context and what is really meant by realtime.
>> Impala (and other competing alternatives) is not listed among the tools
>> you have tried.
>>
>> Maybe you should not focus only on batch frameworks to provide
>> realtime access? The results are not surprising.
>>
>>
>>  Bertrand Dechoux
>>
>>
>>
>> On Thu, Jul 31, 2014 at 9:38 AM, Kumar, Deepak8 <deepak8.ku...@citi.com>
>> wrote:
>>
>> Hi,
>>
>> As far as I know, real-time queries are only possible using HBase and
>> Cloudera Search. Hive is a batch process; it is not real time. So
>> instead of tuning different parameters, maybe you could look for a
>> different architecture design so that you can use HBase.
>>
>>
>>
>> Regards,
>>
>> Deepak
>>
>>
>>
>> *From:* Natarajan, Prabakaran 1. (NSN - IN/Bangalore) [mailto:
>> prabakaran.1.natara...@nsn.com]
>> *Sent:* Thursday, July 31, 2014 3:32 AM
>> *To:* user@hadoop.apache.org
>> *Subject:* Hadoop Realtime Queries
>>
>>
>>
>> Hi
>>
>>
>>
>> I want to perform realtime queries on HDFS data. I tried
>> Hadoop/YARN/Hive, Shark on Spark, Tez, etc.
>>
>> But I still couldn’t get subsecond performance on the large data that I
>> have.
>>
>> I understand Hadoop is not meant for this, but I still want to get as
>> close as possible.
>>
>>
>>
>> 1. How can we tune the RHEL OS for this?
>>
>> 2. How can we tune YARN?
>>
>> 3. Is there any stable framework like Tez that can perform
>> much better?
>>
>> 4. Is there any caching strategy that we can adopt?
>>
>> 5. Any articles related to this are welcome.
>>
>>
>>
>> Thanks in Advance
>>
>>
>>
>> Prabakaran.N
>>
>> --
>> Nitin Pawar
>>
>
>
>
> --
> Nitin Pawar
>
