Re: Hadoop Realtime Queries

Nitin Pawar Thu, 31 Jul 2014 09:23:35 -0700

Before you read the entire answer, I would advise you to also wait for the
Hive experts to answer.


You are looking at the wrong system then.

Hive is more batch oriented. It gets closer to near real time with the
ORC/Parquet file formats along with Tez and the Stinger initiative.

You may want to design your system so that it takes advantage of the
batch-oriented nature and merges it with real stream processing, making the
combined data available for reporting.
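
A rough illustration of that batch-plus-stream idea, as a minimal sketch
only. Nothing here comes from a real deployment: the names `batch_view`,
`stream_view`, `on_event`, `merged_count`, and `on_batch_complete`, and the
sample keys and counts, are hypothetical. The batch job precomputes
aggregates, the stream processor keeps a small delta for events that arrived
since the last batch run, and reporting reads the sum of the two.

```python
from collections import Counter

# Hypothetical result of the last batch (e.g. Hive) run: counts per key.
batch_view = Counter({"cell_a": 1_000_000, "cell_b": 250_000})

# Hypothetical in-memory counts for events that arrived after that run.
stream_view = Counter()


def on_event(key: str) -> None:
    """Called by the stream processor for every new event."""
    stream_view[key] += 1


def merged_count(key: str) -> int:
    """Answer a reporting query from the two small views, not the raw data."""
    return batch_view[key] + stream_view[key]


def on_batch_complete(new_batch_view: Counter) -> None:
    """Swap in a fresh batch view and reset the streaming delta."""
    global batch_view
    batch_view = new_batch_view
    stream_view.clear()


on_event("cell_a")
on_event("cell_c")
print(merged_count("cell_a"))  # 1000001
print(merged_count("cell_c"))  # 1
```

The point of the design is that the reporting query touches only small
precomputed views, so its latency does not depend on the size of the raw
table.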

I am not sure if anyone has done tests at a size of 50TB.
What's the size of your cluster? What is the cluster's capacity for running
maps or reducers in parallel?

I remember processing more than 150TB of data when RCFile was just
released and Hive was at 0.7 or something like that. My cluster had more
than 800 nodes and I could run around 1600 maps. But we still needed many
hours to consume the data, because the queries were complex and were doing
pattern matching and pattern recognition.
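
A back-of-envelope calculation shows why cluster capacity matters for the
one-second target. This is a sketch under stated assumptions: the 50 MB/s
read rate per map task is an assumed figure, not a measurement, and
`scan_seconds` is a hypothetical helper. Only the 50TB size and the 1600
parallel maps come from this thread.

```python
def scan_seconds(data_tb: float, parallel_tasks: int,
                 mb_per_sec_per_task: float) -> float:
    """Lower bound on the time to read data_tb once at the given parallelism."""
    data_mb = data_tb * 1024 * 1024  # TB -> MB
    return data_mb / (parallel_tasks * mb_per_sec_per_task)


# 50TB table, 1600 parallel maps, assumed 50 MB/s read rate per map.
secs = scan_seconds(50, 1600, 50.0)
print(f"{secs:.0f} s, about {secs / 60:.0f} min")  # 655 s, about 11 min
```

Even ignoring job startup and any actual computation, a full scan under
these assumptions takes minutes. A sub-second answer therefore has to come
from reading far less data per query rather than from tuning the scan.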


On Thu, Jul 31, 2014 at 9:37 PM, Natarajan, Prabakaran 1. (NSN -
IN/Bangalore) <prabakaran.1.natara...@nsn.com> wrote:

>  Hi Nitin,
>
>
>
> I want queries to return within a second
>
>
>
> Hive table DataSize is 50TB – Snappy RC file
>
>
>
>
> *Thanks and Regards*
>
> Prabakaran.N  aka NP
>
> nsn, Bangalore
>
> *When "I" is replaced by "We" - even Illness becomes "Wellness"*
>
>
>
>
>
> *From:* ext Nitin Pawar [mailto:nitinpawar...@gmail.com]
> *Sent:* Thursday, July 31, 2014 6:25 PM
>
> *To:* user@hadoop.apache.org
> *Subject:* Re: Hadoop Realtime Queries
>
>
>
> I want quick response for SQL queries .
>
>
>
> how quick is quick for you ?
>
> what's the data size?
>
> what kind of queries do you want to run?
>
> how often do you run the query on the same dataset again and
> again?
>
>
>
>
>
> On Thu, Jul 31, 2014 at 6:20 PM, Natarajan, Prabakaran 1. (NSN -
> IN/Bangalore) <prabakaran.1.natara...@nsn.com> wrote:
>
> Hi,
>
>
>
> Thank you all for the reply.
>
>
>
> I want quick response for SQL queries .
>
>
>
> *Thanks and Regards*
>
> Prabakaran.N
>
>
>
> *From:* ext Bertrand Dechoux [mailto:decho...@gmail.com]
> *Sent:* Thursday, July 31, 2014 1:28 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Hadoop Realtime Queries
>
>
>
> It all depends on the context and what is really meant by realtime. Impala
> (and other competing alternatives) are not listed among the tools you have
> tried.
>
> Maybe you should not focus only on batch frameworks for providing
> realtime access? The results are not surprising.
>
>
>  Bertrand Dechoux
>
>
>
> On Thu, Jul 31, 2014 at 9:38 AM, Kumar, Deepak8 <deepak8.ku...@citi.com>
> wrote:
>
> Hi,
>
> As far as I know, real time queries are only possible using HBase &
> Cloudera Search. Hive is a batch process; it is not real time. So
> instead of tuning different parameters, maybe you could look for a
> different architecture design so that you could use HBase.
>
>
>
> Regards,
>
> Deepak
>
>
>
> *From:* Natarajan, Prabakaran 1. (NSN - IN/Bangalore) [mailto:
> prabakaran.1.natara...@nsn.com]
> *Sent:* Thursday, July 31, 2014 3:32 AM
> *To:* user@hadoop.apache.org
> *Subject:* Hadoop Realtime Queries
>
>
>
> Hi
>
>
>
> I want to perform realtime queries on HDFS data. I tried
> hadoop/yarn/hive, shark on spark, Tez, etc.
>
> But I still couldn’t get subsecond performance on the large data that I
> have.
>
> I understand hadoop is not meant for this, but I still want to get as
> close as possible.
>
>
>
> 1.       How can we tune RHEL OS for this?
>
> 2.       How can we tune yarn?
>
> 3.       Is there any stable framework like Tez which can perform much
> better?
>
> 4.       Is there any caching strategy that we can adopt?
>
> 5.       Any articles related to this are welcome
>
>
>
> Thanks in Advance
>
>
>
> Prabakaran.N
>
>
>
>
>
>
>
>
>
>
>
>
>
> --
> Nitin Pawar
>



-- 
Nitin Pawar
