Re: Hive and Impala

Jörn Franke Wed, 02 Mar 2016 10:34:55 -0800

It always depends on what you want to do, so from experience I cannot 
agree with your comment. Do you have any reasoning behind this statement?


> On 02 Mar 2016, at 19:14, Dayong <will...@gmail.com> wrote:
> 
> Tez is kind of outdated and ORC is tied closely to Hive. In addition, the 
> Hive metastore can be decoupled from Hive as well. In reality, we do suffer 
> from Hive's performance even for ETL jobs. As a result, we'll switch to 
> Impala + Spark/Flink.
> 
> Thanks,
> Dayong
> 
>> On Mar 2, 2016, at 10:35 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
>> wrote:
>> 
>> I forgot: besides LLAP, you are going to have Hive Hybrid Procedural SQL On 
>> Hadoop (HPL/SQL), which is going to add another dimension to Hive.
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>>> On 2 March 2016 at 15:30, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> SQL plays an increasingly important role on Hadoop. As of today Hive IMO 
>>> provides the best and most robust solution for anything resembling a Data 
>>> Warehouse "solution" on Hadoop, chiefly by means of its powerful metastore, 
>>> which can be hosted on a variety of mission-critical databases, plus Hive's 
>>> ever-increasing support for a variety of file types on HDFS, from humble 
>>> textfile to ORC. The remaining tools are little more than query tools that 
>>> crucially rely on the Hive Metastore for their needs. Take away the Hive 
>>> component and they are more or less lame ducks.
>>> 
>>> Hive on MR was perceived to be slow, but what the heck, we are talking 
>>> about a Data Warehouse here, which for the most part should be batch 
>>> oriented and not user-facing. In Hive 0.14 and 2.0 you can use 
>>> Spark and Tez as the execution engine, and if you are well into functional 
>>> programming, you can deploy Spark on Hive. If you look around, from Impala 
>>> to Spark, the architecture is essentially that of a query tool.
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
>>>> On 2 March 2016 at 13:52, Dayong <will...@gmail.com> wrote:
>>>> As I remember from the Hadoop weekly news feed a few weeks ago, Cloudera 
>>>> has a benchmark showing Impala is a little better than Spark SQL and Hive 
>>>> with Tez. You can check that. From my experience, Hive is still the 
>>>> leading tool for regular ETL jobs since it is stable. The other tools are 
>>>> better for ad hoc and interactive query use cases. Cloudera is betting on 
>>>> Impala, especially with its new Kudu project.
>>>> 
>>>> Thanks,
>>>> Dayong
>>>> 
>>>>> On Mar 1, 2016, at 5:14 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>> 
>>>>> My knocks on Impala. (not intended to be a post knocking Impala)
>>>>> 
>>>>> Impala really has not delivered on the complex types that hive has (after 
>>>>> promising it for quite a while), also it only works with the 'blessed' 
>>>>> input formats, parquet, avro, text.
>>>>> 
>>>>> It is very annoying to work with Impala. In my version, if you create a 
>>>>> partition in Hive, Impala does not see it. You have to run "refresh".
>>>>> 
>>>>> In Impala I do not have all the UDFs that Hive has, like percentile, etc.
>>>>> 
>>>>> Impala is fast. Many data-analyst / data-scientist types can't wait 
>>>>> 10 seconds for a query, so when I need to produce something for them I 
>>>>> make sure the data has no complex types and uses a table type that Impala 
>>>>> understands.
>>>>> 
>>>>> But for my work I still work primarily in Hive, because I do not want to 
>>>>> deal with all the things that Impala does not have or might have, and 
>>>>> when I need something special like my own UDFs it is easier to whip up 
>>>>> the solution in Hive.
>>>>> 
>>>>> Having worked with M$ SQL Server and Vertica, I find Impala on par with 
>>>>> them, but I don't think of it the way I think of Hive. To me it just feels 
>>>>> like a Vertica that I can cheat loading sometimes because it is backed by 
>>>>> HDFS.
>>>>> 
>>>>> Hive is something different: I am making pipelines, I am transforming 
>>>>> data, doing streaming, writing custom UDFs, querying JSON directly. It is 
>>>>> != Impala.
>>>>> 
>>>>> ::random message of the day::
>>>>> 
>>>>> 
>>>>>  
>>>>> 
>>>>>> On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>>>>> 
>>>>>> Dr Mich,
>>>>>> 
>>>>>> My two cents here.
>>>>>> 
>>>>>> I don't have direct experience of Impala, but in my humble opinion I 
>>>>>> share your view that Hive provides the best metastore of all Big Data 
>>>>>> systems. Looking around, almost every product in one form or another 
>>>>>> uses Hive code somewhere. My colleagues inform me that Hive is one of 
>>>>>> the most stable Big Data products.
>>>>>> 
>>>>>> With the capabilities of Spark on Hive and Hive on Spark or Tez plus of 
>>>>>> course MR, there is really little need for many other products in the 
>>>>>> same space. It is good to keep things simple.
>>>>>> 
>>>>>> Warmest
>>>>>> 
>>>>>> 
>>>>>> On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh 
>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> I do not hear about Impala anymore. I saw an article on LinkedIn titled
>>>>>> 
>>>>>> "Apache Hive Or Cloudera Impala? What is Best for me?"
>>>>>> 
>>>>>> "We can access all objects from Hive data warehouse with HiveQL which 
>>>>>> leverages the map-reduce architecture in background for data retrieval 
>>>>>> and transformation and this results in latency."
>>>>>> 
>>>>>> My response was
>>>>>> 
>>>>>> This statement is no longer valid as you now have a choice of three 
>>>>>> engines: MR, Spark and Tez. I have not used Impala myself as I don't 
>>>>>> think there is a need for it, with Hive on Spark or Spark using the Hive 
>>>>>> metastore providing whatever is needed. Hive is for Data Warehouse and 
>>>>>> provides what it says on the tin. Please also bear in mind that Hive 
>>>>>> offers ORC storage files that provide storage index capabilities, 
>>>>>> further optimizing queries with additional stats at file, stripe and 
>>>>>> row group levels.
>>>>>> 
>>>>>> Anyway, the question is: with Hive on Spark or Spark using the Hive 
>>>>>> metastore, what can we not achieve that we can achieve with Impala?
>>>>>> 
>>>>>> 
>>>>>> Dr Mich Talebzadeh
>>>>>>  
>>>>>> LinkedIn  
>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>  
>>>>>> http://talebzadehmich.wordpress.com
>>
