Re: Hive and Impala

Jörn Franke Wed, 02 Mar 2016 06:38:18 -0800

I think you can always construct a benchmark that produces whatever result
you want. You always have to look at what is actually being evaluated, and in
general I recommend trying it yourself with your own data and your own queries.


There is also a lot of change within the projects. Impala may have Kudu, but
Hive has ORC, Tez and Spark in combination with LLAP.

As I said, I always recommend understanding and trying out the different
technologies.
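
A minimal sketch of such a try-it-yourself comparison in Python. The
`run_query` callables are hypothetical placeholders that you would wire up to
your own Hive, Impala or Spark SQL client; nothing here is specific to any of
these engines, and the median of a few repeats is just one reasonable choice
of summary.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[str], object], sql: str,
               repeats: int = 5) -> List[float]:
    """Run `sql` `repeats` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)  # hypothetical: executes the query on one engine
        durations.append(time.perf_counter() - start)
    return durations


def compare_engines(engines: Dict[str, Callable[[str], object]],
                    queries: List[str],
                    repeats: int = 5) -> Dict[str, Dict[str, float]]:
    """Return the median duration per engine and per query."""
    results = {}
    for name, run_query in engines.items():
        results[name] = {
            sql: statistics.median(time_query(run_query, sql, repeats))
            for sql in queries
        }
    return results
```

Pass in your own queries against your own tables; the numbers only mean
something for the workload you actually run.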

> On 02 Mar 2016, at 14:52, Dayong <will...@gmail.com> wrote:
> 
> As I remember, a few weeks ago in the Hadoop Weekly news feed, Cloudera had
> a benchmark showing Impala is a little better than Spark SQL and Hive with
> Tez. You can check that. From my experience, Hive is still the leading tool
> for regular ETL jobs since it is stable. The other tools are better for ad
> hoc and interactive query use cases. Cloudera is betting on Impala,
> especially with its new Kudu project.
> 
> Thanks,
> Dayong
> 
>> On Mar 1, 2016, at 5:14 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>> 
>> My knocks on Impala (not intended to be a post knocking Impala).
>> 
>> Impala really has not delivered on the complex types that Hive has (after
>> promising them for quite a while); also, it only works with the 'blessed'
>> input formats: Parquet, Avro, text.
>> 
>> It is very annoying to work with Impala. In my version, if you create a
>> partition in Hive, Impala does not see it. You have to run "refresh".
>> 
>> In Impala I do not have all the UDFs that Hive has, like percentile, etc.
>> 
>> Impala is fast. Many data-analyst / data-scientist types can't wait 10
>> seconds for a query, so when I need to produce something for them I make
>> sure the data has no complex types and uses a table type that Impala
>> understands.
>> 
>> But for my work I still work primarily in Hive, because I do not want to
>> deal with all the things that Impala does not have or might have, and when
>> I need something special like my own UDFs it is easier to whip up the
>> solution in Hive.
>> 
>> Having worked with M$ SQL Server and Vertica, I find Impala is on par with
>> them, but I don't think of it like I think of Hive. To me it just feels
>> like a Vertica that I can cheat loading sometimes because it is backed by
>> HDFS.
>> 
>> Hive is something different: I am making pipelines, I am transforming data,
>> doing streaming, writing custom UDFs, querying JSON directly. It is !=
>> Impala.
>> 
>>> On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>> 
>>> Dr Mich,
>>> 
>>> My two cents here.
>>> 
>>> I don't have direct experience of Impala, but in my humble opinion I share
>>> your view that Hive provides the best metastore of all Big Data systems.
>>> Looking around, almost every product in one shape or form uses Hive code
>>> somewhere. My colleagues inform me that Hive is one of the most stable Big
>>> Data products.
>>> 
>>> With the capabilities of Spark on Hive and Hive on Spark or Tez plus of 
>>> course MR, there is really little need for many other products in the same 
>>> space. It is good to keep things simple.
>>> 
>>> Warmest
>>> 
>>> 
>>> On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh 
>>> <mich.talebza...@gmail.com> wrote:
>>> 
>>> 
>>> I have not heard much about Impala lately. I saw an article on LinkedIn
>>> titled
>>> 
>>> "Apache Hive Or Cloudera Impala? What is Best for me?"
>>> 
>>> "We can access all objects from Hive data warehouse with HiveQL which 
>>> leverages the map-reduce architecture in background for data retrieval and 
>>> transformation and this results in latency."
>>> 
>>> My response was
>>> 
>>> This statement is no longer valid, as you now have a choice of three
>>> engines: MR, Spark and Tez. I have not used Impala myself, as I don't
>>> think there is a need for it, with Hive on Spark or Spark using the Hive
>>> metastore providing whatever is needed. Hive is for data warehousing and
>>> does what it says on the tin. Please also bear in mind that Hive offers
>>> the ORC storage format, which provides storage index capabilities that
>>> further optimize queries with additional stats at file, stripe and row
>>> group levels.
>>> 
>>> Anyway, the question is: with Hive on Spark or Spark using the Hive
>>> metastore, what can we not achieve that we can achieve with Impala?
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>
