Re: Hive parquet vs Vertica vs Impala

Shashidhar Rao Sun, 04 Jan 2015 01:41:05 -0800

Thanks, Edward. You summed it all up in your reply.

On Sun, Jan 4, 2015 at 4:15 AM, Edward Capriolo <[email protected]>
wrote:


> "how does vertica compare with Hive in similar settings"
> Vertica and Hive do not have similar settings. Vertica is a columnar MPP
> analytic database. Hive is a SQL-on-Hadoop platform. Depending on your
> usage patterns you can use them interchangeably for some workloads, but
> not all.
>
> Vertica can do low-latency (< 1 second) select queries based on
> "PROJECTIONS", which are something like materialized views. Hive is a batch
> processing system and is not built for low-latency queries.
>
> Hive use cases typically chop up very large datasets and produce other
> large data sets, or smaller datasets which are put into online systems for
> low latency analysis.
>
> Even if you use one of the columnar-style formats like Parquet and the
> faster execution engines, you are not going to get the type of experience
> you get from Vertica.
>
> Compare Vertica to MySQL: they both do queries, but Vertica does not do
> primary key enforcement and is non-optimal for small row-at-a-time
> insertions. For certain datasets MySQL will query faster; for others
> Vertica will query faster.
>
> Basically, Hive and Vertica are very different even though they are both
> considered data warehouses.
>
> On Sat, Jan 3, 2015 at 4:30 PM, Shashidhar Rao <[email protected]
> > wrote:
>
>> Sorry Edward, I mentioned that I didn't have access to Vertica, but yes,
>> I was given the Vertica query retrieval time. The query is of the kind
>> select a.x, b.y from t as a, t1 as b where a.id = b.id etc., and the
>> schema for the tables required for the join was given.
>> A subset of the data was given and I just need to find an open source
>> framework like Hive and simulate it by storing the data and executing the
>> query. Yes, you are right in saying that it is impossible and I do agree,
>> but unfortunately I have to take on this difficult task.
>>
>> The hard part is that Hive will not hold the same number of rows as
>> Vertica, since it is only a subset, but it seems the query time will be
>> derived from the rows in Hive compared to the rows in Vertica through
>> some calculation.
>>
>> I only wanted to know how Vertica compares with Hive in similar
>> settings, in case someone has done some benchmarking; it need not be my
>> case. Links to some blogs would certainly help.
>>
>> But thanks anyway for your reply
>>
>> Thanks
>> shashi
>>
>> On Sun, Jan 4, 2015 at 2:21 AM, Edward Capriolo <[email protected]>
>> wrote:
>>
>>> Shashi,
>>>
>>> Your questions are too broad, and you are asking questions that are
>>> impossible to answer.
>>> Q. "What is faster X or Y?".
>>> A. "This depends on countless variables and can not be answered."
>>>
>>> For example, even databases that are very similar in nature, like
>>> MySQL and Postgres, might execute a query differently based on their
>>> query planners or even the characteristics of the data.
>>>
>>> How can you show that a query is "faster than Vertica" if you do not
>>> have access to Vertica to prove it?
>>>
>>> I understand some of what you are trying to determine, but you should
>>> really attempt to install these things and build a prototype to determine
>>> what is the best fit for your application. This will grow your
>>> understanding of the systems, help you ask better questions, and
>>> potentially give you the ability to answer those questions yourself and
>>> make better decisions.
>>>
>>> The right way to ask this question might be "Hello, I have loaded
>>> 50 million rows of data into Hive and I am running this query 'select X,
>>> from bla bla'. My Vertica instance runs this query in X seconds and Hive
>>> runs it in Y seconds. Can this be optimized further?"
>>>
>>> The software license for Impala is included here:
>>> https://github.com/cloudera/Impala/blob/master/LICENSE.txt
>>>
>>> Edward
>>>
>>>
>>> On Sat, Jan 3, 2015 at 3:29 PM, Shashidhar Rao <
>>> [email protected]> wrote:
>>>
>>>> Edward,
>>>>
>>>> Thanks for your reply.
>>>> Can you please tell me how the query performance of Hive with Parquet
>>>> compares against Vertica? Can Hive with Parquet match Vertica's
>>>> retrieval performance, given that I have been told Vertica also uses a
>>>> compressed columnar format and is fast? If I query against some 50
>>>> million rows, which one will be faster?
>>>>
>>>> Moreover, is Impala open source? In some blogs I have seen Impala
>>>> described as open source, but in others as a Cloudera proprietary engine.
>>>>
>>>> Ultimately, I want to use Hive with Parquet but need to show that it is
>>>> better than Vertica; a few microseconds here and there would be fine. I
>>>> don't have access to Vertica.
>>>>
>>>> Thanks
>>>> shashi
>>>>
>>>> On Sun, Jan 4, 2015 at 1:07 AM, Edward Capriolo <[email protected]>
>>>> wrote:
>>>>
>>>>> Of these, Hive is the only system that can store and query XML
>>>>> directly, with the help of different SerDes or input formats.
>>>>>
>>>>> Impala and Vertica have more standard schema systems that do not
>>>>> support collections like List, Map, Struct, or the nested collections
>>>>> you might need to store and process a complex XML document.
>>>>>
>>>>> Parquet (a storage format that works with Hive and Impala) can support
>>>>> List, Map, and Struct, but the Impala engine cannot access these at the
>>>>> moment. Last I checked, Impala refuses to read tables that have one of
>>>>> these elements (instead of skipping them).
>>>>>
>>>>> It sounds like you want to do one of a few things:
>>>>> 1) Normalize your XML into a table, and then you can use Vertica, Hive,
>>>>> or Impala.
>>>>> 2) Write your data using Parquet (to handle nested objects) and use
>>>>> Hive to query it. (Hopefully, when Impala adds collection support, you
>>>>> can switch over.)
>>>>>
>>>>> But mostly you need to do more research.
>>>>>
>>>>> Edward
>>>>>
>>>>> On Sat, Jan 3, 2015 at 2:15 PM, Shashidhar Rao <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Can someone help me with insights into a comparison of Hive with
>>>>>> Parquet vs Vertica?
>>>>>>
>>>>>> I need to store large XML data in one of these databases, so please
>>>>>> help me with query performance.
>>>>>>
>>>>>> Is Impala open source, and can we use it without a Cloudera license?
>>>>>>
>>>>>> Thanks
>>>>>> Shashi
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
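
Editor's illustration of option 1 in the quoted thread (normalizing XML into a table): a minimal sketch, assuming a small made-up document. The element and attribute names (orders, order, item, sku, qty) are hypothetical and do not come from the thread. It flattens nested elements into one row per item and emits tab-separated text, which is one delimited form a Hive text-format table could be defined over.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; all element and attribute names are made up.
SAMPLE_XML = """
<orders>
  <order id="1" customer="alice">
    <item sku="A1" qty="2"/>
    <item sku="B7" qty="1"/>
  </order>
  <order id="2" customer="bob">
    <item sku="A1" qty="5"/>
  </order>
</orders>
"""


def flatten_orders(xml_text):
    """Return one flat row per <item>, repeating the parent order's fields."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for order in root.findall("order"):
        for item in order.findall("item"):
            rows.append((
                int(order.get("id")),
                order.get("customer"),
                item.get("sku"),
                int(item.get("qty")),
            ))
    return rows


def to_tsv(rows):
    """Render rows as tab-separated lines, one row per line."""
    return "\n".join("\t".join(str(value) for value in row) for row in rows)


rows = flatten_orders(SAMPLE_XML)
print(to_tsv(rows))
```

Repeating the parent fields on every child row is the usual cost of normalizing a nested document into a single flat table; keeping the nesting instead corresponds to option 2 in the thread.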

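Editor's illustration of the measurement advice in the quoted thread (load the data, run the query, report the seconds): a minimal sketch of a timing harness. It uses Python's built-in sqlite3 as a stand-in engine, which is an assumption made only so the example is runnable; it says nothing about Hive or Vertica timings. The table names t and t1 and the join follow the query shape quoted in the thread; the helper name time_query and the row counts are hypothetical.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql, runs=5):
    """Run sql `runs` times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch all rows so the full result is produced
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in engine: in-memory SQLite, used only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, x TEXT)")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, y TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, f"x{i}") for i in range(1000)])
conn.executemany("INSERT INTO t1 VALUES (?, ?)",
                 [(i, f"y{i}") for i in range(0, 1000, 2)])

# Same shape as the join quoted in the thread.
JOIN_SQL = "SELECT a.x, b.y FROM t AS a, t1 AS b WHERE a.id = b.id"

row_count = len(conn.execute(JOIN_SQL).fetchall())
median_seconds = time_query(conn, JOIN_SQL)
print(f"{row_count} rows, median {median_seconds:.6f} s over 5 runs")
```

Reporting the row count and a median over several runs next to the query text gives others the concrete numbers the thread asks for. Timings taken on a subset of the data do not necessarily extrapolate to the full dataset.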