Re: Hive parquet vs Vertica vs Impala

2015-01-04 Thread Shashidhar Rao
Thanks Edward, you summed it all up in your reply.

Re: Hive parquet vs Vertica vs Impala

2015-01-03 Thread Edward Capriolo
"how does vertica compare with Hive in similar settings"
Vertica and Hive do not have similar settings. Vertica is a columnar MPP
analytic database. Hive is a SQL-on-Hadoop platform. Depending on your
usage patterns you can use them interchangeably for some workloads, but
not all.

Vertica can do low-latency (< 1 second) select queries based on
"PROJECTIONS", which are something like materialized views. Hive is a batch
processing system and not built for low-latency queries.

Hive use cases typically chop up very large datasets and produce other
large data sets, or smaller datasets which are put into online systems for
low latency analysis.
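
A minimal sketch of that batch-then-serve pattern, under loud assumptions: an in-memory SQLite database stands in for both the batch system and the online store, and the table names `raw_events` and `daily_totals` and the sample rows are hypothetical. It shows only the shape of the workflow, not how Hive or Vertica execute it.

```python
import sqlite3


def run_batch_job(conn):
    """Batch step: scan the large raw table once and write a small summary table."""
    conn.execute("DROP TABLE IF EXISTS daily_totals")
    conn.execute(
        "CREATE TABLE daily_totals AS "
        "SELECT day, SUM(amount) AS total FROM raw_events GROUP BY day"
    )
    conn.execute("CREATE INDEX idx_daily_totals_day ON daily_totals (day)")
    conn.commit()


def lookup_total(conn, day):
    """Online step: a cheap indexed lookup against the precomputed summary."""
    row = conn.execute(
        "SELECT total FROM daily_totals WHERE day = ?", (day,)
    ).fetchone()
    return row[0] if row else None


# Hypothetical raw data; in practice this would be the very large dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (day TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("2015-01-03", 10), ("2015-01-03", 5), ("2015-01-04", 7)],
)
run_batch_job(conn)
print(lookup_total(conn, "2015-01-03"))
```

The expensive scan happens once in `run_batch_job`; the per-request work in `lookup_total` touches only the small summary table.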

Even if you use one of the columnar-style formats like Parquet and the
faster execution engines, you are not going to get the type of experience
you get from Vertica.

Compare Vertica to MySQL. They both do queries, but Vertica does not do
primary key enforcement and is non-optimal for small row-at-a-time
insertions. On certain datasets MySQL will query faster; on others Vertica
will.

Basically Hive and Vertica are very different even though they are both
considered data warehouses.

Re: Hive parquet vs Vertica vs Impala

2015-01-03 Thread Shashidhar Rao
Sorry Edward, I mentioned that I didn't have access to Vertica, but yes, I
was given the Vertica query retrieval time. The query is of the kind
select a.x, b.y from t as a, t1 as b where a.id = b.id etc., and the
schemas for the tables required for the join were given.
A subset of the data was given and I just need to find an open source
framework like Hive and simulate it by storing the data and executing
the query.
Yes, you are right in saying that it is impossible and I do agree, but
unfortunately I still have to take on this difficult task.

The hard part is that Hive will not hold the same number of rows as
Vertica, since it is only a subset, but it seems the query time will be
derived from the number of rows in Hive compared to the number of rows in
Vertica through some calculation.

I only wanted to know how Vertica compares with Hive in similar settings,
in case someone has done some benchmarking; it need not be my case. Some
links to some blogs would certainly help.

But thanks anyway for your reply

Thanks
shashi

Re: Hive parquet vs Vertica vs Impala

2015-01-03 Thread Edward Capriolo
Shashi,

Your questions are too broad, and you are asking questions that are
impossible to answer.
Q. "What is faster, X or Y?"
A. "This depends on countless variables and cannot be answered."

For example, even databases that are very similar in nature, like
MySQL and Postgres, might execute a query differently based on their query
planners or even the characteristics of the data.

How can you show that a query is "faster than Vertica" if you do not have
access to Vertica to prove it?

I understand some of what you are trying to determine, but you should
really attempt to install these things and build a prototype to determine
what is the best fit for your application. This will grow your
understanding of the systems, help you ask better questions, and
potentially give you the ability to answer those questions yourself and
make better decisions.

The right way to ask this question might be: "Hello, I have loaded 50 million
rows of data into Hive and I am running this query: 'select X from bla
bla'. My Vertica instance runs this query in X seconds and Hive runs it
in Y seconds. Can this be optimized further?"
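
A minimal sketch of how such a measurement could be collected, assuming an in-memory SQLite database as a stand-in for the real engine (it says nothing about Hive or Vertica timings). The helper names `build_sample_db` and `time_query`, the row count, and the column values are hypothetical; the join has the shape of the query mentioned elsewhere in this thread.

```python
import sqlite3
import statistics
import time


def build_sample_db(n_rows):
    """Create an in-memory database with two joinable tables, t and t1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, x INTEGER)")
    conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, y INTEGER)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)", ((i, i * 2) for i in range(n_rows))
    )
    conn.executemany(
        "INSERT INTO t1 VALUES (?, ?)", ((i, i * 3) for i in range(n_rows))
    )
    conn.commit()
    return conn


def time_query(conn, sql, repeats=5):
    """Run sql several times; return (row_count, median_seconds)."""
    timings = []
    row_count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        row_count = len(conn.execute(sql).fetchall())
        timings.append(time.perf_counter() - start)
    return row_count, statistics.median(timings)


JOIN_SQL = "SELECT a.x, b.y FROM t AS a JOIN t1 AS b ON a.id = b.id"

if __name__ == "__main__":
    conn = build_sample_db(100_000)
    rows, seconds = time_query(conn, JOIN_SQL)
    print(f"{rows} rows, median {seconds:.4f} s")
```

Reporting the row count, the exact query, and a median over several runs gives others the concrete numbers needed to suggest optimizations.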

The software license for Impala is included here:
https://github.com/cloudera/Impala/blob/master/LICENSE.txt

Edward


Re: Hive parquet vs Vertica vs Impala

2015-01-03 Thread Shashidhar Rao
Edward,

Thanks for your reply.
Can you please tell me the query performance of Hive-Parquet against
Vertica? Can Hive-Parquet match Vertica's retrieval performance, as I
have been told Vertica also uses a compressed columnar format and is fast?
What if I query against some 50 million rows, which one will be
faster?

And moreover, is Impala open source? In some blogs I have seen Impala
described as open source, but in others as a Cloudera proprietary engine.

Ultimately, I want to use Hive-Parquet but need to show that it is better
than Vertica; a few microseconds here and there would be fine. I don't have
access to Vertica.

Thanks
shashi

Re: Hive parquet vs Vertica vs Impala

2015-01-03 Thread Edward Capriolo
Hive is the only system that can store and query XML directly, with the
help of different SerDes or input formats.

Impala and Vertica have more standard schema systems that do not support
collections like List, Map, Struct or the nested collections you might need
to store and process a complex XML document.

Parquet (a storage format that works with Hive and Impala) can support
List, Map and Struct, but the Impala engine cannot access these at the
moment. Last I checked, Impala refuses to read tables that have one of these
elements (instead of skipping them).

It sounds like you want to do one of a few things:
1) Normalize your XML into a table, and then you can use Vertica, Hive, or
Impala.
2) Write your data using Parquet (to handle nested objects) and use
Hive to query it. (Hopefully, when Impala adds collection support, you
can switch over.)
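
As a rough illustration of option 1, here is a minimal sketch that flattens a nested XML document into rows using Python's standard `xml.etree.ElementTree` module. The `orders`/`order`/`item` layout, the attribute names, and the `flatten_orders` helper are all hypothetical; a real document would need its own mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real data will have a different layout.
SAMPLE_XML = """
<orders>
  <order id="1" customer="alice">
    <item sku="A1" qty="2"/>
    <item sku="B7" qty="1"/>
  </order>
  <order id="2" customer="bob">
    <item sku="A1" qty="5"/>
  </order>
</orders>
"""


def flatten_orders(xml_text):
    """Return one flat row per <item>, repeating the parent order's columns."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for order in root.findall("order"):
        for item in order.findall("item"):
            rows.append(
                (
                    int(order.get("id")),
                    order.get("customer"),
                    item.get("sku"),
                    int(item.get("qty")),
                )
            )
    return rows


for row in flatten_orders(SAMPLE_XML):
    print(row)
```

Each tuple has only scalar columns, so rows like these can be loaded into a plain table with no collection types.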

But mostly you need to do more research.

Edward

Hive parquet vs Vertica vs Impala

2015-01-03 Thread Shashidhar Rao
Hi,

Can someone help me with insights into a Hive with Parquet vs Vertica
comparison?

I need to store large XML data in one of these databases, so please help me
with query performance.

Is Impala open source, and can we use it without a Cloudera license?

Thanks
Shashi