Re: Difference between RC file format & Parquet file format

2016-02-18 Thread Koert Kuipers
ORC was created inside hive instead of (as it should have been, i think) as
a file format library that hive, and other frameworks as well, can depend on.
it seems to be part of hive's annoying tendency not to think of itself as a
java library.

On Thu, Feb 18, 2016 at 2:38 AM, Abhishek Dubey 
wrote:

> I think it's fair to say that one of the main differences is the
> representation of nesting structure.
>
>
>
> *Parquet* uses Dremel's repetition and definition levels, which is an
> extremely efficient representation of nested structure that has the
>
> added benefit of being easy to embed into the column data itself;
>
>
>
> Julien wrote an excellent blog post that explains the details:
> https://blog.twitter.com/2013/dremel-made-simple-with-parquet
>
>
>
> *Orcfile* on the other hand uses separate "counter" columns, which means
> that for nested structures you need to read those counter columns in
>
> addition to the data columns you care about in order to recreate the
> nesting structure; this increases the required amount of random I/O.
>
>
>
> Also, Parquet is natively supported in a number of popular Hadoop
> frameworks: Pig, Impala, Hive, MR, Cascading.
>
>
>
> Source : https://groups.google.com/forum/#!topic/parquet-dev/0IdtSLdIINQ
>
>
>
>
>
> *Thanks & Regards,*
> *Abhishek Dubey*
>
>
>
> *From:* Ravi Prasad [mailto:raviprasa...@gmail.com]
> *Sent:* Thursday, February 18, 2016 9:06 AM
> *To:* user@hive.apache.org
> *Subject:* Difference between RC file format & Parquet file format
>
>
>
> Hi all,
>
>   Can you please let me know,
>
> How the RC file format is different from the Parquet file format.
>
> Both are column oriented file format, then what are the difference.
>
>
> --
>
> --
> Regards,
> RAVI PRASAD. T
>
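
To make the nested-column point above concrete, here is a minimal sketch (not
from the original thread) that writes a nested record as Parquet and reads one
nested column back on its own. It assumes a Spark 1.5 shell with sqlContext in
scope; the path and field names are made up.

// nested record: the nesting is encoded into the parquet column data itself
case class Phone(kind: String, number: String)
case class Contact(name: String, phones: Seq[Phone])

import sqlContext.implicits._

val contacts = Seq(
  Contact("alice", Seq(Phone("home", "555-1111"), Phone("work", "555-2222"))),
  Contact("bob", Seq.empty)
).toDF()

contacts.write.parquet("/tmp/contacts_parquet")

// only the columns asked for are read; the nesting is reconstructed from the
// repetition/definition levels stored with them
sqlContext.read.parquet("/tmp/contacts_parquet")
  .select("name", "phones.number")
  .show()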


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Koert Kuipers
fair enough

On Thu, Feb 4, 2016 at 12:41 PM, Edward Capriolo 
wrote:

> Hive is not the correct tool for every problem. Use the tool that makes
> the most sense for your problem and your experience.
>
> Many people like hive because it is generally applicable. In my case study
> for the hive book I highlighted many smart, capable organizations that use hive.
>
> Your argument is totally valid. You like X better because X works for you.
> You don't need to 'preach' here; we all know hive has its limits.
>
> On Thu, Feb 4, 2016 at 10:55 AM, Koert Kuipers  wrote:
>
>> Is the sky the limit? I know udfs can be used inside hive, like lambdas
>> basically i assume, and i will assume you have something similar for
>> aggregations. But that's just abstractions inside a single map or reduce
>> phase, pretty low level stuff. What you really need is abstractions around
>> many map and reduce phases, because that is the level an algo is expressed
>> at.
>>
>> For example when doing logistic regression you want to be able to do
>> something like:
>> read("somefile").train(settings).write("model")
>> Here train is an externally defined method that is well tested and could
>> do many map and reduce steps internally (or even be defined at a higher
>> level and compile into those steps). What is the equivalent in hive? Copy
>> pasting crucial parts of the algo around while using udfs is just not the
>> same thing in terms of reusability and abstraction. Its the opposite of
>> keeping it DRY.
>> On Feb 3, 2016 1:06 AM, "Ryan Harris" 
>> wrote:
>>
>>> https://github.com/myui/hivemall
>>>
>>>
>>>
>>> as long as you are comfortable with java UDFs, the sky is really the
>>> limit...it's not for everyone and spark does have many advantages, but they
>>> are two tools that can complement each other in numerous ways.
>>>
>>>
>>>
>>> I don't know that there is necessarily a universal "better" for how to
>>> use spark as an execution engine (or if spark is necessarily the **best**
>>> execution engine for any given hive job).
>>>
>>>
>>>
>>> The reality is that once you start factoring in the numerous tuning
>>> parameters of the systems and jobs there probably isn't a clear answer.
>>> For some queries, the Catalyst optimizer may do a better job...is it going
>>> to do a better job with ORC based data? less likely IMO.
>>>
>>>
>>>
>>> *From:* Koert Kuipers [mailto:ko...@tresata.com]
>>> *Sent:* Tuesday, February 02, 2016 9:50 PM
>>> *To:* user@hive.apache.org
>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>
>>>
>>>
>>> yeah but have you ever seen someone write a real analytical program in
>>> hive? how? where are the basic abstractions to wrap up a large amount of
>>> operations (joins, groupby's) into a single function call? where are the
>>> tools to write nice unit test for that?
>>>
>>> for example in spark i can write a DataFrame => DataFrame that
>>> internally does many joins, groupBys and complex operations. all unit
>>> tested and perfectly re-usable. and in hive? copy paste round sql queries?
>>> thats just dangerous.
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
>>> wrote:
>>>
>>> Hive has numerous extension points, you are not boxed in by a long shot.
>>>
>>>
>>>
>>> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>>>
>>> uuuhm with spark using Hive metastore you actually have a real
>>> programming environment and you can write real functions, versus just being
>>> boxed into some version of sql and limited udfs?
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>>>
>>> When comparing the performance, you need to do it apple vs apple. In
>>> another thread, you mentioned that Hive on Spark is much slower than Spark
>>> SQL. However, you configured Hive such that only two tasks can run in
>>> parallel. However, you didn't provide information on how much Spark SQL is
>>> utilizing. Thus, it's hard to tell whether it's just a configuration
>>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>>> see the resource usage in YARN resource manage URL.
>>>
>>> --Xuefu
>>>
>>>

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Koert Kuipers
Is the sky the limit? I know udfs can be used inside hive, like lambdas
basically i assume, and i will assume you have something similar for
aggregations. But that's just abstractions inside a single map or reduce
phase, pretty low level stuff. What you really need is abstractions around
many map and reduce phases, because that is the level an algo is expressed
at.

For example when doing logistic regression you want to be able to do
something like:
read("somefile").train(settings).write("model")
Here train is an externally defined method that is well tested and could do
many map and reduce steps internally (or even be defined at a higher level
and compile into those steps). What is the equivalent in hive? Copy pasting
crucial parts of the algo around while using udfs is just not the same
thing in terms of reusability and abstraction. It's the opposite of keeping
it DRY.
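
As an illustration of the read(...).train(...).write(...) style described
above, here is a hedged sketch of such an externally defined, reusable step in
Spark. The train function, Settings, column names and paths are all invented
for the example; it assumes a shell with sqlContext in scope.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, avg, col}

case class Settings(minEvents: Int)

// one well-tested call that internally expands into several shuffle
// (map/reduce) phases: aggregate, filter, join, aggregate again
def train(settings: Settings)(events: DataFrame, users: DataFrame): DataFrame =
  events
    .groupBy("userId").agg(count("userId").as("events"))
    .filter(col("events") >= settings.minEvents)
    .join(users, "userId")
    .groupBy("segment").agg(avg("events").as("avgEvents"))

// composes roughly like the pipeline in the mail:
val model = train(Settings(minEvents = 10))(
  sqlContext.read.parquet("/data/events"),
  sqlContext.read.parquet("/data/users"))
model.write.parquet("/data/model")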
On Feb 3, 2016 1:06 AM, "Ryan Harris"  wrote:

> https://github.com/myui/hivemall
>
>
>
> as long as you are comfortable with java UDFs, the sky is really the
> limit...it's not for everyone and spark does have many advantages, but they
> are two tools that can complement each other in numerous ways.
>
>
>
> I don't know that there is necessarily a universal "better" for how to use
> spark as an execution engine (or if spark is necessarily the **best**
> execution engine for any given hive job).
>
>
>
> The reality is that once you start factoring in the numerous tuning
> parameters of the systems and jobs there probably isn't a clear answer.
> For some queries, the Catalyst optimizer may do a better job...is it going
> to do a better job with ORC based data? less likely IMO.
>
>
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* Tuesday, February 02, 2016 9:50 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> yeah but have you ever seen someone write a real analytical program in
> hive? how? where are the basic abstractions to wrap up a large amount of
> operations (joins, groupby's) into a single function call? where are the
> tools to write nice unit test for that?
>
> for example in spark i can write a DataFrame => DataFrame that internally
> does many joins, groupBys and complex operations. all unit tested and
> perfectly re-usable. and in hive? copy paste round sql queries? thats just
> dangerous.
>
>
>
> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
> wrote:
>
> Hive has numerous extension points, you are not boxed in by a long shot.
>
>
>
> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>
> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
>
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>
> When comparing the performance, you need to do it apple vs apple. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel. However, you didn't provide information on how much Spark SQL is
> utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive or Spark SQL is indeed faster. You should be able to
> see the resource usage in YARN resource manage URL.
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
> wrote:
>
> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature rich compared to Spark. Having said
> that in certain areas for example where the SQL feature is available in
> Spark, Spark seems to deliver faster.
>
>
>
> This may be:
>
>
>
> 1.Spark does both the optimisation and execution seamlessly
>
> 2.Hive on Spark has to invoke YARN that adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100Million rows ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0 

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Koert Kuipers
1) spark bundles hive-metastore and hive-exec to get access to the
metastore and serdes. and i am pretty sure they would like to reduce this
if they could given the kitchen sink of dependencies that hive is, but that
is not easy since hive was never written as re-usable java libraries. i
imagine that ideally spark would use hcatalog.
2) i dont know much about catalyst sauce... i do think scala lends itself
somewhat better than java to writing such a thing. tez is interesting to me
as well but again i would avoid hive, since there is more interesting stuff
to do in the world than ETL and data warehousing. scalding on tez would be
my choice.
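
For context, a small sketch of what those bundled metastore/serde classes buy
a Spark 1.x user: metastore-backed tables show up as DataFrames. The table
name is illustrative; it assumes a spark shell with sc in scope.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// resolved through the hive metastore, read with the bundled serdes
val dummy = hiveContext.table("default.dummy")
dummy.filter("id in (1, 5, 10)").show()
hiveContext.sql("show tables").show()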


On Wed, Feb 3, 2016 at 9:27 AM, Edward Capriolo 
wrote:

> Thank you for the speech. There is an infinite list of things hive does
> not do/cant to well.
> There is an infinite list of things spark does not do /cant do well.
>
> Some facts:
> 1) spark has a complete fork of hive inside it. So its hard to trash hive
> without at least noting the fact that its a portion of sparks guts.
> 2) there were lots of people touting benchmarks about spark sql beating
> hive, lots of fud about catalyst awesome sauce. But then it seems like hive
> and tez made spark say uncle...
>
> https://www.slideshare.net/mobile/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
>
>
> On Wednesday, February 3, 2016, Koert Kuipers  wrote:
>
>> ok i am sure there is some way to do it. i am going to guess snippets of
>> hive code stuck together with oozie jobs or whatever. the oozie jobs become
>> the re-usable pieces perhaps? now you got sql and xml, completely lacking
>> any benefits of a compiler to catch errors. unit tests will be slow if even
>> available at all. so yeah
>> yeah i am sure it can be made to *work*. just like you can get a nail
>> into a wall with a screwdriver if you really want.
>>
>> On Tue, Feb 2, 2016 at 11:49 PM, Koert Kuipers  wrote:
>>
>>> yeah but have you ever seen someone write a real analytical program in
>>> hive? how? where are the basic abstractions to wrap up a large amount of
>>> operations (joins, groupby's) into a single function call? where are the
>>> tools to write nice unit test for that?
>>>
>>> for example in spark i can write a DataFrame => DataFrame that
>>> internally does many joins, groupBys and complex operations. all unit
>>> tested and perfectly re-usable. and in hive? copy paste round sql queries?
>>> thats just dangerous.
>>>
>>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
>>> wrote:
>>>
>>>> Hive has numerous extension points, you are not boxed in by a long shot.
>>>>
>>>>
>>>> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>>>>
>>>>> uuuhm with spark using Hive metastore you actually have a real
>>>>> programming environment and you can write real functions, versus just 
>>>>> being
>>>>> boxed into some version of sql and limited udfs?
>>>>>
>>>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang 
>>>>> wrote:
>>>>>
>>>>>> When comparing the performance, you need to do it apple vs apple. In
>>>>>> another thread, you mentioned that Hive on Spark is much slower than 
>>>>>> Spark
>>>>>> SQL. However, you configured Hive such that only two tasks can run in
>>>>>> parallel. However, you didn't provide information on how much Spark SQL 
>>>>>> is
>>>>>> utilizing. Thus, it's hard to tell whether it's just a configuration
>>>>>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>>>>>> see the resource usage in YARN resource manage URL.
>>>>>>
>>>>>> --Xuefu
>>>>>>
>>>>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Jeff.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Obviously Hive is much more feature rich compared to Spark. Having
>>>>>>> said that in certain areas for example where the SQL feature is 
>>>>>>> available
>>>>>>> in Spark, Spark seems to deliver faster.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This may be:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 1.Spark does both

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
ok i am sure there is some way to do it. i am going to guess snippets of
hive code stuck together with oozie jobs or whatever. the oozie jobs become
the re-usable pieces perhaps? now you got sql and xml, completely lacking
any benefits of a compiler to catch errors. unit tests will be slow if even
available at all. so yeah
yeah i am sure it can be made to *work*. just like you can get a nail into
a wall with a screwdriver if you really want.

On Tue, Feb 2, 2016 at 11:49 PM, Koert Kuipers  wrote:

> yeah but have you ever seen someone write a real analytical program in
> hive? how? where are the basic abstractions to wrap up a large amount of
> operations (joins, groupby's) into a single function call? where are the
> tools to write nice unit test for that?
>
> for example in spark i can write a DataFrame => DataFrame that internally
> does many joins, groupBys and complex operations. all unit tested and
> perfectly re-usable. and in hive? copy paste round sql queries? thats just
> dangerous.
>
> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
> wrote:
>
>> Hive has numerous extension points, you are not boxed in by a long shot.
>>
>>
>> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>>
>>> uuuhm with spark using Hive metastore you actually have a real
>>> programming environment and you can write real functions, versus just being
>>> boxed into some version of sql and limited udfs?
>>>
>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>>>
>>>> When comparing the performance, you need to do it apple vs apple. In
>>>> another thread, you mentioned that Hive on Spark is much slower than Spark
>>>> SQL. However, you configured Hive such that only two tasks can run in
>>>> parallel. However, you didn't provide information on how much Spark SQL is
>>>> utilizing. Thus, it's hard to tell whether it's just a configuration
>>>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>>>> see the resource usage in YARN resource manage URL.
>>>>
>>>> --Xuefu
>>>>
>>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
>>>> wrote:
>>>>
>>>>> Thanks Jeff.
>>>>>
>>>>>
>>>>>
>>>>> Obviously Hive is much more feature rich compared to Spark. Having
>>>>> said that in certain areas for example where the SQL feature is available
>>>>> in Spark, Spark seems to deliver faster.
>>>>>
>>>>>
>>>>>
>>>>> This may be:
>>>>>
>>>>>
>>>>>
>>>>> 1.Spark does both the optimisation and execution seamlessly
>>>>>
>>>>> 2.Hive on Spark has to invoke YARN that adds another layer to the
>>>>> process
>>>>>
>>>>>
>>>>>
>>>>> Now I did some simple tests on a 100Million rows ORC table available
>>>>> through Hive to both.
>>>>>
>>>>>
>>>>>
>>>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>>>
>>>>> 1   0   0   63
>>>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>>>> xx
>>>>>
>>>>> 5   0   4   31
>>>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>>>> xx
>>>>>
>>>>> 10  99  999 188
>>>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>>>>> xx
>>>>>
>>>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>>>
>>>>> 1   0   0   63
>>>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>>>> xx
>>>>>
>>>>> 5   0   4   31
>>>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>>>> xx
>>>>>
>>>

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
yeah but have you ever seen someone write a real analytical program in
hive? how? where are the basic abstractions to wrap up a large amount of
operations (joins, groupby's) into a single function call? where are the
tools to write nice unit test for that?

for example in spark i can write a DataFrame => DataFrame that internally
does many joins, groupBys and complex operations. all unit tested and
perfectly re-usable. and in hive? copy paste around sql queries? thats just
dangerous.
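
A sketch (not the author's actual code) of the DataFrame => DataFrame pattern
being argued for here, with a throwaway unit-style check; the function,
columns and data are invented, and it assumes a spark shell with sqlContext in
scope.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum}

// the logic lives in a plain function, so it can be tested on tiny
// in-memory data without touching a cluster
def dailyRevenue(orders: DataFrame): DataFrame =
  orders
    .filter(col("status") === "completed")
    .groupBy("day")
    .agg(sum("amount").as("revenue"))

import sqlContext.implicits._
val orders = Seq(
  ("2016-02-01", "completed", 10.0),
  ("2016-02-01", "cancelled", 99.0),
  ("2016-02-02", "completed", 5.0)
).toDF("day", "status", "amount")

val result = dailyRevenue(orders).collect()
  .map(r => r.getString(0) -> r.getDouble(1)).toMap
assert(result == Map("2016-02-01" -> 10.0, "2016-02-02" -> 5.0))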

On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
wrote:

> Hive has numerous extension points, you are not boxed in by a long shot.
>
>
> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>
>> uuuhm with spark using Hive metastore you actually have a real
>> programming environment and you can write real functions, versus just being
>> boxed into some version of sql and limited udfs?
>>
>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>>
>>> When comparing the performance, you need to do it apple vs apple. In
>>> another thread, you mentioned that Hive on Spark is much slower than Spark
>>> SQL. However, you configured Hive such that only two tasks can run in
>>> parallel. However, you didn't provide information on how much Spark SQL is
>>> utilizing. Thus, it's hard to tell whether it's just a configuration
>>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>>> see the resource usage in YARN resource manage URL.
>>>
>>> --Xuefu
>>>
>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
>>> wrote:
>>>
>>>> Thanks Jeff.
>>>>
>>>>
>>>>
>>>> Obviously Hive is much more feature rich compared to Spark. Having said
>>>> that in certain areas for example where the SQL feature is available in
>>>> Spark, Spark seems to deliver faster.
>>>>
>>>>
>>>>
>>>> This may be:
>>>>
>>>>
>>>>
>>>> 1.Spark does both the optimisation and execution seamlessly
>>>>
>>>> 2.Hive on Spark has to invoke YARN that adds another layer to the
>>>> process
>>>>
>>>>
>>>>
>>>> Now I did some simple tests on a 100Million rows ORC table available
>>>> through Hive to both.
>>>>
>>>>
>>>>
>>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>>
>>>> 1   0   0   63
>>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>>> xx
>>>>
>>>> 5   0   4   31
>>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>>> xx
>>>>
>>>> 10  99  999 188
>>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>>>> xx
>>>>
>>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>>
>>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>>
>>>> 1   0   0   63
>>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>>> xx
>>>>
>>>> 5   0   4   31
>>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>>> xx
>>>>
>>>> 10  99  999 188
>>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>>>> xx
>>>>
>>>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>>>
>>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>>
>>>> 1   0   0   63
>>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>>> xx
>>>>
>>>> 5   0   4   31
>>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>>> xx
>>>>
>>>> 10  99  999 188
>>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>>>> xx
>>>>
>>>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>>>
>>>>
>>>>
>>>> So three runs returning three rows just over 50 seconds
>>>>
>>>>
>>>>
&

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
yes. the ability to start with sql but when needed expand into more full
blown programming languages, machine learning etc. is a huge plus. after
all this is a cluster, and just querying or extracting data to move it off
the cluster into some other analytics tool is going to be very inefficient
and defeats the purpose to some extent of having a cluster. so you want to
have a capability to do more than queries and etl. and spark is that
ticket. hive is simply not. well not for anything somewhat complex anyhow.
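
A hedged sketch of "start with sql, then keep going on-cluster" rather than
exporting the data; the table, columns and settings are made up, and it
assumes a Spark 1.5 shell with sc in scope.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

val hiveContext = new HiveContext(sc)

// start in sql against the metastore...
val training = hiveContext.sql(
  "select cast(clicked as double) as label, age, income " +
  "from default.visits where ds = '2016-02-01'")

// ...then stay in the same process for the ML part
val features = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("features")
  .transform(training)

val model = new LogisticRegression().setMaxIter(10).fit(features)
model.transform(features).select("label", "prediction").show()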


On Tue, Feb 2, 2016 at 8:06 PM, Mich Talebzadeh  wrote:

> Hi,
>
>
>
> Are you referring to spark-shell with Scala, Python and others?
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* 03 February 2016 00:09
>
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
>
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>
> When comparing the performance, you need to do it apple vs apple. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel. However, you didn't provide information on how much Spark SQL is
> utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive or Spark SQL is indeed faster. You should be able to
> see the resource usage in YARN resource manage URL.
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
> wrote:
>
> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature rich compared to Spark. Having said
> that in certain areas for example where the SQL feature is available in
> Spark, Spark seems to deliver faster.
>
>
>
> This may be:
>
>
>
> 1.Spark does both the optimisation and execution seamlessly
>
> 2.Hive on Spark has to invoke YARN that adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100Million rows ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrK

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
uuuhm with spark using Hive metastore you actually have a real programming
environment and you can write real functions, versus just being boxed into
some version of sql and limited udfs?

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:

> When comparing the performance, you need to do it apple vs apple. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel. However, you didn't provide information on how much Spark SQL is
> utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive or Spark SQL is indeed faster. You should be able to
> see the resource usage in YARN resource manage URL.
>
> --Xuefu
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
> wrote:
>
>> Thanks Jeff.
>>
>>
>>
>> Obviously Hive is much more feature rich compared to Spark. Having said
>> that in certain areas for example where the SQL feature is available in
>> Spark, Spark seems to deliver faster.
>>
>>
>>
>> This may be:
>>
>>
>>
>> 1.Spark does both the optimisation and execution seamlessly
>>
>> 2.Hive on Spark has to invoke YARN that adds another layer to the
>> process
>>
>>
>>
>> Now I did some simple tests on a 100Million rows ORC table available
>> through Hive to both.
>>
>>
>>
>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>
>>
>>
>>
>>
>> spark-sql> select * from dummy where id in (1, 5, 10);
>>
>> 1   0   0   63
>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>> xx
>>
>> 5   0   4   31
>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>> xx
>>
>> 10  99  999 188
>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>> xx
>>
>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>
>> spark-sql> select * from dummy where id in (1, 5, 10);
>>
>> 1   0   0   63
>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>> xx
>>
>> 5   0   4   31
>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>> xx
>>
>> 10  99  999 188
>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>> xx
>>
>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>
>> spark-sql> select * from dummy where id in (1, 5, 10);
>>
>> 1   0   0   63
>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>> xx
>>
>> 5   0   4   31
>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>> xx
>>
>> 10  99  999 188
>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>> xx
>>
>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>
>>
>>
>> So three runs returning three rows just over 50 seconds
>>
>>
>>
>> *Hive 1.2.1 on spark 1.3.1 execution engine*
>>
>>
>>
>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>> (1, 5, 10);
>>
>> INFO  :
>>
>> Query Hive on Spark job[4] stages:
>>
>> INFO  : 4
>>
>> INFO  :
>>
>> Status: Running (Hive on Spark job[4])
>>
>> INFO  : Status: Finished successfully in 82.49 seconds
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>> | dummy.random_string | dummy.small_vc  |
>> dummy.padding  |
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> | 1 | 0| 0| 63|
>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>> xx |
>>
>> | 5 | 0| 4| 31|
>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>> xx |
>>
>> | 10| 99   | 999  | 188   |
>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>> xx |
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> 3 rows selected (82.66 seconds)
>>
>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>> (1, 5, 10);
>>
>> INFO  : Status: Finished successfully in 76.67 seconds
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>> | dummy.random_string | dummy.small_vc  |
>> dummy.padding  |
>>
>>
>> +---+-

Re: Which [open-souce] SQL engine atop Hadoop?

2015-02-01 Thread Koert Kuipers
i would not exclude spark sql unless you really need something mutable in
which case lingual wont work either

On Sat, Jan 31, 2015 at 8:56 PM, Samuel Marks  wrote:

> Interesting discussion. It looks like the HBase metastore can also be
> configured to use HDFS HA (ex. tutorial
> <http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_hag_hdfs_ha_cdh_components_config.html>
> ).
>
> To get back on topic though, the primary contenders now are: Phoenix,
> Lingual and perhaps Tajo or Drill?
>
> Best,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> On Sun, Feb 1, 2015 at 9:38 AM, Edward Capriolo 
> wrote:
>
>> "is the metastore thrift definition stable across hive versions?" I would
>> say yes. Like many API's the core eventually solidifies. No one is saying
>> it will never every change, but basically there are things like "database"
>> and "table" and they have properties like "name". I have some basic scripts
>> that look for table names matching patterns or summarize disk usage by
>> owner. I have not had to touch them very much. Usually if they do change it
>> is something small and if you tie the commit to a jira you can figure out
>> what and why.
>>
>> On Sat, Jan 31, 2015 at 3:02 PM, Koert Kuipers  wrote:
>>
>>> seems the metastore thrift service supports SASL. thats great. so if i
>>> understand it correctly all i need is the metastore thrift definition to
>>> query the metastore.
>>> is the metastore thrift definition stable across hive versions? if so,
>>> then i can build my app once without worrying about the hive version
>>> deployed. in that case i admit its not as bad as i thought. lets see!
>>>
>>> On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers 
>>> wrote:
>>>
>>>> oh sorry edward, i misread your post. seems we agree that "SQL
>>>> constructs inside hive" are not for other systems.
>>>>
>>>> On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers 
>>>> wrote:
>>>>
>>>>> edward,
>>>>> i would not call "SQL constructs inside hive" accessible for other
>>>>> systems. its inside hive after all
>>>>>
>>>>> it is true that i can contact the metastore in java using
>>>>> HiveMetaStoreClient, but then i need to bring in a whole slew of
>>>>> dependencies (the minimum seems to be hive-metastore, hive-common,
>>>>> hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial 
>>>>> and
>>>>> error). these jars need to be "provided" and added to the classpath on the
>>>>> cluster, unless someone is willing to build versions of an application for
>>>>> every hive version out there. and even when you do all this you can only
>>>>> pray its going to be compatible with the next hive version, since 
>>>>> backwards
>>>>> compatibility is... well lets just say lacking. the attitude seems to be
>>>>> that hive does not have a java api, so there is nothing that needs to be
>>>>> stable.
>>>>>
>>>>> you are right i could go the pure thrift road. i havent tried that
>>>>> yet. that might just be the best option. but how easy is it to do this 
>>>>> with
>>>>> a secure hadoop/hive ecosystem? now i need to handle kerberos myself and
>>>>> somehow pass tokens into thrift i assume?
>>>>>
>>>>> contrast all of this with an avro file on hadoop with metadata baked
>>>>> in, and i think its safe to say hive metadata is not easily accessible.
>>>>>
>>>>> i will take a look at your book. i hope it has an example of using
>>>>> thrift on a secure cluster to contact hive metastore (without using the
>>>>> HiveMetaStoreClient), that would be awesome.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <
>>>>> edlinuxg...@gmail.com> wrote:
>>>>>
>>>>>> "with the metadata in a special metadata store (not on hdfs), and its
>>>>>> not as easy for all systems to access hive metadata." I disagree.
>>>>>>
>>>>>> Hives metadata is not only accessible through the SQL constructs like
>>>>>> "describe table". But the entire meta-store also is actually a thrif

Re: Which [open-souce] SQL engine atop Hadoop?

2015-01-31 Thread Koert Kuipers
seems the metastore thrift service supports SASL. thats great. so if i
understand it correctly all i need is the metastore thrift definition to
query the metastore.
is the metastore thrift definition stable across hive versions? if so, then
i can build my app once without worrying about the hive version deployed.
in that case i admit its not as bad as i thought. lets see!
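
For reference, a minimal sketch of the pure-thrift route being considered
here, using the client generated from hive_metastore.thrift. Host, port and
database name are illustrative, and the SASL/kerberos wiring mentioned above
is deliberately left out.

import org.apache.thrift.protocol.TBinaryProtocol
import org.apache.thrift.transport.TSocket
import org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore
import scala.collection.JavaConverters._

val transport = new TSocket("metastore-host", 9083)
transport.open()
val client = new ThriftHiveMetastore.Client(new TBinaryProtocol(transport))

// plain thrift calls, no HiveMetaStoreClient involved
println(client.get_all_databases().asScala)
println(client.get_all_tables("default").asScala)

transport.close()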

On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers  wrote:

> oh sorry edward, i misread your post. seems we agree that "SQL constructs
> inside hive" are not for other systems.
>
> On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers  wrote:
>
>> edward,
>> i would not call "SQL constructs inside hive" accessible for other
>> systems. its inside hive after all
>>
>> it is true that i can contact the metastore in java using
>> HiveMetaStoreClient, but then i need to bring in a whole slew of
>> dependencies (the minimum seems to be hive-metastore, hive-common,
>> hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial and
>> error). these jars need to be "provided" and added to the classpath on the
>> cluster, unless someone is willing to build versions of an application for
>> every hive version out there. and even when you do all this you can only
>> pray its going to be compatible with the next hive version, since backwards
>> compatibility is... well lets just say lacking. the attitude seems to be
>> that hive does not have a java api, so there is nothing that needs to be
>> stable.
>>
>> you are right i could go the pure thrift road. i havent tried that yet.
>> that might just be the best option. but how easy is it to do this with a
>> secure hadoop/hive ecosystem? now i need to handle kerberos myself and
>> somehow pass tokens into thrift i assume?
>>
>> contrast all of this with an avro file on hadoop with metadata baked in,
>> and i think its safe to say hive metadata is not easily accessible.
>>
>> i will take a look at your book. i hope it has an example of using thrift
>> on a secure cluster to contact hive metastore (without using the
>> HiveMetaStoreClient), that would be awesome.
>>
>>
>>
>>
>> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo 
>> wrote:
>>
>>> "with the metadata in a special metadata store (not on hdfs), and its
>>> not as easy for all systems to access hive metadata." I disagree.
>>>
>>> Hives metadata is not only accessible through the SQL constructs like
>>> "describe table". But the entire meta-store also is actually a thrift
>>> service so you have programmatic access to determine things like what
>>> columns are in a table etc. Thrift creates RPC clients for almost every
>>> major language.
>>>
>>> In the programming hive book
>>> http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
>>> there is even examples where I show how to iterate all the tables inside
>>> the database from a java client.
>>>
>>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers 
>>> wrote:
>>>
>>>> yes you can run whatever you like with the data in hdfs. keep in mind
>>>> that hive makes this general access pattern just a little harder, since
>>>> hive has a tendency to store data and metadata separately, with the
>>>> metadata in a special metadata store (not on hdfs), and its not as easy for
>>>> all systems to access hive metadata.
>>>>
>>>> i am not familiar at all with tajo or drill.
>>>>
>>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks 
>>>> wrote:
>>>>
>>>>> Thanks for the advice
>>>>>
>>>>> Koert: when everything is in the same essential data-store (HDFS),
>>>>> can't I just run whatever complex tools I'm whichever paradigm they like?
>>>>>
>>>>> E.g.: GraphX, Mahout &etc.
>>>>>
>>>>> Also, what about Tajo or Drill?
>>>>>
>>>>> Best,
>>>>>
>>>>> Samuel Marks
>>>>> http://linkedin.com/in/samuelmarks
>>>>>
>>>>> PS: Spark-SQL is read-only IIRC, right?
>>>>> On 31 Jan 2015 03:39, "Koert Kuipers"  wrote:
>>>>>
>>>>>> since you require high-powered analytics, and i assume you want to
>>>>>> stay sane while doing so, you require the ability to "drop out of sql" 
>>>>>> when
>>>>>&

Re: Which [open-souce] SQL engine atop Hadoop?

2015-01-31 Thread Koert Kuipers
oh sorry edward, i misread your post. seems we agree that "SQL constructs
inside hive" are not for other systems.

On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers  wrote:

> edward,
> i would not call "SQL constructs inside hive" accessible for other
> systems. its inside hive after all
>
> it is true that i can contact the metastore in java using
> HiveMetaStoreClient, but then i need to bring in a whole slew of
> dependencies (the minimum seems to be hive-metastore, hive-common,
> hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial and
> error). these jars need to be "provided" and added to the classpath on the
> cluster, unless someone is willing to build versions of an application for
> every hive version out there. and even when you do all this you can only
> pray its going to be compatible with the next hive version, since backwards
> compatibility is... well lets just say lacking. the attitude seems to be
> that hive does not have a java api, so there is nothing that needs to be
> stable.
>
> you are right i could go the pure thrift road. i havent tried that yet.
> that might just be the best option. but how easy is it to do this with a
> secure hadoop/hive ecosystem? now i need to handle kerberos myself and
> somehow pass tokens into thrift i assume?
>
> contrast all of this with an avro file on hadoop with metadata baked in,
> and i think its safe to say hive metadata is not easily accessible.
>
> i will take a look at your book. i hope it has an example of using thrift
> on a secure cluster to contact hive metastore (without using the
> HiveMetaStoreClient), that would be awesome.
>
>
>
>
> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo 
> wrote:
>
>> "with the metadata in a special metadata store (not on hdfs), and its not
>> as easy for all systems to access hive metadata." I disagree.
>>
>> Hives metadata is not only accessible through the SQL constructs like
>> "describe table". But the entire meta-store also is actually a thrift
>> service so you have programmatic access to determine things like what
>> columns are in a table etc. Thrift creates RPC clients for almost every
>> major language.
>>
>> In the programming hive book
>> http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
>> there is even examples where I show how to iterate all the tables inside
>> the database from a java client.
>>
>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers 
>> wrote:
>>
>>> yes you can run whatever you like with the data in hdfs. keep in mind
>>> that hive makes this general access pattern just a little harder, since
>>> hive has a tendency to store data and metadata separately, with the
>>> metadata in a special metadata store (not on hdfs), and its not as easy for
>>> all systems to access hive metadata.
>>>
>>> i am not familiar at all with tajo or drill.
>>>
>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks 
>>> wrote:
>>>
>>>> Thanks for the advice
>>>>
>>>> Koert: when everything is in the same essential data-store (HDFS),
>>>> can't I just run whatever complex tools I'm whichever paradigm they like?
>>>>
>>>> E.g.: GraphX, Mahout &etc.
>>>>
>>>> Also, what about Tajo or Drill?
>>>>
>>>> Best,
>>>>
>>>> Samuel Marks
>>>> http://linkedin.com/in/samuelmarks
>>>>
>>>> PS: Spark-SQL is read-only IIRC, right?
>>>> On 31 Jan 2015 03:39, "Koert Kuipers"  wrote:
>>>>
>>>>> since you require high-powered analytics, and i assume you want to
>>>>> stay sane while doing so, you require the ability to "drop out of sql" 
>>>>> when
>>>>> needed. so spark-sql and lingual would be my choices.
>>>>>
>>>>> low latency indicates phoenix or spark-sql to me.
>>>>>
>>>>> so i would say spark-sql
>>>>>
>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks 
>>>>> wrote:
>>>>>
>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does 
>>>>>> open-source
>>>>>> a lot of software <http://www.pivotal.io/oss>, I don't believe they
>>>>>> open source Pivotal HD: HAWQ.
>>>>>>
>>>>>&

Re: Which [open-souce] SQL engine atop Hadoop?

2015-01-31 Thread Koert Kuipers
edward,
i would not call "SQL constructs inside hive" accessible for other systems.
its inside hive after all

it is true that i can contact the metastore in java using
HiveMetaStoreClient, but then i need to bring in a whole slew of
dependencies (the minimum seems to be hive-metastore, hive-common,
hive-shims, libfb303, libthrift and a few hadoop dependencies, by trial and
error). these jars need to be "provided" and added to the classpath on the
cluster, unless someone is willing to build versions of an application for
every hive version out there. and even when you do all this you can only
pray its going to be compatible with the next hive version, since backwards
compatibility is... well lets just say lacking. the attitude seems to be
that hive does not have a java api, so there is nothing that needs to be
stable.

you are right i could go the pure thrift road. i havent tried that yet.
that might just be the best option. but how easy is it to do this with a
secure hadoop/hive ecosystem? now i need to handle kerberos myself and
somehow pass tokens into thrift i assume?

contrast all of this with an avro file on hadoop with metadata baked in,
and i think its safe to say hive metadata is not easily accessible.

i will take a look at your book. i hope it has an example of using thrift
on a secure cluster to contact hive metastore (without using the
HiveMetaStoreClient), that would be awesome.
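
For contrast, a sketch of the HiveMetaStoreClient route discussed above; the
code itself is short, but it drags in hive-metastore, hive-common, hive-shims,
libfb303, libthrift and hadoop jars on the classpath. The metastore uri and
table are made up.

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
import scala.collection.JavaConverters._

val conf = new HiveConf()
conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083")

val client = new HiveMetaStoreClient(conf)
val table = client.getTable("default", "dummy")
println(table.getSd.getCols.asScala.map(c => c.getName -> c.getType))
client.close()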




On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo 
wrote:

> "with the metadata in a special metadata store (not on hdfs), and its not
> as easy for all systems to access hive metadata." I disagree.
>
> Hives metadata is not only accessible through the SQL constructs like
> "describe table". But the entire meta-store also is actually a thrift
> service so you have programmatic access to determine things like what
> columns are in a table etc. Thrift creates RPC clients for almost every
> major language.
>
> In the programming hive book
> http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
> there is even examples where I show how to iterate all the tables inside
> the database from a java client.
>
> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers  wrote:
>
>> yes you can run whatever you like with the data in hdfs. keep in mind
>> that hive makes this general access pattern just a little harder, since
>> hive has a tendency to store data and metadata separately, with the
>> metadata in a special metadata store (not on hdfs), and its not as easy for
>> all systems to access hive metadata.
>>
>> i am not familiar at all with tajo or drill.
>>
>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks 
>> wrote:
>>
>>> Thanks for the advice
>>>
>>> Koert: when everything is in the same essential data-store (HDFS), can't
>>> I just run whatever complex tools I'm whichever paradigm they like?
>>>
>>> E.g.: GraphX, Mahout &etc.
>>>
>>> Also, what about Tajo or Drill?
>>>
>>> Best,
>>>
>>> Samuel Marks
>>> http://linkedin.com/in/samuelmarks
>>>
>>> PS: Spark-SQL is read-only IIRC, right?
>>> On 31 Jan 2015 03:39, "Koert Kuipers"  wrote:
>>>
>>>> since you require high-powered analytics, and i assume you want to stay
>>>> sane while doing so, you require the ability to "drop out of sql" when
>>>> needed. so spark-sql and lingual would be my choices.
>>>>
>>>> low latency indicates phoenix or spark-sql to me.
>>>>
>>>> so i would say spark-sql
>>>>
>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks 
>>>> wrote:
>>>>
>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does 
>>>>> open-source
>>>>> a lot of software <http://www.pivotal.io/oss>, I don't believe they
>>>>> open source Pivotal HD: HAWQ.
>>>>>
>>>>> So that doesn't meet my requirements. I should note that the project I
>>>>> am building will also be open-source, which heightens the importance of
>>>>> having all components also being open-source.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Samuel Marks
>>>>> http://linkedin.com/in/samuelmarks
>>>>>
>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>>> siddharth.tiw...@live.com> wrote:
>>>>>
>>>>>> Have you looked at

Re: Which [open-souce] SQL engine atop Hadoop?

2015-01-31 Thread Koert Kuipers
Spark-SQL is read-only yes, in the sense that it does not support mutation
but only transformation to a new dataset that you store separately.

i am not aware of many systems that support mutation. systems that support
mutation will not use HDFS as the datastore. so something like Phoenix
(backed by HBase) will be needed for that.
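
A small sketch of the "transform and store separately" model, as opposed to
in-place UPDATE/DELETE; table and path names are invented, and it assumes a
spark shell whose sqlContext has hive support.

val events = sqlContext.table("default.events")

// no mutation of the underlying files; derive a new dataset instead...
val cleaned = events.filter("status <> 'cancelled'")

// ...and store it separately, as a new path or a new table
cleaned.write.parquet("/data/events_cleaned")
cleaned.write.saveAsTable("events_cleaned")   // needs hive support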

On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers  wrote:

> yes you can run whatever you like with the data in hdfs. keep in mind that
> hive makes this general access pattern just a little harder, since hive has
> a tendency to store data and metadata separately, with the metadata in a
> special metadata store (not on hdfs), and its not as easy for all systems
> to access hive metadata.
>
> i am not familiar at all with tajo or drill.
>
> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks 
> wrote:
>
>> Thanks for the advice
>>
>> Koert: when everything is in the same essential data-store (HDFS), can't
>> I just run whatever complex tools I'm whichever paradigm they like?
>>
>> E.g.: GraphX, Mahout &etc.
>>
>> Also, what about Tajo or Drill?
>>
>> Best,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>>
>> PS: Spark-SQL is read-only IIRC, right?
>> On 31 Jan 2015 03:39, "Koert Kuipers"  wrote:
>>
>>> since you require high-powered analytics, and i assume you want to stay
>>> sane while doing so, you require the ability to "drop out of sql" when
>>> needed. so spark-sql and lingual would be my choices.
>>>
>>> low latency indicates phoenix or spark-sql to me.
>>>
>>> so i would say spark-sql
>>>
>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks 
>>> wrote:
>>>
>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does 
>>>> open-source
>>>> a lot of software <http://www.pivotal.io/oss>, I don't believe they
>>>> open source Pivotal HD: HAWQ.
>>>>
>>>> So that doesn't meet my requirements. I should note that the project I
>>>> am building will also be open-source, which heightens the importance of
>>>> having all components also being open-source.
>>>>
>>>> Cheers,
>>>>
>>>> Samuel Marks
>>>> http://linkedin.com/in/samuelmarks
>>>>
>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>> siddharth.tiw...@live.com> wrote:
>>>>
>>>>> Have you looked at HAWQ from Pivotal ?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks 
>>>>> wrote:
>>>>>
>>>>> Since Hadoop <https://hive.apache.org> came out, there have been
>>>>> various commercial and/or open-source attempts to expose some 
>>>>> compatibility
>>>>> with SQL <http://drill.apache.org>. Obviously by posting here I am
>>>>> not expecting an unbiased answer.
>>>>>
>>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency
>>>>> querying, and supports the most common CRUD <https://spark.apache.org>,
>>>>> including [the basics!] along these lines: CREATE TABLE, INSERT INTO, 
>>>>> SELECT
>>>>> * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>>>>> Transactional support would be nice also, but is not a must-have.
>>>>>
>>>>> Essentially I want a full replacement for the more traditional RDBMS,
>>>>> one which can scale from 1 node to a serious Hadoop cluster.
>>>>>
>>>>> Python is my language of choice for interfacing, however there does
>>>>> seem to be a Python JDBC wrapper <https://spark.apache.org/sql>.
>>>>>
>>>>> Here is what I've found thus far:
>>>>>
>>>>>- Apache Hive <https://hive.apache.org> (SQL-like, with
>>>>>interactive SQL thanks to the Stinger initiative)
>>>>>- Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>- Apache Spark <https://spark.apache.org> (Spark SQL
>>>>><https://spark.apache.org/sql>, queries only, add data via Hive,
>>>>>RDD
>>>>>
>>>>> <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRD

Re: Which [open-souce] SQL engine atop Hadoop?

2015-01-31 Thread Koert Kuipers
yes you can run whatever you like with the data in hdfs. keep in mind that
hive makes this general access pattern just a little harder, since hive has
a tendency to store data and metadata separately, with the metadata in a
special metadata store (not on hdfs), and its not as easy for all systems
to access hive metadata.

i am not familiar at all with tajo or drill.

On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks  wrote:

> Thanks for the advice
>
> Koert: when everything is in the same essential data-store (HDFS), can't I
> just run whatever complex tools I'm whichever paradigm they like?
>
> E.g.: GraphX, Mahout &etc.
>
> Also, what about Tajo or Drill?
>
> Best,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> PS: Spark-SQL is read-only IIRC, right?
> On 31 Jan 2015 03:39, "Koert Kuipers"  wrote:
>
>> since you require high-powered analytics, and i assume you want to stay
>> sane while doing so, you require the ability to "drop out of sql" when
>> needed. so spark-sql and lingual would be my choices.
>>
>> low latency indicates phoenix or spark-sql to me.
>>
>> so i would say spark-sql
>>
>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks 
>> wrote:
>>
>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does 
>>> open-source
>>> a lot of software <http://www.pivotal.io/oss>, I don't believe they
>>> open source Pivotal HD: HAWQ.
>>>
>>> So that doesn't meet my requirements. I should note that the project I
>>> am building will also be open-source, which heightens the importance of
>>> having all components also being open-source.
>>>
>>> Cheers,
>>>
>>> Samuel Marks
>>> http://linkedin.com/in/samuelmarks
>>>
>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>> siddharth.tiw...@live.com> wrote:
>>>
>>>> Have you looked at HAWQ from Pivotal ?
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks 
>>>> wrote:
>>>>
>>>> Since Hadoop <https://hive.apache.org> came out, there have been
>>>> various commercial and/or open-source attempts to expose some compatibility
>>>> with SQL <http://drill.apache.org>. Obviously by posting here I am not
>>>> expecting an unbiased answer.
>>>>
>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency querying,
>>>> and supports the most common CRUD <https://spark.apache.org>,
>>>> including [the basics!] along these lines: CREATE TABLE, INSERT INTO, 
>>>> SELECT
>>>> * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE.
>>>> Transactional support would be nice also, but is not a must-have.
>>>>
>>>> Essentially I want a full replacement for the more traditional RDBMS,
>>>> one which can scale from 1 node to a serious Hadoop cluster.
>>>>
>>>> Python is my language of choice for interfacing, however there does
>>>> seem to be a Python JDBC wrapper <https://spark.apache.org/sql>.
>>>>
>>>> Here is what I've found thus far:
>>>>
>>>>- Apache Hive <https://hive.apache.org> (SQL-like, with interactive
>>>>SQL thanks to the Stinger initiative)
>>>>- Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>- Apache Spark <https://spark.apache.org> (Spark SQL
>>>><https://spark.apache.org/sql>, queries only, add data via Hive, RDD
>>>>
>>>> <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>or Parquet <http://parquet.io/>)
>>>>- Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>HBase <http://hbase.apache.org>, lacks full transaction
>>>><http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>operators <http://en.wikipedia.org/wiki/Relational_operators> and
>>>>some built-in functions)
>>>>- Cloudera Impala
>>>>
>>>> <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>(significant HiveQL support, some SQL language support, no support for
>>>>indexes on its tables, importantly missing DELETE,

Re: Which [open-souce] SQL engine atop Hadoop?

2015-01-30 Thread Koert Kuipers
since you require high-powered analytics, and i assume you want to stay
sane while doing so, you require the ability to "drop out of sql" when
needed. so spark-sql and lingual would be my choices.

low latency indicates phoenix or spark-sql to me.

so i would say spark-sql

On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks  wrote:

> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and exposing
> both JDBC and ODBC interfaces. However, although Pivotal does open-source
> a lot of software , I don't believe they open
> source Pivotal HD: HAWQ.
>
> So that doesn't meet my requirements. I should note that the project I am
> building will also be open-source, which heightens the importance of having
> all components also being open-source.
>
> Cheers,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
> siddharth.tiw...@live.com> wrote:
>
>> Have you looked at HAWQ from Pivotal ?
>>
>> Sent from my iPhone
>>
>> On Jan 30, 2015, at 4:27 AM, Samuel Marks  wrote:
>>
>> Since Hadoop  came out, there have been various
>> commercial and/or open-source attempts to expose some compatibility with
>> SQL . Obviously by posting here I am not
>> expecting an unbiased answer.
>>
>> Seeking an SQL-on-Hadoop offering which provides: low-latency querying,
>> and supports the most common CRUD , including
>> [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM,
>> UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional
>> support would be nice also, but is not a must-have.
>>
>> Essentially I want a full replacement for the more traditional RDBMS, one
>> which can scale from 1 node to a serious Hadoop cluster.
>>
>> Python is my language of choice for interfacing, however there does seem
>> to be a Python JDBC wrapper .
>>
>> Here is what I've found thus far:
>>
>>- Apache Hive  (SQL-like, with interactive
>>SQL thanks to the Stinger initiative)
>>- Apache Drill  (ANSI SQL support)
>>- Apache Spark  (Spark SQL
>>, queries only, add data via Hive, RDD
>>
>> 
>>or Parquet )
>>- Apache Phoenix  (built atop Apache HBase
>>, lacks full transaction
>> support, relational
>>operators  and
>>some built-in functions)
>>- Cloudera Impala
>>
>> 
>>(significant HiveQL support, some SQL language support, no support for
>>indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
>>amongst others)
>>- Presto  from Facebook (can
>>query Hive, Cassandra , relational DBs
>>&etc. Doesn't seem to be designed for low-latency responses across small
>>clusters, or support UPDATE operations. It is optimized for data
>>warehousing or analytics¹
>>)
>>- SQL-Hadoop  via MapR
>>community edition 
>>(seems to be a packaging of Hive, HP Vertica
>>, SparkSQL,
>>Drill and a native ODBC wrapper
>>)
>>- Apache Kylin  from Ebay (provides an SQL
>>interface and multi-dimensional analysis [OLAP
>>], "… offers ANSI SQL on Hadoop
>>and supports most ANSI SQL query functions". It depends on HDFS, 
>> MapReduce,
>>Hive and HBase; and seems targeted at very large data-sets though 
>> maintains
>>low query latency)
>>- Apache Tajo  (ANSI/ISO SQL standard
>>compliance with JDBC  driver
>>support [benchmarks against Hive and Impala
>>
>> 
>>])
>>- Cascading 's
>>Lingual ²
>> ("Lingual
>>provides JDBC Drivers, a SQL command shell, and a catalog manager for
>>publishing files [or any resource] as schemas and tables.")
>>
>> Which—from this list or elsewhere—would you recommend, and why?
>> Thanks for all suggestions,
>>
>> Samuel Marks
>> http://link

SerDe loading external scheme

2012-04-05 Thread Koert Kuipers
I am working on a hive SerDe where both SerDe and RecordReader need to have
access to an external resource with information.
This external resource could be on hdfs, in hbase, or on a http server.
This situation is very similar to what haivvreo does.

The way i go about it right now is that i store the uri for the external
resource in the SERDEPROPERTIES and then both SerDe and RecordReader use
that to load the resource. I had to jump through some hoops to retrieve the
Properties object (the SERDEPROPERTIES) in the RecordReader, but now it
works. However this is far from optimal, since on a large cluster this
leads to a lot of read requests on the external resource.

Since SerDe gets called at least once on the client before the mapreduce
job is started, i would like to load my external resource there, and then
stuff it in the Configuration object, the Properties object or in the
Distributed Cache. Then the SerDes and RecordReaders on the cluster could
get it from there and wouldn't have to access the external resource.

I made the changes. But whatever modification i make to Configuration
object or Properties object on the client in SerDe doesn't make it to the
cluster! Is there a way to do this?


SerDe and InputFormat

2012-02-21 Thread Koert Kuipers
I make changes to the Configuration in my SerDe expecting those to be
passed to the InputFormat (and OutputFormat). Yet the InputFormat seems to
get an unchanged JobConf? Is this a known limitation?

I find it very confusing since the Configuration is the main way to
communicate with the MapReduce process... So i assume i must be doing
something wrong and this is possible.

Thanks for your help. Koert


2 questions about SerDe

2012-02-21 Thread Koert Kuipers
1) Is there a way in initialize() of a SerDe to know if it is being used as
a Serializer or a Deserializer? If not, can i define the Serializer and
Deserializer separately instead of defining a SerDe (so i have two
initialize methods)?

2) Is there a way to find out which columns are being used? say if someone
does select a,b,c from test, and my SerDe gets initialized for usage in
that query, how can i know that only a,b,c are needed? i would like to
take advantage of this information so i don't deserialize unnecessary
information, without having to resort to more complex lazy deserialization
tactics.


external partitioned table

2012-02-08 Thread Koert Kuipers
hello all,

we have an external partitioned table in hive.

we add to this table by having map-reduce jobs (so not from hive) create
new subdirectories with the right format (partitionid=partitionvalue).

however hive doesn't pick them up automatically. we have to go into hive
shell and run "alter table sometable add partition
(partitionid=partitionvalue)". to make matters worse hive doesn't really lend
itself to running such an add-partition operation from java (or for that
matter: hive doesn't lend itself to any easy programmatic manipulations...
grrr. but i will stop now before i go on a rant).

any suggestions how to approach this? thanks!

best, koert
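
A minimal HiveQL sketch of the manual workaround discussed above (the table
name, partition column and location are illustrative, and exact syntax support
depends on the hive version in use):

alter table sometable add if not exists partition (partitionid='partitionvalue')
location '/path/to/sometable/partitionid=partitionvalue';

-- later hive releases can also scan the table's directory for partitions that
-- are not yet in the metastore:
msck repair table sometable;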


Re: why 1 reducer on simple join?

2012-01-12 Thread Koert Kuipers
good point... i should have used ON... with ON it runs fine as a map-join,
and if i set hive.auto.convert.join=false then it runs with my specified
number of reducers.
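
For reference, a sketch of the same join rewritten with an explicit ON clause
(table and column names are the placeholders used in this thread), which lets
hive plan a proper join instead of a Cartesian product:

create table z as select x.* from table1 x join table2 y on (
x.col1 = y.col1 and
x.col2 = y.col2 and
x.col3 = y.col3 and
x.col4 = y.col4 and
x.col5 = y.col5
);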

On Thu, Jan 12, 2012 at 6:12 PM, Edward Capriolo wrote:

> You should do joins using the ON clause.
> https://cwiki.apache.org/Hive/languagemanual-joins.html
> be careful: if you do the join wrong, hive does a Cartesian product
> followed by a really long reduce phase rather than the optimal join process.
>
> On Thu, Jan 12, 2012 at 6:04 PM, Aaron McCurry  wrote:
>
>> I see that your query is kinda generic and probably not the original
>> query.  I have seen this behavior with a simple typo like:
>>
>> Notice col3.
>>
>> create table z as select x.* from table1 x join table2 y where (
>> x.col1 = y.col1 and
>> x.col2 = y.col2 and
>> y.col3 = y.col3 and
>> x.col4 = y.col4 and
>> x.col5 = y.col5
>> );
>>
>> Just a thought.
>>
>> Aaron
>>
>> On Thu, Jan 12, 2012 at 6:00 PM, Wojciech Langiewicz <
>> wlangiew...@gmail.com> wrote:
>>
>>> Hello,
>>> Have you tried running only select, without creating table? What are
>>> results?
>>> How did you try to set the number of reducers? Have you used this:
>>> set mapred.reduce.tasks = xyz;
>>> How many mappers does this query use?
>>>
>>>
>>> On 12.01.2012 23:53, Koert Kuipers wrote:
>>>
>>>> I am running a basic join of 2 tables and it will only run with 1
>>>> reducer.
>>>> why is that? i tried to set the number of reducers and it didn't work.
>>>> hive
>>>> just ignored it.
>>>>
>>>> create table z as select x.* from table1 x join table2 y where (
>>>> x.col1 = y.col1 and
>>>> x.col2 = y.col2 and
>>>> x.col3 = y.col3 and
>>>> x.col4 = y.col4 and
>>>> x.col5 = y.col5
>>>> );
>>>>
>>>> both tables are backed by multiple files / blocks / chunks
>>>>
>>>>
>>> --
>>> Wojciech Langiewicz
>>>
>>
>>
>


Re: why 1 reducer on simple join?

2012-01-12 Thread Koert Kuipers
that query without the create table turns into a map-join and runs fast
without any reducers.
if i turn map-join off then it goes back to map-reduce with 1 reducer and
ignores mapred.reduce.tasks again.
i am using hive 0.7

On Thu, Jan 12, 2012 at 6:28 PM, Wojciech Langiewicz
wrote:

> I meant this query (without create table):
>
> select x.* from table1 x join table2 y where (
> x.col1 = y.col1 and
> x.col2 = y.col2 and
> x.col3 = y.col3 and
> x.col4 = y.col4 and
> x.col5 = y.col5
> );
>
> this document might be useful: https://cwiki.apache.org/Hive/**
> joinoptimization.html<https://cwiki.apache.org/Hive/joinoptimization.html>
>
> Especially try this setting:
> set hive.auto.convert.join = true; (or false)
>
> Which version of Hive are you using?
>
>
> On 13.01.2012 00:24, Koert Kuipers wrote:
>
>> hive>  set mapred.reduce.tasks = 3;
>> hive>  select count(*) from table1 group by column1 limit 10;
>> query runs with 38 mappers and 3 reducers
>>
>> hive>  select count(*) from table2 group by column1 limit 10;
>> query runs with 6 mappers and 3 reducers
>>
>> On Thu, Jan 12, 2012 at 6:09 PM, Wojciech Langiewicz
>> wrote:
>>
>>  What do you mean by "Select runs fine" - is it using number of reducers
>>> that you set?
>>> It might help if you could show actual query.
>>>
>>>
>>> On 13.01.2012 00:03, Koert Kuipers wrote:
>>>
>>>  I tried set mapred.reduce.tasks = xyz; hive ignored it.
>>>> Selects run fine. The query uses 44 mappers.
>>>>
>>>> On Thu, Jan 12, 2012 at 6:00 PM, Wojciech Langiewicz
>>>> wrote:
>>>>
>>>>  Hello,
>>>>
>>>>> Have you tried running only select, without creating table? What are
>>>>> results?
>>>>> How did you try to set the number of reducers? Have you used this:
>>>>> set mapred.reduce.tasks = xyz;
>>>>> How many mappers does this query use?
>>>>>
>>>>>
>>>>> On 12.01.2012 23:53, Koert Kuipers wrote:
>>>>>
>>>>>  I am running a basic join of 2 tables and it will only run with 1
>>>>>
>>>>>> reducer.
>>>>>> why is that? i tried to set the number of reducers and it didn't work.
>>>>>> hive
>>>>>> just ignored it.
>>>>>>
>>>>>> create table z as select x.* from table1 x join table2 y where (
>>>>>> x.col1 = y.col1 and
>>>>>> x.col2 = y.col2 and
>>>>>> x.col3 = y.col3 and
>>>>>> x.col4 = y.col4 and
>>>>>> x.col5 = y.col5
>>>>>> );
>>>>>>
>>>>>> both tables are backed by multiple files / blocks / chunks
>>>>>>
>>>>>>
>>>>>>  --
>>>>>>
>>>>> Wojciech Langiewicz
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: why 1 reducer on simple join?

2012-01-12 Thread Koert Kuipers
hive> set mapred.reduce.tasks = 3;
hive> select count(*) from table1 group by column1 limit 10;
query runs with 38 mappers and 3 reducers

hive> select count(*) from table2 group by column1 limit 10;
query runs with 6 mappers and 3 reducers

On Thu, Jan 12, 2012 at 6:09 PM, Wojciech Langiewicz
wrote:

> What do you mean by "Select runs fine" - is it using number of reducers
> that you set?
> It might help if you could show actual query.
>
>
> On 13.01.2012 00:03, Koert Kuipers wrote:
>
>> I tried set mapred.reduce.tasks = xyz; hive ignored it.
>> Selects run fine. The query uses 44 mappers.
>>
>> On Thu, Jan 12, 2012 at 6:00 PM, Wojciech Langiewicz
>> wrote:
>>
>>  Hello,
>>> Have you tried running only select, without creating table? What are
>>> results?
>>> How did you try to set the number of reducers? Have you used this:
>>> set mapred.reduce.tasks = xyz;
>>> How many mappers does this query use?
>>>
>>>
>>> On 12.01.2012 23:53, Koert Kuipers wrote:
>>>
>>>  I am running a basic join of 2 tables and it will only run with 1
>>>> reducer.
>>>> why is that? i tried to set the number of reducers and it didn't work.
>>>> hive
>>>> just ignored it.
>>>>
>>>> create table z as select x.* from table1 x join table2 y where (
>>>> x.col1 = y.col1 and
>>>> x.col2 = y.col2 and
>>>> x.col3 = y.col3 and
>>>> x.col4 = y.col4 and
>>>> x.col5 = y.col5
>>>> );
>>>>
>>>> both tables are backed by multiple files / blocks / chunks
>>>>
>>>>
>>>>  --
>>> Wojciech Langiewicz
>>>
>>>
>>
>


Re: why 1 reducer on simple join?

2012-01-12 Thread Koert Kuipers
thanks for that tip. i checked and unfortunately no typo like that.

would be kind of weird if an identity like that caused a single reducer!

On Thu, Jan 12, 2012 at 6:04 PM, Aaron McCurry  wrote:

> I see that your query is kinda generic and probably not the original
> query.  I have seen this behavior with a simple typo like:
>
> Notice col3.
>
> create table z as select x.* from table1 x join table2 y where (
> x.col1 = y.col1 and
> x.col2 = y.col2 and
> y.col3 = y.col3 and
> x.col4 = y.col4 and
> x.col5 = y.col5
> );
>
> Just a thought.
>
> Aaron
>
> On Thu, Jan 12, 2012 at 6:00 PM, Wojciech Langiewicz <
> wlangiew...@gmail.com> wrote:
>
>> Hello,
>> Have you tried running only select, without creating table? What are
>> results?
>> How did you try to set the number of reducers? Have you used this:
>> set mapred.reduce.tasks = xyz;
>> How many mappers does this query use?
>>
>>
>> On 12.01.2012 23:53, Koert Kuipers wrote:
>>
>>> I am running a basic join of 2 tables and it will only run with 1
>>> reducer.
>>> why is that? i tried to set the number of reducers and it didn't work.
>>> hive
>>> just ignored it.
>>>
>>> create table z as select x.* from table1 x join table2 y where (
>>> x.col1 = y.col1 and
>>> x.col2 = y.col2 and
>>> x.col3 = y.col3 and
>>> x.col4 = y.col4 and
>>> x.col5 = y.col5
>>> );
>>>
>>> both tables are backed by multiple files / blocks / chunks
>>>
>>>
>> --
>> Wojciech Langiewicz
>>
>
>


Re: why 1 reducer on simple join?

2012-01-12 Thread Koert Kuipers
I tried set mapred.reduce.tasks = xyz; hive ignored it.
Selects run fine. The query uses 44 mappers.

On Thu, Jan 12, 2012 at 6:00 PM, Wojciech Langiewicz
wrote:

> Hello,
> Have you tried running only select, without creating table? What are
> results?
> How did you try to set the number of reducers? Have you used this:
> set mapred.reduce.tasks = xyz;
> How many mappers does this query use?
>
>
> On 12.01.2012 23:53, Koert Kuipers wrote:
>
>> I am running a basic join of 2 tables and it will only run with 1 reducer.
>> why is that? i tried to set the number of reducers and it didn't work.
>> hive
>> just ignored it.
>>
>> create table z as select x.* from table1 x join table2 y where (
>> x.col1 = y.col1 and
>> x.col2 = y.col2 and
>> x.col3 = y.col3 and
>> x.col4 = y.col4 and
>> x.col5 = y.col5
>> );
>>
>> both tables are backed by multiple files / blocks / chunks
>>
>>
> --
> Wojciech Langiewicz
>


why 1 reducer on simple join?

2012-01-12 Thread Koert Kuipers
I am running a basic join of 2 tables and it will only run with 1 reducer.
why is that? i tried to set the number of reducers and it didn't work. hive
just ignored it.

create table z as select x.* from table1 x join table2 y where (
x.col1 = y.col1 and
x.col2 = y.col2 and
x.col3 = y.col3 and
x.col4 = y.col4 and
x.col5 = y.col5
);

both tables are backed by multiple files / blocks / chunks


Re: Do some basic operations on my hive warehouse from java

2011-10-19 Thread Koert Kuipers
Thanks for your reply Edward.

I see you use CommandProcessorFactory, and then you pass the actual commands
in as strings, such as:
doHiveCommand("create table bla (id int)", c)

However i assume this somewhere gets translated into normal Java commands
inside of hive? Something like Someclass.createTable(someinput). I would
prefer to work at that level instead of having to translate my commands
first into Strings which then get passed into a CommandProcessor.

With a remote connection (to a hive server) i can do just that:
TTransport transport = new TSocket(host, port);
TProtocol protocol = new TBinaryProtocol(transport);
HiveClient client = new HiveClient(protocol);
client.drop_table(database, table, true)

Do you know of any similar interface for a local hive? Thanks

On Wed, Oct 19, 2011 at 5:28 PM, Edward Capriolo wrote:

>
>
> On Wed, Oct 19, 2011 at 5:18 PM, Koert Kuipers  wrote:
>
>> I have the need to do some cleanup on my hive warehouse from java, such as
>> deleting tables (both in metastore and the files on hdfs)
>>
>> I found out how to do this using a remote connection:
>> org.apache.hadoop.hive.service.HiveClient connects to a hive server with
>> only a few lines of code, and it provides simple java methods such as
>> drop_table to do my tasks. Neat!
>>
>> But how do i achieve the same result if i do not want a remote connection
>> via hive server, but instead i want to use a direct connection?
>>
>> Thanks! Koert
>>
>
> You have to be careful when you do this. The reason is because the code
> below the CLI is not designed direct use. The API can change etc although
> they do not change much. That being said I do this :). The way to do it is
> you gut the CLI and make a stand alone application.
>
> You can then launch that application with 'hive --service jar /path/to/jar
> my.class.name '
>
> An example is a tool I use to generate show create table statements.
> https://issues.apache.org/jira/browse/HIVE-967
>
> Or my github project for unit testing hive work.
> https://github.com/edwardcapriolo/hive_test
>
> Again, hive was not designed to work this way.
>
>
>


Re: authorization and remote connection (on cdh3u1)

2011-10-19 Thread Koert Kuipers
If i give grants to the user that is specified in my hive-site.xml to
connect to metastore (javax.jdo.option.ConnectionUserName) then i can create
tables and such using remote hive connection. So it seems it is doing the
authorization checks against that user, and not the user that is actually
logged in?

I  thought the actual username was passed along in thrift if authorization
was enabled, and that the actual username would be used for authorization.
Am i wrong about this?


On Wed, Oct 19, 2011 at 6:01 PM, Koert Kuipers  wrote:

> Using a normal hive connection and authorization it seems to work for me:
> hive> revoke all on database default from user koert;
> OK
> Time taken: 0.043 seconds
> hive> create table tmp(x string);
> Authorization failed:No privilege 'Create' found for outputs {
> database:default}. Use show grant to get more details.
> hive> grant all on database default to user koert;
> OK
> Time taken: 0.052 seconds
> hive> create table tmp(x string);
> OK
> Time taken: 0.187 seconds
>
> However when i now switch to a remote connection, it does not work for me:
> [node01:1] hive> create table tmp123(x string);
> [Hive Error]: Query returned non-zero code: 403, cause: null
>
> The logs for the hive server show:
> Authorization failed:No privilege 'Create' found for outputs {
> database:default}. Use show grant to get more details.
>
> What am i doing wrong? Both the hive server and my local hive have in their
> site.xml:
>   <property>
>     <name>hive.security.authorization.enabled</name>
>     <value>true</value>
>     <final>true</final>
>   </property>
>
>   <property>
>     <name>hive.security.authorization.createtable.owner.grants</name>
>     <value>ALL</value>
>     <final>true</final>
>   </property>
>
>


authorization and remote connection (on cdh3u1)

2011-10-19 Thread Koert Kuipers
Using a normal hive connection and authorization it seems to work for me:
hive> revoke all on database default from user koert;
OK
Time taken: 0.043 seconds
hive> create table tmp(x string);
Authorization failed:No privilege 'Create' found for outputs {
database:default}. Use show grant to get more details.
hive> grant all on database default to user koert;
OK
Time taken: 0.052 seconds
hive> create table tmp(x string);
OK
Time taken: 0.187 seconds

However when i now switch to a remote connection, it does not work for me:
[node01:1] hive> create table tmp123(x string);
[Hive Error]: Query returned non-zero code: 403, cause: null

The logs for the hive server show:
Authorization failed:No privilege 'Create' found for outputs {
database:default}. Use show grant to get more details.

What am i doing wrong? Both the hive server and my local hive have in their
site.xml:
  <property>
    <name>hive.security.authorization.enabled</name>
    <value>true</value>
    <final>true</final>
  </property>

  <property>
    <name>hive.security.authorization.createtable.owner.grants</name>
    <value>ALL</value>
    <final>true</final>
  </property>


Do some basic operations on my hive warehouse from java

2011-10-19 Thread Koert Kuipers
I have the need to do some cleanup on my hive warehouse from java, such as
deleting tables (both in metastore and the files on hdfs)

I found out how to do this using a remote connection:
org.apache.hadoop.hive.service.HiveClient connects to a hive server with
only a few lines of code, and it provides simple java methods such as
drop_table to do my tasks. Neat!

But how do i achieve the same result if i do not want a remote connection
via hive server, but instead i want to use a direct connection?

Thanks! Koert


Re: Exception when joining HIVE tables

2011-09-21 Thread Koert Kuipers
"select * from table" does not use map-reduce
so it seems your error has to do with hadoop/map-reduce, not hive
i would run some test for map-reduce

On Wed, Sep 21, 2011 at 4:11 PM, Krish Khambadkone
wrote:

> Hi,  I get this exception when I try to join two hive tables or even when I
> use a specific WHERE clause.   "SELECT *" from any individual table seems to
> work fine.  Any idea what is missing here.  I am on hive version
> hive-0.7.0-cdh3u0.
>
> java.lang.IllegalArgumentException: Can not create a Path from an empty
> string
>at org.apache.hadoop.fs.Path.checkPathArg(Path.java:82)
>at org.apache.hadoop.fs.Path.<init>(Path.java:90)
>at org.apache.hadoop.fs.Path.<init>(Path.java:50)
>at
> org.apache.hadoop.mapred.JobClient.copyRemoteFiles(JobClient.java:608)
>at
> org.apache.hadoop.mapred.JobClient.copyAndConfigureFiles(JobClient.java:713)
>at
> org.apache.hadoop.mapred.JobClient.copyAndConfigureFiles(JobClient.java:637)
>at org.apache.hadoop.mapred.JobClient.access$300(JobClient.java:170)
>at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:848)
>at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:396)
>at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
>at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
>at
> org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:657)
>at
> org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
>at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
>at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>at
> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:164)
>at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:241)
>at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:456)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Job Submission failed with exception
> 'java.lang.IllegalArgumentException(Can not create a Path from an empty
> string)'
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.MapRedTask
>


remove duplicates based on one (or a few) columns

2011-09-14 Thread Koert Kuipers
what is the easiest way to remove rows which are considered duplicates based
upon a few columns in the rows?
so "create table deduped as select distinct * from table" won't do...
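
One plain-HiveQL sketch, assuming duplicates are defined by (col1, col2) and
that keeping an arbitrary value (here the minimum) of the remaining columns is
acceptable; all table and column names are illustrative:

create table deduped as
select col1, col2, min(col3) as col3, min(col4) as col4
from sometable
group by col1, col2;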


Re: Hive thrift interface and user permissions / user auditing

2011-09-07 Thread Koert Kuipers
ah sorry the user "thrift" that was a typo by me. it actually says ugi=hive
in my logs. i missed that the first time you asked.

regarding JDBC, i was mainly interested if JDBC would be able to make a
connection to a secure cluster at all, and if so, then my question would
be if it uses the credentials of the jdbc connection.

thanks again! best koert

On Wed, Sep 7, 2011 at 11:29 AM, Ashutosh Chauhan wrote:

> I assume when you say thrift interface, you mean a separate metastore
> process running. If so,
>
> >> Do i understand it correctly that the thrift interface does provide a
> way to communicate the identity but in unsecured mode it is not being used?
> Yes. Better way to say this is identity is communicated only in case of
> secure mode.
>
> >> And does this mean that if i care about seeing the correct user execute
> the query in the logs, i have to use secure hadoop (with Kerberos)?
> Yes. Though technically it is possible to achieve this even without secure
> hadoop, it's not the case currently, mainly because logging identities in an
> unsecure environment is useless anyway, since one can easily impersonate
> another and the whole point of logging is then lost.
>
> >> Does secure mode support hive JDBC?
> I am not sure about this. Do you mean users and their roles as they exist
> in hive metastore and if you make a jdbc connection using credentials stored
> in it?
>
> By the way, I am still confused about user "thrift". Is there any process
> run by user "thrift"
>
> Hope it helps,
> Ashutosh
>
> On Tue, Sep 6, 2011 at 09:09, Koert Kuipers  wrote:
>
>> The metastore is running as user "hive", and we are indeed running in
>> unsecured mode.
>> Do i understand it correctly that the thrift interface does provide a
>> way to communicate the identity but in unsecured mode it is not being used?
>> And does this mean that if i care about seeing the correct user execute
>> the query in the logs, i have to use secure hadoop (with Kerberos)?
>> Does secure mode support hive JDBC?
>> Thanks! Koert
>>
>>
>> On Tue, Sep 6, 2011 at 11:47 AM, Ashutosh Chauhan 
>> wrote:
>>
>>> Hey Koert,
>>>
>>> I am assuming 'thrift' is the name of user through which thrift metastore
>>> is running. I also assume you are running in unsecure mode. If you run with
>>> security turned on, meaning secure hadoop cluster with secure thrift server,
>>> you will see the name of the original user. This is so because in secure
>>> mode, metastore server  proxies the original user through doAs() which
>>> preserves the identity which is not the case in unsecure mode.
>>> Through hive client you see the usernames correctly even In unsecure mode
>>> because its a hive client process (which is run as koert) which does the
>>> filesystem operations.
>>>
>>> Hope it helps,
>>>  Ashutosh
>>>
>>>
>>> On Tue, Sep 6, 2011 at 08:22, Koert Kuipers  wrote:
>>>
>>>> When i run a query from the hive command line client i can see that it
>>>> is being run as me (for example, in HDFS log i see INFO
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=koert).
>>>>
>>>> But when i do anything with the thrift interface my username is lost (i
>>>> see ugi=thrift in HDFS logs). Is there a way in the thrift interface to
>>>> communicate/preserve the username?
>>>> And if this is possible in thrift, then what about jdbc? i tried
>>>> creating a jdbc connection with username and password passed in but as far
>>>> as i can see it is ignored (ugi=thrift again in the HDFS logs).
>>>>
>>>>
>>>
>>
>


Re: Hive thrift interface and user permissions / user auditing

2011-09-06 Thread Koert Kuipers
The metastore is running as user "hive", and we are indeed running in unsecured
mode.
Do i understand it correctly that the thrift interface does provide a way
to communicate the identity but in unsecured mode it is not being used?
And does this mean that if i care about seeing the correct user execute the
query in the logs, i have to use secure hadoop (with Kerberos)?
Does secure mode support hive JDBC?
Thanks! Koert

On Tue, Sep 6, 2011 at 11:47 AM, Ashutosh Chauhan wrote:

> Hey Koert,
>
> I am assuming 'thrift' is the name of user through which thrift metastore
> is running. I also assume you are running in unsecure mode. If you run with
> security turned on, meaning secure hadoop cluster with secure thrift server,
> you will see the name of the original user. This is so because in secure
> mode, metastore server  proxies the original user through doAs() which
> preserves the identity which is not the case in unsecure mode.
> Through hive client you see the usernames correctly even In unsecure mode
> because its a hive client process (which is run as koert) which does the
> filesystem operations.
>
> Hope it helps,
>  Ashutosh
>
>
> On Tue, Sep 6, 2011 at 08:22, Koert Kuipers  wrote:
>
>> When i run a query from the hive command line client i can see that it is
>> being run as me (for example, in HDFS log i see INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=koert).
>>
>> But when i do anything with the thrift interface my username is lost (i
>> see ugi=thrift in HDFS logs). Is there a way in the thrift interface to
>> communicate/preserve the username?
>> And if this is possible in thrift, then what about jdbc? i tried creating
>> a jdbc connection with username and password passed in but as far as i can
>> see it is ignored (ugi=thrift again in the HDFS logs).
>>
>>
>


Hive thrift interface and user permissions / user auditing

2011-09-06 Thread Koert Kuipers
When i run a query from the hive command line client i can see that it is
being run as me (for example, in HDFS log i see INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=koert).

But when i do anything with the thrift interface my username is lost (i see
ugi=thrift in HDFS logs). Is there a way in the thrift interface to
communicate/preserve the username?
And if this is possible in thrift, then what about jdbc? i tried creating a
jdbc connection with username and password passed in but as far as i can see
it is ignored (ugi=thrift again in the HDFS logs).


Re: UDAF and group by

2011-09-05 Thread Koert Kuipers
Thanks for your response.
My thinking was that by turning off hive.map.aggr hive would do the
following:
col3 becomes the key in mapping. All rows with same col3 go to same reducer.
In the reducer the values (=col1,col2) are sorted by key (=col3) and myUdf
iterates over the values, with terminate() and init() being called
when the key changes. If this is how it is implemented then what would be
the situation where merge() and terminatePartial() would be called?
I have run this with 200mm+ rows of data (but with all groups less than 10k
in size) so far without any calls to merge() or terminatePartial().
Best, Koert

On Sun, Sep 4, 2011 at 11:10 PM, Huan Li  wrote:

> Setting hive.map.aggr false will reduce the chance of terminatePartial()
> and merge() being called. Though I don't think it will eliminate the
> possibility. If your data is large, it's still possible that a group of data
> is processed by multiple reducers and those two methods are needed.
>
> If you need to process records in each group in a single method, you can
> first use collect_set to collect your group data and process them in a UDF.
>
>
> 2011/9/4 Koert Kuipers 
>
>> Hey, my question wasn't very clear. I have a UDAF that I apply per group.
>> The UDAF does not support terminatePartial() and merge(). So to do this i
>> run:
>>
>> set hive.map.aggr=false;
>> select myUdf(col1, col2) from table group by col3;
>>
>> Now this seems to work. But are my assumptions correct that this will
>> never call terminatePartial() or merge()?
>> Thanks Koert
>>
>>
>> On Thu, Sep 1, 2011 at 11:59 PM, Huan Li  wrote:
>>
>>> Koert, Not sure what you mean by "results can be merged between groups".
>>> A UDAF should be used to aggregate records by group. Why the need to merge
>>> between groups?
>>>
>>> Can you give some examples of what kind of query you'd like to run?
>>>
>>>
>>> 2011/8/30 Koert Kuipers 
>>>
>>>> If i run my own UDAF with group by, can i be sure that a single UDAF
>>>> instance initialized once will process all members in a group? Or should i
>>>> code so as to take into account the situation where even within a group
>>>> multiple UDAFs could run, and i would have to deal with terminatePartial()
>>>> and merge() even within a group?
>>>> My problem is that my results within a group are not easily merged, but
>>>> between groups they are.
>>>>
>>>
>>>
>>
>


Re: UDAF and group by

2011-09-04 Thread Koert Kuipers
Hey, my question wasn't very clear. I have a UDAF that I apply per group.
The UDAF does not support terminatePartial() and merge(). So to do this i
run:

set hive.map.aggr=false;
select myUdf(col1, col2) from table group by col3;

Now this seems to work. But are my assumptions correct that this will never
call terminatePartial() or merge()?
Thanks Koert

On Thu, Sep 1, 2011 at 11:59 PM, Huan Li  wrote:

> Koert, Not sure what you mean by "results can be merged between groups".
> A UDAF should be used to aggregate records by group. Why the need to merge
> between groups?
>
> Can you give some examples of what kind of query you'd like to run?
>
>
> 2011/8/30 Koert Kuipers 
>
>> If i run my own UDAF with group by, can i be sure that a single UDAF
>> instance initialized once will process all members in a group? Or should i
>> code so as to take into account the situation where even within a group
>> multiple UDAFs could run, and i would have to deal with terminatePartial()
>> and merge() even within a group?
>> My problem is that my results within a group are not easily merged, but
>> between groups they are.
>>
>
>


UDAF and group by

2011-08-30 Thread Koert Kuipers
If i run my own UDAF with group by, can i be sure that a single UDAF
instance initialized once will process all members in a group? Or should i
code so as to take into account the situation where even within a group
multiple UDAFs could run, and i would have to deal with terminatePartial()
and merge() even within a group?
My problem is that my results within a group are not easily merged, but
between groups they are.
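
A sketch of the collect_set approach suggested in the replies above: gather
each group into an array first and post-process it with a regular UDF, so no
partial aggregates ever need to be merged. my_process_udf is a made-up UDF
name and the columns are illustrative; note that collect_set drops duplicate
values within a group and holds the whole group in memory:

select col3, my_process_udf(collect_set(col1)) as result
from sometable
group by col3;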


Re: Re: multiple tables join with only one huge table.

2011-08-13 Thread Koert Kuipers
I am not aware of any optimization that does something like that. Anyone?
Also your suggestion means 10 hash tables would have to be in memory.

I think that with a normal map-reduce join in hive you can join 10 tables at
once (meaning in a single map-reduce) if they all join on the same key.
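
For concreteness, the MAPJOIN-hint style Ayon describes in the quoted message
below would look roughly like this; the join keys (store_id, product_id,
period_id) are made up for the example, and whether hive collapses the chained
map-joins into a single map-only job depends on the version and the plan it
produces:

select /*+ MAPJOIN(s, p, d) */ f.*
from sale_fact f
join stores s on f.store_id = s.store_id
join products p on f.product_id = p.product_id
join period d on f.period_id = d.period_id;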

2011/8/13 Daniel,Wu 

> Thanks, it works, but not as effectively as possible:
>
> suppose we join 10 small tables (s1,s2...s10) with one huge table (big) in
> a data warehouse system (the join is between the big table and the small
> tables, like a star schema). After I set the parameters as you described, it
> runs 10 mapside joins; after one mapside join completes, it writes huge data
> to the file system (as one table is huge), and the next mapside join then
> needs to read that huge data to do another mapside join. So in total we read
> the huge data 11 times and write it 10 times (only the last write returns a
> small data volume). The best execution plan I can think of is: first build 10
> hash tables, one for each small table, then loop over each row in the big
> table; if the row survives, output it, if not, discard it. This way we only
> need to read the big data once, instead of read big data, write big data,
> read big data, ...
>
> the flow is:
> 1: build 10 hash tables
> 2: foreach row in the big table
>  probe the row with each of these 10 hash tables
>  if it matches all 10 hash tables, go to the next step (output, etc)
>  else discard the row.
> end loop
>
>
> At 2011-08-13 01:17:16,"Koert Kuipers"  wrote:
>
> A mapjoin does what you described: it builds hash tables for the smaller
> tables. In recent versions of hive (like the one i am using with cloudera
> cdh3u1) a mapjoin will be done for you automatically if you have your
> parameters set correctly. The relevant parameters in hive-site.xml are:
> hive.auto.convert.join, hive.mapjoin.maxsize and
> hive.mapjoin.smalltable.filesize. On the hive command line it will tell you
> that it is building the hashtable, and it will not run a reducer.
>
> On Thu, Aug 11, 2011 at 10:25 PM, Ayon Sinha  wrote:
>
>> The Mapjoin hint syntax helps optimize by loading the smaller tables
>> specified in the Mapjoin hint into memory. Then every small table is in
>> memory of each mapper.
>>
>> -Ayon
>> See My Photos on Flickr <http://www.flickr.com/photos/ayonsinha/>
>> Also check out my Blog for answers to commonly asked 
>> questions.<http://dailyadvisor.blogspot.com>
>>
>> --
>> *From:* "Daniel,Wu" 
>> *To:* hive 
>> *Sent:* Thursday, August 11, 2011 7:01 PM
>> *Subject:* multiple tables join with only one huge table.
>>
>> if the retailer fact table is sale_fact with 10B rows, and join with 3
>> small tables: stores (10K), products(10K), period (1K). What's the best join
>> solution?
>>
>> In oracle, it can first build hash for stores, and hash for products, and
>> hash for period. Then probe using the fact table; if the row matches in
>> stores, that row can go up further to map with products by hashing check, if
>> pass, then go up further to try to match period. In this way, the sale_fact
>> only needs to be scanned once which save lots of disk IO.  Is this doable in
>> hive, if doable, what hint need to use?
>>
>>
>>
>>
>>
>
>
>


Re: multiple tables join with only one huge table.

2011-08-12 Thread Koert Kuipers
A mapjoin does what you described: it builds hash tables for the smaller
tables. In recent versions of hive (like the one i am using with cloudera
cdh3u1) a mapjoin will be done for you automatically if you have your
parameters set correctly. The relevant parameters in hive-site.xml are:
hive.auto.convert.join, hive.mapjoin.maxsize and
hive.mapjoin.smalltable.filesize. On the hive command line it will tell you
that it is building the hashtable, and it will not run a reducer.

On Thu, Aug 11, 2011 at 10:25 PM, Ayon Sinha  wrote:

> The Mapjoin hint syntax helps optimize by loading the smaller tables
> specified in the Mapjoin hint into memory. Then every small table is in
> memory of each mapper.
>
> -Ayon
> See My Photos on Flickr 
> Also check out my Blog for answers to commonly asked 
> questions.
>
> --
> *From:* "Daniel,Wu" 
> *To:* hive 
> *Sent:* Thursday, August 11, 2011 7:01 PM
> *Subject:* multiple tables join with only one huge table.
>
> if the retailer fact table is sale_fact with 10B rows, and join with 3
> small tables: stores (10K), products(10K), period (1K). What's the best join
> solution?
>
> In oracle, it can first build hash for stores, and hash for products, and
> hash for period. Then probe using the fact table; if the row matches in
> stores, that row can go up further to map with products by hashing check, if
> pass, then go up further to try to match period. In this way, the sale_fact
> only needs to be scanned once which save lots of disk IO.  Is this doable in
> hive, if doable, what hint need to use?
>
>
>
>
>


Re: Lzo Compression

2011-07-27 Thread Koert Kuipers
i just tried your steps 10 and 11 on my setup and it works. See my earlier
message about my setup.

On Wed, Jul 27, 2011 at 9:37 AM, Ankit Jain  wrote:

> Hi all,
> I tried to index the lzo file but got the following error while indexing
> the lzo file :
>
> java.lang.ClassCastException:
> com.hadoop.compression.lzo.LzopCodec$LzopDecompressor cannot be cast to
> com.hadoop.compression.lzo.LzopDecompressor
>
> I have performed following steps:-
>
> 1. $sudo apt-get install liblzo2-dev
> 2. Download the Hadoop-lzo clone from github repo (
> https://github.com/kevinweil/hadoop-lzo )
> 3. Build the Hadoop-lzo project.
> 4. Copy the hadoop-lzo-*.jar file at $HADOOP_HOME/lib dir of cluster nodes
> 5. Copy the hadoop-lzo-install-dir/build/hadoop-lzo-*/native library at
> $HADOOP_HOME/lib dir of cluster nodes.
> 6. core-site.xml:
>
> <property>
>   <name>io.compression.codecs</name>
>   <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
> </property>
>
> <property>
>   <name>io.compression.codec.lzo.class</name>
>   <value>com.hadoop.compression.lzo.LzoCodec</value>
> </property>
> 7. mapred-site.xml:
>
> <property>
>   <name>mapred.child.env</name>
>   <value>JAVA_LIBRARY_PATH=/opt/ladap/common/hadoop-0.20.2/lib/native/Linux-i386-32/*</value>
> </property>
>
> <property>
>   <name>mapred.map.output.compression.codec</name>
>   <value>com.hadoop.compression.lzo.LzoCodec</value>
> </property>
> 8. hadoop-env.sh
>
> export
> HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/ankit/hadoop-0.20.1/lib/hadoop-lzo-0.4.12.jar
> export
> JAVA_LIBRARY_PATH=/home/ankit/hadoop-0.20.1/lib/native/Linux-i386-32/
>
> 9. Restart the cluster
>
> 10. Uploaded the lzo file into hdfs
>
> 11. Ran the following command for indexing:
> bin/hadoop jar path/to/hadoop-lzo-*.jar
> com.hadoop.compression.lzo.LzoIndexer lzofile.lzo
>
>
>
>
> On Tue, Jul 26, 2011 at 1:39 PM, Koert Kuipers  wrote:
>
>> my installation notes for lzo-hadoop (might be wrong or incomplete):
>>
>> we run centos 5.6 and cdh3
>>
>> yum -y install lzo
>> git clone https://github.com/toddlipcon/hadoop-lzo.git
>> cd hadoop-lzo
>> ant
>> cd build
>> cp hadoop-lzo-0.4.10/hadoop-lzo-0.4.10.jar /usr/lib/hadoop/lib
>> cp -r hadoop-lzo-0.4.10/lib/native /usr/lib/hadoop/lib
>>
>>
>> in core-site.xml:
>>   <property>
>>     <name>io.compression.codecs</name>
>>     <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
>>     <final>true</final>
>>   </property>
>>
>>   <property>
>>     <name>io.compression.codec.lzo.class</name>
>>     <value>com.hadoop.compression.lzo.LzoCodec</value>
>>     <final>true</final>
>>   </property>
>>
>>
>> in mapred-site.xml:
>>   <property>
>>     <name>mapred.compress.map.output</name>
>>     <value>true</value>
>>     <final>false</final>
>>   </property>
>>
>>   <property>
>>     <name>mapred.map.output.compression.codec</name>
>>     <value>com.hadoop.compression.lzo.LzoCodec</value>
>>     <final>false</final>
>>   </property>
>>
>>   <property>
>>     <name>mapred.output.compress</name>
>>     <value>true</value>
>>     <final>false</final>
>>   </property>
>>
>>   <property>
>>     <name>mapred.output.compression.codec</name>
>>     <value>com.hadoop.compression.lzo.LzoCodec</value>
>>     <final>false</final>
>>   </property>
>>
>>   <property>
>>     <name>mapred.output.compression.type</name>
>>     <value>BLOCK</value>
>>     <final>false</final>
>>   </property>
>>
>>
>> On Mon, Jul 25, 2011 at 7:14 PM, Alejandro Abdelnur wrote:
>>
>>> Vikas,
>>>
>>> You should be able to use the Snappy codec doing some minor tweaks
>>> from http://code.google.com/p/hadoop-snappy/ until a Hadoop releases
>>> with Snappy support.
>>>
>>> Thxs.
>>>
>>> Alejandro.
>>>
>>> On Mon, Jul 25, 2011 at 4:04 AM, Vikas Srivastava
>>>  wrote:
>>> > Hey ,
>>> >
>>> > i just want to use any compression in hadoop so i heard about lzo which
>>> is
>>> > best among all the compression (after snappy)
>>> >
>>> > please any1 tell me who is already using any kind of compression in
>>> hadoop
>>> > 0.20.2
>>> >
>>> >
>>> >
>>> > --
>>> > With Regards
>>> > Vikas Srivastava
>>> >
>>> > DWH & Analytics Team
>>> > Mob:+91 9560885900
>>> > One97 | Let's get talking !
>>> >
>>>
>>
>>
>


Re: Lzo Compression

2011-07-26 Thread Koert Kuipers
my installation notes for lzo-hadoop (might be wrong or incomplete):

we run centos 5.6 and cdh3

yum -y install lzo
git clone https://github.com/toddlipcon/hadoop-lzo.git
cd hadoop-lzo
ant
cd build
cp hadoop-lzo-0.4.10/hadoop-lzo-0.4.10.jar /usr/lib/hadoop/lib
cp -r hadoop-lzo-0.4.10/lib/native /usr/lib/hadoop/lib


in core-site.xml:
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
    <final>true</final>
  </property>

  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
    <final>true</final>
  </property>


in mapred-site.xml:
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
    <final>false</final>
  </property>

  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
    <final>false</final>
  </property>

  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
    <final>false</final>
  </property>

  <property>
    <name>mapred.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
    <final>false</final>
  </property>

  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
    <final>false</final>
  </property>

On Mon, Jul 25, 2011 at 7:14 PM, Alejandro Abdelnur wrote:

> Vikas,
>
> You should be able to use the Snappy codec doing some minor tweaks
> from http://code.google.com/p/hadoop-snappy/ until a Hadoop releases
> with Snappy support.
>
> Thxs.
>
> Alejandro.
>
> On Mon, Jul 25, 2011 at 4:04 AM, Vikas Srivastava
>  wrote:
> > Hey ,
> >
> > i just want to use any compression in hadoop so i heard about lzo which
> is
> > best among all the compression (after snappy)
> >
> > please any1 tell me who is already using any kind of compression in
> hadoop
> > 0.20.2
> >
> >
> >
> > --
> > With Regards
> > Vikas Srivastava
> >
> > DWH & Analytics Team
> > Mob:+91 9560885900
> > One97 | Let's get talking !
> >
>
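
The cluster-wide mapred settings shown above can also be approximated per hive
session; a sketch (the table names are illustrative, and the choice between
LzoCodec and LzopCodec is up to you):

set hive.exec.compress.output=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
set mapred.output.compression.type=BLOCK;

create table events_lzo stored as sequencefile
as select * from events;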


What sort of sequencefiles are created in hive

2011-07-25 Thread Koert Kuipers
Knowing that sequencefiles can store data (especially numeric data) much
more compactly than text, i started converting our hive database from lzo
compressed text format to lzo compressed sequencefiles.

My first observation was that the files were not smaller, which surprised me
since we have mostly numerical data which has a more compact binary
representation.

So then i issued some "describe extended" queries to poke around in the
sequencefile format used by hive. And it seems that 1) the keys are not
used, and 2) all the values are simply stored as a Text Writable? Is this
simply a copy of the textual representation which was used in the text
files? That would explain why the data did not get any smaller. But it also
would defeat all the benefits of sequencefiles, no?

Thanks Koert


Re: conversion of left outer join to mapjoin

2011-07-25 Thread Koert Kuipers
anyone any idea? this seems like very strange behavior to me. and it blows
up the job.

On Fri, Jul 22, 2011 at 5:51 PM, Koert Kuipers  wrote:

> hello,
> we have 2 tables x and y. table x is 11GB on disk and has 23M rows. table y
> is 3GB on disk and has 28M rows. Both tables are stored as LZO compressed
> sequencefiles without bucketing.
>
> a normal join of x an y gets executed as a map-reduce-join in hive and
> works very well. an outer join also gets executed as a map-reduce-join and
> again works well.
> but a left outer join gets converted in a map-join which results in a
> OutOfMemoryError (GC overhead limit exceeded).
>
> the mapjoin related parameters in my hive-settings.xml are:
> hive.auto.convert.join=true
> hive.mapjoin.maxsize=10
> hive.mapjoin.smalltable.filesize=2500
>
> why does the left outer join get converted into map-join? it seems like my
> table sizes are way beyond where a map-join should be attempted, no?
>
>


Re: Lzo Compression

2011-07-25 Thread Koert Kuipers
I have LZO compression enabled by default in hadoop 0.20.2 and hive 0.7.0
and it works well so far.

On Mon, Jul 25, 2011 at 7:04 AM, Vikas Srivastava <
vikas.srivast...@one97.net> wrote:

> Hey ,
>
> i just want to use any compression in hadoop so i heard about lzo which is
> best among all the compression (after snappy)
>
> please any1 tell me who is already using any kind of compression in hadoop
> 0.20.2
>
>
>
> --
> With Regards
> Vikas Srivastava
>
> DWH & Analytics Team
> Mob:+91 9560885900
> One97 | Let's get talking !
>
>


conversion of left outer join to mapjoin

2011-07-22 Thread Koert Kuipers
hello,
we have 2 tables x and y. table x is 11GB on disk and has 23M rows. table y
is 3GB on disk and has 28M rows. Both tables are stored as LZO compressed
sequencefiles without bucketing.

a normal join of x an y gets executed as a map-reduce-join in hive and works
very well. an outer join also gets executed as a map-reduce-join and again
works well.
but a left outer join gets converted in a map-join which results in a
OutOfMemoryError (GC overhead limit exceeded).

the mapjoin related parameters in my hive-settings.xml are:
hive.auto.convert.join=true
hive.mapjoin.maxsize=10
hive.mapjoin.smalltable.filesize=2500

why does the left outer join get converted into map-join? it seems like my
table sizes are way beyond where a map-join should be attempted, no?
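
Until the cause is found, a per-query workaround is to switch the automatic
conversion off so the left outer join runs as a regular map-reduce join again;
a sketch (the join key, y.val and the output table name are made up):

set hive.auto.convert.join=false;
create table x_left_y as
select x.*, y.val
from x left outer join y on (x.k = y.k);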


Re: hive mapjoin decision process

2011-07-19 Thread Koert Kuipers
thanks.
changing mapred.child.java.opts from -Xmx512m to -Xmx1024m did the trick,
allocating more memory to the map-reduce child JVMs.

On Tue, Jul 19, 2011 at 6:49 PM, yongqiang he wrote:

> >> i thought only one table needed to be small?
> Yes.
>
> >> hive.mapjoin.maxsize also apply to big table?
> No.
>
> >> i made sure hive.mapjoin.smalltable.filesize and hive.mapjoin.maxsize
> are set large enough to accomodate the small table. yet hive does not
> attempt to do a mapjoin.
>
> There are physical limitations. If the local machine can not hold all
> records in memory locally, the local hashmap has to fail. So check
> your machine's memory or the memory allocated for hive.
>
> Thanks
> Yongqiang
> On Tue, Jul 19, 2011 at 1:55 PM, Koert Kuipers  wrote:
> > thanks!
> > i only see hive create the hashmap dump and perform mapjoin if both
> tables
> > are small. i thought only one table needed to be small?
> >
> > i try to merge a very large table with a small table. i made sure
> > hive.mapjoin.smalltable.filesize and hive.mapjoin.maxsize are set large
> > enough to accommodate the small table. yet hive does not attempt to do a
> > mapjoin. does hive.mapjoin.maxsize also apply to big table? or do i need
> to
> > look at other parameters as well?
> >
> > On Tue, Jul 19, 2011 at 4:15 PM, yongqiang he 
> > wrote:
> >>
> >> in most cases, the mapjoin falls back to normal join because of one of
> >> these three reasons:
> >> 1) the input table size is very big, so there will be no try on mapjoin
> >> 2) if one of the input tables is small (let's say less than 25MB, which
> >> is configurable), hive will try a local hashmap dump. If it causes OOM
> >> on the client side when doing the local hashmap dump, it will go back to a
> >> normal join. The reason here is mostly due to very good compression on
> >> the input data.
> >> 3) the mapjoin actually got started and fails. it will fall back to a
> >> normal join. This is most unlikely to happen
> >>
> >> Thanks
> >> Yongqiang
> >> On Tue, Jul 19, 2011 at 11:16 AM, Koert Kuipers 
> wrote:
> >> > note: this is somewhat a repost of something i posted on the CDH3 user
> >> > group. apologies if that is not appropriate.
> >> >
> >> > i am exploring map-joins in hive. with hive.auto.convert.join=true
> hive
> >> > tries to do a map-join and then falls back on a mapreduce-join if
> >> > certain
> >> > conditions are not met. this sounds great. but when i do a
> >> > query and i notice it falls back on a mapreduce-join, how can i see
> >> > which
> >> > condition triggered the fallback (smalltable.filesize or
> >> > mapjoin.maxsize or
> >> > something else perhaps memory related)?
> >> >
> >> > i tried reading the default log that a hive session produces, but it
> >> > seems
> >> > more like a massive json file than a log to me, so it is very hard for
> >> > me to
> >> > interpret that. i also turned on logging to console with debugging,
> >> > looking
> >> > for any clues there but without luck so far. is the info there and am
> i
> >> > just
> >> > overlooking it? any ideas?
> >> >
> >> > thanks! koert
> >> >
> >> >
> >> >
> >
> >
>


Re: hive mapjoin decision process

2011-07-19 Thread Koert Kuipers
thanks!
i only see hive create the hashmap dump and perform mapjoin if both tables
are small. i thought only one table needed to be small?

i try to merge a very large table with a small table. i made sure
hive.mapjoin.smalltable.filesize and hive.mapjoin.maxsize are set large
enough to accommodate the small table. yet hive does not attempt to do a
mapjoin. does hive.mapjoin.maxsize also apply to big table? or do i need to
look at other parameters as well?

On Tue, Jul 19, 2011 at 4:15 PM, yongqiang he wrote:

> in most cases, the mapjoin falls back to normal join because of one of
> these three reasons:
> 1) the input table size is very big, so there will be no try on mapjoin
> 2) if one of the input tables is small (let's say less than 25MB, which
> is configurable), hive will try a local hashmap dump. If it causes OOM
> on the client side when doing the local hashmap dump, it will go back to a
> normal join. The reason here is mostly due to very good compression on
> the input data.
> 3) the mapjoin actually got started and fails. it will fall back to a
> normal join. This is most unlikely to happen
>
> Thanks
> Yongqiang
> On Tue, Jul 19, 2011 at 11:16 AM, Koert Kuipers  wrote:
> > note: this is somewhat a repost of something i posted on the CDH3 user
> > group. apologies if that is not appropriate.
> >
> > i am exploring map-joins in hive. with hive.auto.convert.join=true hive
> > tries to do a map-join and then falls back on a mapreduce-join if certain
> > conditions are not met. this sounds great. but when i do a
> > query and i notice it falls back on a mapreduce-join, how can i see which
> > condition triggered the fallback (smalltable.filesize or mapjoin.maxsize
> or
> > something else perhaps memory related)?
> >
> > i tried reading the default log that a hive session produces, but it
> seems
> > more like a massive json file than a log to me, so it is very hard for me
> to
> > interpret that. i also turned on logging to console with debugging,
> looking
> > for any clues there but without luck so far. is the info there and am i
> just
> > overlooking it? any ideas?
> >
> > thanks! koert
> >
> >
> >
>


remote hive metastore

2011-07-19 Thread Koert Kuipers
i am testing running a remote hive metastore. i understand that the client
communicates with the metastore via thrift.
now is it the case that the client still communicates with HDFS directly?

in the metastore i see logs for all the actions that i perform on the
client. but they show up like this:

11/07/19 14:43:36 INFO HiveMetaStore.audit: ugi=hdfs ip=/192.168.1.157
cmd=get_table : db=default tbl=things

what confuses me here is the ugi=hdfs. that is not me. that is the user that
runs the metastore! does this mean my setup is broken?
in HDFS the files do show up with the right user.

thanks koert


hive mapjoin decision process

2011-07-19 Thread Koert Kuipers
note: this is somewhat a repost of something i posted on the CDH3 user
group. apologies if that is not appropriate.

i am exploring map-joins in hive. with hive.auto.convert.join=true hive
tries to do a map-join and then falls back on a mapreduce-join if certain
conditions are not met. this sounds great. but when i do a
query and i notice it falls back on a mapreduce-join, how can i see which
condition triggered the fallback (smalltable.filesize or mapjoin.maxsize or
something else perhaps memory related)?

i tried reading the default log that a hive session produces, but it seems
more like a massive json file than a log to me, so it is very hard for me to
interpret that. i also turned on logging to console with debugging, looking
for any clues there but without luck so far. is the info there and am i just
overlooking it? any ideas?

thanks! koert
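
One place to look is the query plan itself: with hive.auto.convert.join
enabled, EXPLAIN (or EXPLAIN EXTENDED) should show the conditional map-join
stage together with its backup common-join stage, which at least tells you
whether a map-join was considered for the query at all. A sketch with made-up
table names and join key:

set hive.auto.convert.join=true;
explain extended
select b.* from big_table b join small_table s on (b.k = s.k);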