Organising Hive Scripts

2015-09-11 Thread Charles Mean
Hello,

I am working with a huge Hive script whose organisation I would like to
improve for easier maintenance in the future.
Looking into this issue, I did not find any kind of include mechanism or
similar way to split my script into smaller parts.
So, is there some sort of pattern that is commonly used to improve script
structure?


mapjoin with left join

2015-09-11 Thread Steve Howard
We would like to utilize mapjoin for the following SQL construct:

select small.* from small s left join large l on s.id = l.id where l.id is
null;

We can easily fit small into RAM, but large is over 1TB according to
optimizer stats. Unless we set
hive.auto.convert.join.noconditionaltask.size to at least the size of
"large", the optimizer falls back to a common (shuffle) join, which is
incredibly slow.

Given that it is a left join, which means we won't always have rows in
large for each row in small, is this behavior expected? Could it be that
reading the large table would miss the new rows in small, so the large one
has to be the one that is probed for matches?

We simply want to load the 81K rows of small into RAM, then for each row in
large, probe the small hash table; any row in small that finds no match in
large is then added to large.

Again, the optimizer will use a mapjoin if we set
hive.auto.convert.join.noconditionaltask.size = 1TB (the size of the large
table). This is, of course, not practical. The small table is only 50MB.
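
For reference, a minimal sketch of the relevant session settings (the
threshold value is illustrative, sized to hold only the small table):

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- ~50MB: enough for "small", far below the 1TB of "large"
set hive.auto.convert.join.noconditionaltask.size=52428800;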

At the link below is the entire test case with two tables, one of which has
three rows and the other 96. We can reproduce it with tables this small,
which leads me to believe I am missing something, or this is a bug.

The link has the source code that shows each table create, as well as the
explain output with a value for hive.auto.convert.join.noconditionaltask.size
that is passed at the command line. The output shows a mergejoin when
hive.auto.convert.join.noconditionaltask.size is less than 192 (the size of
the larger table), and a mapjoin when it is larger than 192 (the large
table fits).

http://pastebin.com/Qg6hb8yV

The business case is loading only new rows into a large fact table.  The
new rows are the ones that are small in number.


Fwd: Hive Server Load Data InPath Fails

2015-09-11 Thread Vineet Mishra
Hi All,

I am making a Hive Thrift connection to a Hive Server, and a load data
inpath command against one of my tables is hitting a bizarre exception.

Other queries seem to work fine; this is the only query that fails:

15/09/11 11:40:32 ERROR utils.HCatalogUtil: Executing Query: load data
inpath '/data/mount/si/piu/' overwrite into table scnrt2.piu
HiveServerException(message:Query returned non-zero code: 10001, cause:
FAILED: SemanticException [Error 10001]: Line 1:130 Table not found 'piu',
errorCode:10001, SQLState:42S02)

The table is created in the scnrt2 database and exists before the command is
invoked.

I have even tried executing "use scnrt2" before firing the load data, but
still no luck, while the same query executes fine from the Hive command
line.

Any quick feedback would be appreciated!

Thanks!


Re: Error: java.lang.IllegalArgumentE:Column has wrong number of index entries found - when trying to insert from JSON external table to ORC table

2015-09-11 Thread Daniel Haviv
Hi Prasanth,
Can you elaborate on what the hive.merge.orcfile.stripe.level parameter
affects?

Thank you for your help.
Daniel

Sent from my iPhone

> On 8 Sep 2015, at 17:48, Prasanth Jayachandran wrote:
> 
> hive.merge.orcfile.stripe.level
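
For reference, a hedged note from the HiveConf description of that flag (as
of Hive 0.14, if memory serves): when one of the hive.merge.* small-file
merge options is enabled for an ORC table, it merges small ORC files at the
stripe level, concatenating existing stripes rather than decompressing and
re-encoding rows. A sketch of enabling it:

set hive.merge.mapfiles=true;              -- or hive.merge.mapredfiles / hive.merge.tezfiles
set hive.merge.orcfile.stripe.level=true;  -- stripe-level fast merge of small ORC files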


Re: Organising Hive Scripts

2015-09-11 Thread Charles Mean
Great, Dmitry,

It will certainly help me a lot.
I will give it a try, thank you very much for your help.

On Fri, Sep 11, 2015 at 4:34 PM, Dmitry Tolpeko wrote:



Hive Macros roadmap

2015-09-11 Thread Elliot West
Hi,

I noticed the Hive Macro feature some time ago. To me at least this seemed
like an excellent addition to HQL, allowing the user to encapsulate complex
column logic as an independent, reusable HQL macro while avoiding the
complexities of Java UDFs. However, few people seem to be aware of them or
use them. If you are unfamiliar with macros, they look like this:

hive> create temporary macro MYSIGMOID(x DOUBLE)
> 2.0 / (1.0 + exp(-x));
OK

hive> select MYSIGMOID(1.0) from dual;
OK

1.4621171572600098
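
(Macros are temporary, i.e. scoped to the current session; a hedged sketch
of removing one:)

hive> drop temporary macro MYSIGMOID;
OK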


As far as I can tell, they are no longer documented on the Hive wiki. There
is a tiny reference to them in the O'Reilly 'Programming Hive' book (page
185). Can anyone advise me on the following:

   - Are there plans to keep or remove this functionality?
   - Are there plans to document this functionality?
   - Aside from the limitations of HQL, are there compelling reasons not to
   use macros?

Thanks - Elliot.


Re: confluence access

2015-09-11 Thread Lefty Leverenz
Done.  Welcome to the Hive wiki team, Wojciech!

-- Lefty

On Fri, Sep 11, 2015 at 6:30 AM, Wojciech Indyk wrote:



Re: Hive Macros roadmap

2015-09-11 Thread Edward Capriolo
Macros are in and tested. No one will remove them. The unit tests ensure
they keep working.

On Fri, Sep 11, 2015 at 3:38 PM, Elliot West wrote:



Re: mapjoin with left join

2015-09-11 Thread Sergey Shelukhin
As far as I know it's not currently supported.
The large table will be streamed in multiple tasks with the small table in
memory, so there's no one place that knows for sure that a particular small
table row had no match in the large table across all of those tasks. It
could have no match in one task but a match in another task.
You can try rewriting the query as an inner join unioned with "not in", but
"not in" might still be slow…
IIRC there was actually a JIRA to solve this, but no work has been done so
far.
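
One hedged reading of the "not in" part of that rewrite (assuming Hive
0.13+ subquery support; the inner filter matters because NOT IN returns no
rows at all if large.id can be NULL):

select s.*
from small s
where s.id not in (select l.id from large l where l.id is not null);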

From: Steve Howard
Reply-To: "user@hive.apache.org"
Date: Friday, September 11, 2015 at 09:48
To: "user@hive.apache.org"
Subject: mapjoin with left join



Re: Hive Server Load Data InPath Fails

2015-09-11 Thread Takahiko Saito
What version of Hive is being used?
In the past we had the following JIRA, which may be related:
https://issues.apache.org/jira/browse/HIVE-4256: JDBC2 HiveConnection does
not use the specified database.
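
If that JIRA applies, one hedged check is to run the USE and the LOAD as
two statements on the same connection, since the current database is
per-session state:

use scnrt2;
load data inpath '/data/mount/si/piu/' overwrite into table piu;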

On Fri, Sep 11, 2015 at 12:44 AM, Vineet Mishra wrote:



-- 
Takahiko Saito


How to use the Hive Lineage Service

2015-09-11 Thread sumit ghosh
Hi,

I am trying to use the lineage service built into Hive. I need the tables
used, the columns at the source, and how they are related to the target.
Hive has this lineage service:

hive --service lineage `cat myQuery`

However, it always errors out, failing to parse the Hive query. What am I
doing wrong here, or is this not the way to pull lineage information at all?

Related links:
JIRA: https://issues.apache.org/jira/browse/HIVE-147
Code: http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop.hive/hive-exec/0.7.1-cdh3u1/org/apache/hadoop/hive/ql/tools/LineageInfo.java#LineageInfo.getLineageInfo%28java.lang.String%29

Thanks,
Sumit

Re: Organising Hive Scripts

2015-09-11 Thread Dmitry Tolpeko
Charles,

Not sure what you can do in the Hive CLI right now, but consider the new
Hive HPL/SQL component, which will be included in upcoming Hive releases
and which you can currently compile and run separately; see
https://github.com/apache/hive/tree/master/hplsql or www.hplsql.org

It supports include files, user-defined procedures and functions,
exception-based error handling, flow-of-control statements and so on. It is
new and under active development, so any feedback is welcome.
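
A minimal HPL/SQL sketch exercising those features (file and object names
here are hypothetical; syntax per the hplsql.org documentation):

INCLUDE /scripts/common_setup.sql;        -- include another script file

CREATE PROCEDURE daily_load(dt STRING)    -- user-defined procedure
BEGIN
  INSERT INTO TABLE fact_sales
  SELECT * FROM staging_sales WHERE ds = dt;
END;

CALL daily_load('2015-09-11');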

Dmitry

On Fri, Sep 11, 2015 at 8:32 PM, Charles Mean wrote:



confluence access

2015-09-11 Thread Wojciech Indyk
Hello!
Please grant me write access to Confluence (user woj_in), per
https://issues.apache.org/jira/browse/HIVE-11329?focusedCommentId=14740243&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740243

--
Kind regards / Pozdrawiam,
Wojciech Indyk


Checking the number of Readers

2015-09-11 Thread James Pirz
I am using Hive 1.2.0 on Hadoop 2.6 (on a cluster with 10 machines) and I
am trying to understand the performance of a full-table scan. I am running
the following query:

SELECT * FROM LINEITEM
WHERE L_LINENUMBER < 0;

and I am measuring its performance in different scenarios: using MR vs.
Tez, and with different table types/formats (an external table on text
data, or ORC).

My question is:
What is the best way to check the number of readers (scanners) that Hive
uses in parallel to read the data?

My data is in HDFS and on each node I have 1 datanode process running which
writes its blocks into 3 separate paths (each path persists its data on a
separate disk).

I tried to get this information using "explain" or from the available
consoles, but I could not find it. Checking the number of established
connections to the datanode's data transfer port (using the command below)
gives me 12, but I am not sure if I am looking at the correct metric:

netstat -anp | grep -w 50010 | grep ESTABLISHED | wc -l
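
(For reference, the number of parallel readers normally corresponds to the
number of input splits, i.e. map tasks, which the job summary prints; a
hedged sketch of the split-size settings that influence that count, with
property names from Hadoop/Tez and illustrative 256MB values:)

set mapreduce.input.fileinputformat.split.maxsize=268435456;
set tez.grouping.max-size=268435456;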


Any help would be appreciated.

Thanks