RE: Multiple Insert with Where Clauses

2013-07-30 Thread Sha Liu
Doesn't INSERT INTO do what you said? I'm not sure I understand "inserting a 
few records into a table".
Anyway, the problem here seems different to me. In my case the WHERE clauses in 
a multiple-insert statement seem to have no effect, and Hive doesn't complain 
about them.
-Sha

Date: Tue, 30 Jul 2013 21:06:22 -0700
Subject: Re: Multiple Insert with Where Clauses
From: bruder...@radiumone.com
To: user@hive.apache.org

Hive doesn't support inserting a few records into a table. You will need to 
write a query to union your select and then insert. If you can partition, then 
you can insert a whole partition at a time instead of the whole table.
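
For example, a sketch of that union-then-overwrite approach (table names are 
placeholders):

INSERT OVERWRITE TABLE target_table
SELECT u.col1, u.col2, u.col3
FROM (
  SELECT col1, col2, col3 FROM target_table
  UNION ALL
  SELECT col1, col2, col3 FROM new_rows
) u;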

Thanks,
Brad

On Tue, Jul 30, 2013 at 9:04 PM, Sha Liu  wrote:




Yes for the example you gave, it works. It even works when there is a single 
insert under the from clause, but when there are multiple inserts, the where 
clauses seem no longer effective.


Date: Tue, 30 Jul 2013 20:29:19 -0700
Subject: Re: Multiple Insert with Where Clauses
From: bruder...@radiumone.com
To: user@hive.apache.org


Have you simply tried
INSERT OVERWRITE TABLE destination
SELECT col1, col2, col3
FROM source
WHERE col4 = 'abc'

Thanks!



On Tue, Jul 30, 2013 at 8:25 PM, Sha Liu  wrote:




Hi Hive Gurus,
When using the Hive extension of multiple inserts, can we add Where clauses for 
each Select statement, like the following?
FROM ...

INSERT OVERWRITE TABLE ...
SELECT col1, col2, col3
WHERE col4='abc'
INSERT OVERWRITE TABLE ...
SELECT col1, col4, col2
WHERE col3='xyz'


The underlined parts didn't cause any errors, but they didn't seem to be 
effective either (I'm using Hive 0.9). Note that the columns used in the Where 
clauses are not among the selected ones, but I'm not sure if that is important. 
Is this kind of operation supported?


Thanks,
Sha Liu

  

  

Re: Multiple Insert with Where Clauses

2013-07-30 Thread Brad Ruderman
Hive doesn't support inserting a few records into a table. You will need to
write a query to union your select and then insert. If you can partition,
then you can insert a whole partition at a time instead of the whole table.

Thanks,
Brad


On Tue, Jul 30, 2013 at 9:04 PM, Sha Liu  wrote:

> Yes for the example you gave, it works. It even works when there is a
> single insert under the from clause, but when there are multiple inserts,
> the where clauses seem no longer effective.
>
> --
> Date: Tue, 30 Jul 2013 20:29:19 -0700
> Subject: Re: Multiple Insert with Where Clauses
> From: bruder...@radiumone.com
> To: user@hive.apache.org
>
>
> Have you simply tried
>
> INSERT OVERWRITE TABLE destination
> SELECT col1, col2, col3
> FROM source
> WHERE col4 = 'abc'
>
> Thanks!
>
>
>
> On Tue, Jul 30, 2013 at 8:25 PM, Sha Liu  wrote:
>
> Hi Hive Gurus,
>
> When using the Hive extension of multiple inserts, can we add Where
> clauses for each Select statement, like the following?
>
> FROM ...
> INSERT OVERWRITE TABLE ...
> SELECT col1, col2, col3
> *WHERE col4='abc'*
> INSERT OVERWRITE TABLE ...
> SELECT col1, col4, col2
> *WHERE col3='xyz'*
> The underlined parts didn't cause any errors, but they didn't seem to be
> effective either (I'm using Hive 0.9). Note that the columns used in the
> Where clauses are not among the selected ones, but I'm not sure if that is
> important. Is this kind of operation supported?
>
> Thanks,
> Sha Liu
>
>
>


RE: Multiple Insert with Where Clauses

2013-07-30 Thread Sha Liu
Yes for the example you gave, it works. It even works when there is a single 
insert under the from clause, but when there are multiple inserts, the where 
clauses seem no longer effective.

Date: Tue, 30 Jul 2013 20:29:19 -0700
Subject: Re: Multiple Insert with Where Clauses
From: bruder...@radiumone.com
To: user@hive.apache.org

Have you simply tried
INSERT OVERWRITE TABLE destination
SELECT col1, col2, col3
FROM source
WHERE col4 = 'abc'
Thanks!



On Tue, Jul 30, 2013 at 8:25 PM, Sha Liu  wrote:




Hi Hive Gurus,
When using the Hive extension of multiple inserts, can we add Where clauses for 
each Select statement, like the following?
FROM ...
INSERT OVERWRITE TABLE ...
SELECT col1, col2, col3
WHERE col4='abc'
INSERT OVERWRITE TABLE ...
SELECT col1, col4, col2
WHERE col3='xyz'

The underlined parts didn't cause any errors, but they didn't seem to be 
effective either (I'm using Hive 0.9). Note that the columns used in the Where 
clauses are not among the selected ones, but I'm not sure if that is important. 
Is this kind of operation supported?

Thanks,
Sha Liu

  

Re: Multiple Insert with Where Clauses

2013-07-30 Thread Brad Ruderman
Have you simply tried

INSERT OVERWRITE TABLE destination
SELECT col1, col2, col3
FROM source
WHERE col4 = 'abc'

Thanks!



On Tue, Jul 30, 2013 at 8:25 PM, Sha Liu  wrote:

> Hi Hive Gurus,
>
> When using the Hive extension of multiple inserts, can we add Where
> clauses for each Select statement, like the following?
>
> FROM ...
> INSERT OVERWRITE TABLE ...
> SELECT col1, col2, col3
> *WHERE col4='abc'*
> INSERT OVERWRITE TABLE ...
> SELECT col1, col4, col2
> *WHERE col3='xyz'*
> The underlined parts didn't cause any errors, but they didn't seem to be
> effective either (I'm using Hive 0.9). Note that the columns used in the
> Where clauses are not among the selected ones, but I'm not sure if that is
> important. Is this kind of operation supported?
>
> Thanks,
> Sha Liu
>


Multiple Insert with Where Clauses

2013-07-30 Thread Sha Liu
Hi Hive Gurus,
When using the Hive extension of multiple inserts, can we add Where clauses for 
each Select statement, like the following?
FROM ...
INSERT OVERWRITE TABLE ...
SELECT col1, col2, col3
WHERE col4='abc'
INSERT OVERWRITE TABLE ...
SELECT col1, col4, col2
WHERE col3='xyz'
The underlined parts didn't cause any errors, but they didn't seem to be 
effective either (I'm using Hive 0.9). Note that the columns used in the Where 
clauses are not among the selected ones, but I'm not sure if that is important. 
Is this kind of operation supported?
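
For reference, the multi-insert form documented in the Hive LanguageManual does 
let each INSERT branch carry its own WHERE; a sketch with placeholder names:

FROM source_table src
INSERT OVERWRITE TABLE dest1
  SELECT src.col1, src.col2, src.col3
  WHERE src.col4 = 'abc'
INSERT OVERWRITE TABLE dest2
  SELECT src.col1, src.col4, src.col2
  WHERE src.col3 = 'xyz';
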
Thanks,
Sha Liu

Re: UDFs with package names

2013-07-30 Thread Edward Capriolo
It might be a better idea to use your own package, e.g. com.mystuff.x. You might
be running into an issue where Java is not finding the file because it
assumes the relation between package and jar is 1 to 1. You might also be
compiling it wrong: if your package is com.mystuff, that class file should be in
a directory structure com/mystuff/WhateverUDF.class, and I am not seeing that
in your example.
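
A sketch of that layout, assuming UDFRowSequence2.java now starts with
"package com.mystuff;" (the package name and paths are placeholders):

mkdir -p classes
javac -cp /usr/lib/hive/lib/hive-exec-0.10.0-cdh4.3.0.jar:/usr/lib/hadoop/hadoop-common.jar \
  -d classes UDFRowSequence2.java    # -d writes classes/com/mystuff/UDFRowSequence2.class
jar cvf UDFRowSequence2.jar -C classes .   # the jar root now mirrors the package path
sudo cp UDFRowSequence2.jar /usr/local/lib

hive> add jar /usr/local/lib/UDFRowSequence2.jar;
hive> create temporary function row_sequence as 'com.mystuff.UDFRowSequence2';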


On Tue, Jul 30, 2013 at 8:00 PM, Michael Malak wrote:

> Thus far, I've been able to create Hive UDFs, but now I need to define
> them within a Java package name (as opposed to the "default" Java package
> as I had been doing), but once I do that, I'm no longer able to load them
> into Hive.
>
> First off, this works:
>
> add jar /usr/lib/hive/lib/hive-contrib-0.10.0-cdh4.3.0.jar;
> create temporary function row_sequence as
> 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
>
> Then I took the source code for UDFRowSequence.java from
>
> http://svn.apache.org/repos/asf/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/udf/UDFRowSequence.java
>
> and renamed the file and the class inside to UDFRowSequence2.java
>
> I compile and deploy it with:
> javac -cp
> /usr/lib/hive/lib/hive-exec-0.10.0-cdh4.3.0.jar:/usr/lib/hadoop/hadoop-common.jar
> UDFRowSequence2.java
> jar cvf UDFRowSequence2.jar UDFRowSequence2.class
> sudo cp UDFRowSequence2.jar /usr/local/lib
>
>
> But in Hive, I get the following:
> hive>  add jar /usr/local/lib/UDFRowSequence2.jar;
> Added /usr/local/lib/UDFRowSequence2.jar to class path
> Added resource: /usr/local/lib/UDFRowSequence2.jar
> hive> create temporary function row_sequence as
> 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence2';
> FAILED: Class org.apache.hadoop.hive.contrib.udf.UDFRowSequence2 not found
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.FunctionTask
>
> But if I comment out the package line in UDFRowSequence2.java (to put the
> UDF into the default Java package), it works:
> hive>  add jar /usr/local/lib/UDFRowSequence2.jar;
> Added /usr/local/lib/UDFRowSequence2.jar to class path
> Added resource: /usr/local/lib/UDFRowSequence2.jar
> hive> create temporary function row_sequence as 'UDFRowSequence2';
> OK
> Time taken: 0.383 seconds
>
> What am I doing wrong?  I have a feeling it's something simple.
>
>


Re: Hive Join with distinct rows

2013-07-30 Thread Sunita Arvind
Thanks for sharing your experience, Marcin

Sunita


On Tue, Jul 30, 2013 at 11:54 AM, Marcin Mejran  wrote:

>  I’ve used a rank udf for this previously, distribute and sort by the
> column then select all rows where rank=1. That should work with a join but
> I never tried it. It’d be an issue if the join outputs a lot of records
> that then are all dropped since that’d slow down the query.
>
>
>
> I’ve actually forked Hive internally and added a distinct join based on
> the, now deprecated I guess, unique join code. It’s ugly in terms of syntax
> and I haven’t had a chance to open source it but it allows a good amount of
> control over what is joined to what (i.e. select the row in table A whose
> column x is closest to column y in table B, for example request time). I
> really wish Hive had better support for such “non-SQL” types of queries
> which are common in a world of unstructured and un-clean data.
>
>
>
> -Marcin
>
>
>
> *From:* Sunita Arvind [mailto:sunitarv...@gmail.com]
> *Sent:* Tuesday, July 30, 2013 11:00 AM
> *To:* user@hive.apache.org
> *Subject:* Hive Join with distinct rows
>
>
>
> Hi Praveen / All,
>
> I also have a requirement similar to the one explained (by Praveen) below:
> distinct rows on a single column with corresponding data from other
> columns.
>
>
> http://mail-archives.apache.org/mod_mbox/hive-user/201211.mbox/%3ccahmb8ta+r0h5a+armutookhkp8fxctown68qoz6lkfmwbrk...@mail.gmail.com%3E
> 
>
> This email thread dates back to Nov 2012 and is a very common use case. I
> just wanted to check if there is a solution already or we still need to
> write a UDF.
>
> regards
> Sunita
>


UDFs with package names

2013-07-30 Thread Michael Malak
Thus far, I've been able to create Hive UDFs, but now I need to define them 
within a Java package name (as opposed to the "default" Java package as I had 
been doing), but once I do that, I'm no longer able to load them into Hive.

First off, this works:

add jar /usr/lib/hive/lib/hive-contrib-0.10.0-cdh4.3.0.jar;
create temporary function row_sequence as 
'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

Then I took the source code for UDFRowSequence.java from
http://svn.apache.org/repos/asf/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/udf/UDFRowSequence.java

and renamed the file and the class inside to UDFRowSequence2.java

I compile and deploy it with:
javac -cp 
/usr/lib/hive/lib/hive-exec-0.10.0-cdh4.3.0.jar:/usr/lib/hadoop/hadoop-common.jar
 UDFRowSequence2.java
jar cvf UDFRowSequence2.jar UDFRowSequence2.class
sudo cp UDFRowSequence2.jar /usr/local/lib


But in Hive, I get the following:
hive>  add jar /usr/local/lib/UDFRowSequence2.jar;
Added /usr/local/lib/UDFRowSequence2.jar to class path
Added resource: /usr/local/lib/UDFRowSequence2.jar
hive> create temporary function row_sequence as 
'org.apache.hadoop.hive.contrib.udf.UDFRowSequence2';
FAILED: Class org.apache.hadoop.hive.contrib.udf.UDFRowSequence2 not found
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.FunctionTask

But if I comment out the package line in UDFRowSequence2.java (to put the UDF 
into the default Java package), it works:
hive>  add jar /usr/local/lib/UDFRowSequence2.jar;
Added /usr/local/lib/UDFRowSequence2.jar to class path
Added resource: /usr/local/lib/UDFRowSequence2.jar
hive> create temporary function row_sequence as 'UDFRowSequence2';
OK
Time taken: 0.383 seconds

What am I doing wrong?  I have a feeling it's something simple.



Review Request (wikidoc): LZO Compression in Hive

2013-07-30 Thread Sanjay Subramanian
Hi

Met with Lefty this afternoon and she was kind enough to spend time adding my 
documentation to the site - since I still don't have editing privileges :-)

Please review the new wikidoc about LZO compression in the Hive language 
manual.  If anything is unclear or needs more information, you can email 
suggestions to this list or edit the wiki yourself (if you have editing 
privileges).  Here are the links:

  1.  Language Manual (new bullet under File Formats)
  2.  LZO Compression
  3.  CREATE TABLE (near end of section, pasted in here:)
Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use 
STORED AS SEQUENCEFILE if the data needs to be compressed. Please read more 
about 
CompressedStorage
 if you are planning to keep data compressed in your Hive tables. Use 
INPUTFORMAT and OUTPUTFORMAT to specify the name of a corresponding InputFormat 
and OutputFormat class as a string literal, e.g., 
'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For 
LZO compression, the values to use are 'INPUTFORMAT 
"com.hadoop.mapred.DeprecatedLzoTextInputFormat" OUTPUTFORMAT 
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"' (see LZO 
Compression).
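
Put together, a minimal sketch of such a table definition (the table and column 
names are placeholders; the InputFormat/OutputFormat classes are the ones named 
above):

CREATE TABLE lzo_demo (id INT, msg STRING)
STORED AS
  INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
  OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";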

My cwiki id is
https://cwiki.apache.org/confluence/display/~sanjaysubraman...@yahoo.com
It would be great if I could get edit privileges.

Thanks
sanjay

CONFIDENTIALITY NOTICE
==
This email message and any attachments are for the exclusive use of the 
intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized review, use, disclosure or distribution is prohibited. If you 
are not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message along with any attachments, from 
your computer system. If you are the intended recipient, please be advised that 
the content of this message is subject to access, review and disclosure by the 
sender's Email System Administrator.


Re: Write access for the wiki

2013-07-30 Thread Ashutosh Chauhan
Done. Added you as a contributor.
Happy Documenting !!

Ashutosh


On Tue, Jul 30, 2013 at 2:15 PM, Mark Wagner wrote:

> Yes, I created it right before emailing the list:
> https://cwiki.apache.org/confluence/display/~mwagner
>
>
> On Tue, Jul 30, 2013 at 1:45 PM, Ashutosh Chauhan wrote:
>
>> Is that your cwiki id ? I am not seeing it there. Remember cwiki
>> is separate than jira account.
>>
>> Ashutosh
>>
>>
>> On Tue, Jul 30, 2013 at 1:40 PM, Mark Wagner wrote:
>>
>>> My id is mwagner. Thanks!
>>>
>>>
>>> On Tue, Jul 30, 2013 at 1:36 PM, Ashutosh Chauhan 
>>> wrote:
>>>
 Mark,

 Do you have an account on hive cwiki. Whats your id ?

 Thanks,
 Ashutosh


 On Tue, Jul 30, 2013 at 1:06 PM, Mark Wagner 
 wrote:

> Hi all,
>
> Would someone with the right permissions grant me write access to the
> Hive wiki? I'd like to update some information on the Avro Serde.
>
> Thanks,
> Mark
>


>>>
>>
>


Re: Write access for the wiki

2013-07-30 Thread Mark Wagner
Yes, I created it right before emailing the list:
https://cwiki.apache.org/confluence/display/~mwagner


On Tue, Jul 30, 2013 at 1:45 PM, Ashutosh Chauhan wrote:

> Is that your cwiki id ? I am not seeing it there. Remember cwiki
> is separate than jira account.
>
> Ashutosh
>
>
> On Tue, Jul 30, 2013 at 1:40 PM, Mark Wagner wrote:
>
>> My id is mwagner. Thanks!
>>
>>
>> On Tue, Jul 30, 2013 at 1:36 PM, Ashutosh Chauhan 
>> wrote:
>>
>>> Mark,
>>>
>>> Do you have an account on hive cwiki. Whats your id ?
>>>
>>> Thanks,
>>> Ashutosh
>>>
>>>
>>> On Tue, Jul 30, 2013 at 1:06 PM, Mark Wagner wrote:
>>>
 Hi all,

 Would someone with the right permissions grant me write access to the
 Hive wiki? I'd like to update some information on the Avro Serde.

 Thanks,
 Mark

>>>
>>>
>>
>


Re: Write access for the wiki

2013-07-30 Thread Ashutosh Chauhan
Is that your cwiki id? I am not seeing it there. Remember the cwiki account
is separate from the JIRA account.

Ashutosh


On Tue, Jul 30, 2013 at 1:40 PM, Mark Wagner wrote:

> My id is mwagner. Thanks!
>
>
> On Tue, Jul 30, 2013 at 1:36 PM, Ashutosh Chauhan wrote:
>
>> Mark,
>>
>> Do you have an account on hive cwiki. Whats your id ?
>>
>> Thanks,
>> Ashutosh
>>
>>
>> On Tue, Jul 30, 2013 at 1:06 PM, Mark Wagner wrote:
>>
>>> Hi all,
>>>
>>> Would someone with the right permissions grant me write access to the
>>> Hive wiki? I'd like to update some information on the Avro Serde.
>>>
>>> Thanks,
>>> Mark
>>>
>>
>>
>


Re: Write access for the wiki

2013-07-30 Thread Mark Wagner
My id is mwagner. Thanks!


On Tue, Jul 30, 2013 at 1:36 PM, Ashutosh Chauhan wrote:

> Mark,
>
> Do you have an account on hive cwiki. Whats your id ?
>
> Thanks,
> Ashutosh
>
>
> On Tue, Jul 30, 2013 at 1:06 PM, Mark Wagner wrote:
>
>> Hi all,
>>
>> Would someone with the right permissions grant me write access to the
>> Hive wiki? I'd like to update some information on the Avro Serde.
>>
>> Thanks,
>> Mark
>>
>
>


Re: Write access for the wiki

2013-07-30 Thread Ashutosh Chauhan
Mark,

Do you have an account on the Hive cwiki? What's your id?

Thanks,
Ashutosh


On Tue, Jul 30, 2013 at 1:06 PM, Mark Wagner wrote:

> Hi all,
>
> Would someone with the right permissions grant me write access to the Hive
> wiki? I'd like to update some information on the Avro Serde.
>
> Thanks,
> Mark
>


Write access for the wiki

2013-07-30 Thread Mark Wagner
Hi all,

Would someone with the right permissions grant me write access to the Hive
wiki? I'd like to update some information on the Avro Serde.

Thanks,
Mark


Select statements return null

2013-07-30 Thread Sunita Arvind
Hi,

I have written a script which generates JSON files, loads them into a
dictionary, adds a few attributes, and uploads the modified files to HDFS.
After the files are generated, if I perform a "select * from ...;" on the table
which points to this location, I get "null, null" as the result. I also
tried without the added attributes and it did not make a difference. I
strongly suspect the data.
Currently I am using strip() to eliminate trailing and leading whitespace
and newlines. I am wondering if an embedded "\n" (that is, a JSON string
value containing "\n") causes such issues.
There are no parsing errors, so I am not able to debug this issue. Are
there any flags that I can set to figure out what is happening within the
parser code?

I set this:
hive -hiveconf hive.root.logger=DEBUG,console

But the output is not really useful:

blocks=[LocatedBlock{BP-330966259-192.168.1.61-1351349834344:blk_-6076570611719758877_116734;
getBlockSize()=20635; corrupt=false; offset=0; locs=[192.168.1.61:50010,
192.168.1.66:50010, 192.168.1.63:50010]}]

lastLocatedBlock=LocatedBlock{BP-330966259-192.168.1.61-1351349834344:blk_-6076570611719758877_116734;
getBlockSize()=20635; corrupt=false; offset=0; locs=[192.168.1.61:50010,
192.168.1.66:50010, 192.168.1.63:50010]}
  isLastBlockComplete=true}
13/07/30 11:49:41 DEBUG hdfs.DFSClient: Connecting to datanode
192.168.1.61:50010
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
13/07/30 11:49:41 INFO exec.

Also, the attributes I am adding are the current year, month, day, and time, so
they are not null for any record. I even moved the existing files which did not
have these fields set, so that there are no records with these fields as
null. However, I don't think this is an issue, as the advantage of the JSON/Hive
JSON SerDe is that it allows the object struct to be dynamic. Right?

Any suggestion regarding debugging would be very helpful.

thanks
Sunita


Re: Prevent users from killing each other's jobs

2013-07-30 Thread Vinod Kumar Vavilapalli

That is correct. Seems like something else is happening.

One thing to check is whether all your users, or more importantly their group, 
have been added to the cluster-admin ACL (mapreduce.cluster.administrators).

You should look at the MapReduce audit logs (which by default go into the 
JobTracker logs; search for "Audit"). They clearly log which user is killing a job.
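
For example, a hypothetical check (the log location and the exact operation tag 
vary by version and install):

grep -i audit /var/log/hadoop/hadoop-*-jobtracker-*.log | grep KILL_JOB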

Thanks,
+Vinod

On Jul 30, 2013, at 11:31 AM, Murat Odabasi wrote:

> I'm not sure how I should do that.
> 
> The documentation says "A job submitter can specify access control
> lists for viewing or modifying a job via the configuration properties
> mapreduce.job.acl-view-job and mapreduce.job.acl-modify-job
> respectively. By default, nobody is given access in these properties."
> 
> My understanding is no other user should be able to modify a job
> unless explicitly authorized. Is that not the case? Should I set these
> two properties before running the job?
> 
> Thanks.
> 
> 
> On 30 July 2013 19:25, Vinod Kumar Vavilapalli  wrote:
>> 
>> You need to set up Job ACLs. See
>> http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Job+Authorization.
>> 
>> It is a per job configuration, you can provide with defaults. If the job
>> owner wishes to give others access, he/she can do so.
>> 
>> Thanks,
>> +Vinod Kumar Vavilapalli
>> Hortonworks Inc.
>> http://hortonworks.com/
>> 
>> On Jul 30, 2013, at 11:21 AM, Murat Odabasi wrote:
>> 
>> Hi there,
>> 
>> I am trying to introduce some sort of security to prevent different
>> people using the cluster from interfering with each other's jobs.
>> 
>> Following the instructions at
>> http://hadoop.apache.org/docs/stable/cluster_setup.html and
>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-9/security
>> , this is what I put in my mapred-site.xml:
>> 
>> 
>> mapred.task.tracker.task-controller
>> org.apache.hadoop.mapred.LinuxTaskController
>> 
>> 
>> 
>> mapred.acls.enabled
>> true
>> 
>> 
>> I can see the configuration parameters in the job configuration when I
>> run a hive query, but the users are still able to kill each other's
>> jobs.
>> 
>> Any ideas about what I may be missing?
>> Any alternative approaches I can adopt?
>> 
>> Thanks.
>> 
>> 



Re: Prevent users from killing each other's jobs

2013-07-30 Thread pandees waran
Hi Mikhail,

Could you please explain how we can track all the kill requests for a job?
Is there any feature available in the Hadoop stack for this? Or do we need to
track this in the OS layer by capturing the signals?

Thanks,
Pandeesh
On Jul 31, 2013 12:03 AM, "Mikhail Antonov"  wrote:

> In addition to using the job's ACLs you could have a more brutal scheme. Track
> all requests to kill the jobs, and if any request is coming from a user
> who shouldn't be trying to kill this particular job, then ssh from the
> script to his client machine and forcibly reboot it :)
>
>
> 2013/7/30 Edward Capriolo 
>
>> Honestly tell your users to stop being jerks. People know if they kill my
>> query there is going to be hell to pay :)
>>
>>
>> On Tue, Jul 30, 2013 at 2:25 PM, Vinod Kumar Vavilapalli <
>> vino...@apache.org> wrote:
>>
>>>
>>> You need to set up Job ACLs. See
>>> http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Job+Authorization
>>> .
>>>
>>> It is a per job configuration, you can provide with defaults. If the job
>>> owner wishes to give others access, he/she can do so.
>>>
>>>  Thanks,
>>> +Vinod Kumar Vavilapalli
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>>
>>> On Jul 30, 2013, at 11:21 AM, Murat Odabasi wrote:
>>>
>>> Hi there,
>>>
>>> I am trying to introduce some sort of security to prevent different
>>> people using the cluster from interfering with each other's jobs.
>>>
>>> Following the instructions at
>>> http://hadoop.apache.org/docs/stable/cluster_setup.html and
>>>
>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-9/security
>>> , this is what I put in my mapred-site.xml:
>>>
>>> 
>>>  mapred.task.tracker.task-controller
>>>  org.apache.hadoop.mapred.LinuxTaskController
>>> 
>>>
>>> 
>>>  mapred.acls.enabled
>>>  true
>>> 
>>>
>>> I can see the configuration parameters in the job configuration when I
>>> run a hive query, but the users are still able to kill each other's
>>> jobs.
>>>
>>> Any ideas about what I may be missing?
>>> Any alternative approaches I can adopt?
>>>
>>> Thanks.
>>>
>>>
>>>
>>
>
>
> --
> Thanks,
> Michael Antonov
>


Re: Prevent users from killing each other's jobs

2013-07-30 Thread Mikhail Antonov
In addition to using the job's ACLs you could have a more brutal scheme. Track
all requests to kill the jobs, and if any request is coming from a user
who shouldn't be trying to kill this particular job, then ssh from the
script to his client machine and forcibly reboot it :)


2013/7/30 Edward Capriolo 

> Honestly tell your users to stop being jerks. People know if they kill my
> query there is going to be hell to pay :)
>
>
> On Tue, Jul 30, 2013 at 2:25 PM, Vinod Kumar Vavilapalli <
> vino...@apache.org> wrote:
>
>>
>> You need to set up Job ACLs. See
>> http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Job+Authorization
>> .
>>
>> It is a per job configuration, you can provide with defaults. If the job
>> owner wishes to give others access, he/she can do so.
>>
>>  Thanks,
>> +Vinod Kumar Vavilapalli
>> Hortonworks Inc.
>> http://hortonworks.com/
>>
>> On Jul 30, 2013, at 11:21 AM, Murat Odabasi wrote:
>>
>> Hi there,
>>
>> I am trying to introduce some sort of security to prevent different
>> people using the cluster from interfering with each other's jobs.
>>
>> Following the instructions at
>> http://hadoop.apache.org/docs/stable/cluster_setup.html and
>>
>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-9/security
>> , this is what I put in my mapred-site.xml:
>>
>> 
>>  mapred.task.tracker.task-controller
>>  org.apache.hadoop.mapred.LinuxTaskController
>> 
>>
>> 
>>  mapred.acls.enabled
>>  true
>> 
>>
>> I can see the configuration parameters in the job configuration when I
>> run a hive query, but the users are still able to kill each other's
>> jobs.
>>
>> Any ideas about what I may be missing?
>> Any alternative approaches I can adopt?
>>
>> Thanks.
>>
>>
>>
>


-- 
Thanks,
Michael Antonov


Re: Prevent users from killing each other's jobs

2013-07-30 Thread Murat Odabasi
I'm not sure how I should do that.

The documentation says "A job submitter can specify access control
lists for viewing or modifying a job via the configuration properties
mapreduce.job.acl-view-job and mapreduce.job.acl-modify-job
respectively. By default, nobody is given access in these properties."

My understanding is no other user should be able to modify a job
unless explicitly authorized. Is that not the case? Should I set these
two properties before running the job?

Thanks.


On 30 July 2013 19:25, Vinod Kumar Vavilapalli  wrote:
>
> You need to set up Job ACLs. See
> http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Job+Authorization.
>
> It is a per job configuration, you can provide with defaults. If the job
> owner wishes to give others access, he/she can do so.
>
> Thanks,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
> On Jul 30, 2013, at 11:21 AM, Murat Odabasi wrote:
>
> Hi there,
>
> I am trying to introduce some sort of security to prevent different
> people using the cluster from interfering with each other's jobs.
>
> Following the instructions at
> http://hadoop.apache.org/docs/stable/cluster_setup.html and
> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-9/security
> , this is what I put in my mapred-site.xml:
>
> 
>  mapred.task.tracker.task-controller
>  org.apache.hadoop.mapred.LinuxTaskController
> 
>
> 
>  mapred.acls.enabled
>  true
> 
>
> I can see the configuration parameters in the job configuration when I
> run a hive query, but the users are still able to kill each other's
> jobs.
>
> Any ideas about what I may be missing?
> Any alternative approaches I can adopt?
>
> Thanks.
>
>


Re: Prevent users from killing each other's jobs

2013-07-30 Thread Edward Capriolo
Honestly tell your users to stop being jerks. People know if they kill my
query there is going to be hell to pay :)


On Tue, Jul 30, 2013 at 2:25 PM, Vinod Kumar Vavilapalli  wrote:

>
> You need to set up Job ACLs. See
> http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Job+Authorization
> .
>
> It is a per job configuration, you can provide with defaults. If the job
> owner wishes to give others access, he/she can do so.
>
> Thanks,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
> On Jul 30, 2013, at 11:21 AM, Murat Odabasi wrote:
>
> Hi there,
>
> I am trying to introduce some sort of security to prevent different
> people using the cluster from interfering with each other's jobs.
>
> Following the instructions at
> http://hadoop.apache.org/docs/stable/cluster_setup.html and
>
> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-9/security
> , this is what I put in my mapred-site.xml:
>
> 
>  mapred.task.tracker.task-controller
>  org.apache.hadoop.mapred.LinuxTaskController
> 
>
> 
>  mapred.acls.enabled
>  true
> 
>
> I can see the configuration parameters in the job configuration when I
> run a hive query, but the users are still able to kill each other's
> jobs.
>
> Any ideas about what I may be missing?
> Any alternative approaches I can adopt?
>
> Thanks.
>
>
>


Re: Prevent users from killing each other's jobs

2013-07-30 Thread Vinod Kumar Vavilapalli

You need to set up Job ACLs. See 
http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Job+Authorization.

It is a per-job configuration; you can provide defaults. If the job owner 
wishes to give others access, he/she can do so.
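
A hypothetical sketch of such defaults in mapred-site.xml (the user and group 
names are placeholders; the value format is "user1,user2 group1,group2"):

<property>
  <name>mapreduce.job.acl-modify-job</name>
  <value>alice,bob hadoop-ops</value>
</property>
<property>
  <name>mapreduce.job.acl-view-job</name>
  <value>*</value>
</property>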

Thanks,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On Jul 30, 2013, at 11:21 AM, Murat Odabasi wrote:

> Hi there,
> 
> I am trying to introduce some sort of security to prevent different
> people using the cluster from interfering with each other's jobs.
> 
> Following the instructions at
> http://hadoop.apache.org/docs/stable/cluster_setup.html and
> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-9/security
> , this is what I put in my mapred-site.xml:
> 
> <property>
>   <name>mapred.task.tracker.task-controller</name>
>   <value>org.apache.hadoop.mapred.LinuxTaskController</value>
> </property>
>
> <property>
>   <name>mapred.acls.enabled</name>
>   <value>true</value>
> </property>
> 
> I can see the configuration parameters in the job configuration when I
> run a hive query, but the users are still able to kill each other's
> jobs.
> 
> Any ideas about what I may be missing?
> Any alternative approaches I can adopt?
> 
> Thanks.



Prevent users from killing each other's jobs

2013-07-30 Thread Murat Odabasi
Hi there,

I am trying to introduce some sort of security to prevent different
people using the cluster from interfering with each other's jobs.

Following the instructions at
http://hadoop.apache.org/docs/stable/cluster_setup.html and
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-9/security
, this is what I put in my mapred-site.xml:


<property>
  <name>mapred.task.tracker.task-controller</name>
  <value>org.apache.hadoop.mapred.LinuxTaskController</value>
</property>

<property>
  <name>mapred.acls.enabled</name>
  <value>true</value>
</property>


I can see the configuration parameters in the job configuration when I
run a hive query, but the users are still able to kill each other's
jobs.

Any ideas about what I may be missing?
Any alternative approaches I can adopt?

Thanks.


RE: Hive Join with distinct rows

2013-07-30 Thread Marcin Mejran
I've used a rank UDF for this previously: distribute and sort by the column, 
then select all rows where rank=1. That should work with a join, but I never 
tried it. It'd be an issue if the join outputs a lot of records that then are 
all dropped, since that'd slow down the query.

I've actually forked Hive internally and added a distinct join based on the, 
now deprecated I guess, unique join code. It's ugly in terms of syntax and I 
haven't had a chance to open source it, but it allows a good amount of control 
over what is joined to what (i.e. select the row in table A whose column x is 
closest to column y in table B, for example request time). I really wish Hive 
had better support for such "non-SQL" types of queries which are common in a 
world of unstructured and un-clean data.
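
A sketch of that rank-then-filter pattern (my_rank stands for a hypothetical 
user-supplied stateful UDF that restarts its counter whenever the key changes; 
Hive 0.9 has no built-in windowing, and all names are placeholders):

SELECT ranked.key_col, ranked.val, ranked.event_time
FROM (
  SELECT sorted.*, my_rank(sorted.key_col) AS r
  FROM (
    SELECT key_col, val, event_time
    FROM source_table
    DISTRIBUTE BY key_col
    SORT BY key_col, event_time DESC
  ) sorted
) ranked
WHERE ranked.r = 1;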

-Marcin

From: Sunita Arvind [mailto:sunitarv...@gmail.com]
Sent: Tuesday, July 30, 2013 11:00 AM
To: user@hive.apache.org
Subject: Hive Join with distinct rows

Hi Praveen / All,

I also have a requirement similar to the one explained (by Praveen) below:
distinct rows on a single column with corresponding data from other columns.

http://mail-archives.apache.org/mod_mbox/hive-user/201211.mbox/%3ccahmb8ta+r0h5a+armutookhkp8fxctown68qoz6lkfmwbrk...@mail.gmail.com%3E
This email thread dates back to Nov 2012 and is a very common use case. I just 
wanted to check if there is a solution already or we still need to write a UDF.
regards
Sunita


Hive Join with distinct rows

2013-07-30 Thread Sunita Arvind
Hi Praveen / All,

I also have a requirement similar to the one explained (by Praveen) below:
distinct rows on a single column with corresponding data from other columns.

http://mail-archives.apache.org/mod_mbox/hive-user/201211.mbox/%3ccahmb8ta+r0h5a+armutookhkp8fxctown68qoz6lkfmwbrk...@mail.gmail.com%3E

This email thread dates back to Nov 2012 and is a very common use case. I
just wanted to check if there is a solution already or we still need to
write a UDF.

regards
Sunita


Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors

2013-07-30 Thread Nitin Pawar
The code path below is taken when the Thrift metastore client-server connection 
runs in unsecured mode, so one way to avoid this is to use the secure mode.


public boolean process(final TProtocol in, final TProtocol out) throws TException {
  setIpAddress(in);
  ...
  ...
  ...
  @Override
  protected void setIpAddress(final TProtocol in) {
    TUGIContainingTransport ugiTrans =
        (TUGIContainingTransport) in.getTransport();
    Socket socket = ugiTrans.getSocket();
    if (socket != null) {
      setIpAddress(socket);




From the above code snippet, it looks like the null pointer exception is
not handled if getSocket() returns null.

Can you check what the ulimit setting is on the server? If it's set to the
default, can you set it to unlimited and restart the HCat server? (This is just
a wild guess.)
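
For example (hypothetical commands on the metastore host; raising hard limits 
may require editing /etc/security/limits.conf):

ulimit -n          # show the current open-file limit for this shell
ulimit -n 65536    # raise it before restarting the metastore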

Also, the getSocket method documentation suggests: "If the underlying TTransport
is an instance of TSocket, it returns the Socket object which it contains.
Otherwise it returns null."

So one of the Thrift gurus needs to tell us what's happening; I have no
knowledge of this depth.

Maybe Ashutosh or Thejas will be able to help on this.
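
As for the secure mode mentioned above, a hypothetical sketch of enabling it in 
hive-site.xml (it requires a working Kerberos setup; the principal and keytab 
values are placeholders):

<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>/etc/hive/conf/hive.keytab</value>
</property>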




From the netstat CLOSE_WAIT output, it looks like the hive metastore server has
not closed the connection (I do not know why yet); maybe the Hive dev guys can
help. Are there too many connections in CLOSE_WAIT state?



On Tue, Jul 30, 2013 at 5:52 AM, agateaaa  wrote:

> Looking at the hive metastore server logs see errors like these:
>
> 2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer
> (TThreadPoolServer.java:run(182)) - Error occurred during processing of
> message.
> java.lang.NullPointerException
> at
>
> org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
> at
>
> org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
> at
>
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>  at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
>
> approx same time as we see timeout or connection reset errors.
>
> Dont know if this is the cause or the side affect of he connection
> timeout/connection reset errors. Does anybody have any pointers or
> suggestions ?
>
> Thanks
>
>
> On Mon, Jul 29, 2013 at 11:29 AM, agateaaa  wrote:
>
> > Thanks Nitin!
> >
> > We have a similar setup (identical hcatalog and hive server versions) on
> > another production environment and don't see any errors (it's been running
> > ok
> > for a few months)
> >
> > Unfortunately we wont be able to move to hcat 0.5 and hive 0.11 or hive
> > 0.10 soon.
> >
> > I did see that the last time we ran into this problem, doing a netstat -ntp
> > | grep ":1" showed that the server was holding on to one socket connection
> > in CLOSE_WAIT state for a long time
> > (hive metastore server is running on port 1). Don't know if that's
> > relevant here or not
> >
> > Can you suggest any hive configuration settings we can tweak or
> networking
> > tools/tips, we can use to narrow this down ?
> >
> > Thanks
> > Agateaaa
> >
> >
> >
> >
> > On Mon, Jul 29, 2013 at 11:02 AM, Nitin Pawar  >wrote:
> >
> >> Is there any chance you can do an update on a test environment with
> hcat-0.5
> >> and hive-0.11 or 0.10 and see if you can reproduce the issue?
> >>
> >> We used to see this error when there was load on hcat server or some
> >> network issue connecting to the server(second one was rare occurrence)
> >>
> >>
> >> On Mon, Jul 29, 2013 at 11:13 PM, agateaaa  wrote:
> >>
> >>> Hi All:
> >>>
> >>> We are running into frequent problem using HCatalog 0.4.1 (HIve
> Metastore
> >>> Server 0.9) where we get connection reset or connection timeout errors.
> >>>
> >>> The hive metastore server has been allocated enough (12G) memory.
> >>>
> >>> This is a critical problem for us and would appreciate if anyone has
> any
> >>> pointers.
> >>>
> >>> We did add a retry logic in our client, which seems to help, but I am
> >>> just
> >>> wondering how can we narrow down to the root cause
> >>> of this problem. Could this be a hiccup in networking which causes the
> >>> hive
> >>> server to get into a unresponsive state  ?
> >>>
> >>> Thanks
> >>>
> >>> Agateaaa
> >>>
> >>>
> >>> Example Connection reset error:
> >>> ===
> >>>
> >>> org.apache.thrift.transport.TTransportException:
> >>> java.net.SocketException:
> >>> Connection reset
> >>> at
> >>>
> >>>
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
> >>>  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> >>> at
> >>>
> >>>
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
> >>>  at
> >>>
> >>>
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
> >>> at
> >>>
> >>>
> org.apache.thrift.pr

Re: PL/SQL to HiveQL translation

2013-07-30 Thread Jérôme Verdier
Hi,

Thanks for this link, it was very helpful :-)

I have another question.

I have some HiveQL scripts which are stored in .hql files.

What is the best way to execute these scripts with a Java/JDBC program?

Thanks.
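
One hedged sketch using the HiveServer2 JDBC driver Brendan mentioned (the 
host, port, credentials, and .hql path are placeholders; it assumes hive-jdbc 
and its dependencies are on the classpath):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RunHqlScript {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 driver (auto-loaded under JDBC 4, explicit for safety).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String script = new String(
                Files.readAllBytes(Paths.get("/path/to/script.hql")),
                StandardCharsets.UTF_8);
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hiveuser", "");
             Statement stmt = conn.createStatement()) {
            // HiveServer2 executes one statement at a time, so split the
            // script on ';' (naive: breaks on semicolons inside string literals).
            for (String sql : script.split(";")) {
                if (!sql.trim().isEmpty()) {
                    stmt.execute(sql.trim());
                }
            }
        }
    }
}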


2013/7/29 Brendan Heussler 

> Jerome,
>
> There is a really good page on the wiki:
> https://cwiki.apache.org/Hive/hiveserver2-clients.html
>
> I use the HiveServer2 JDBC driver.  Maybe there are other ways?
>
>
>
> Brendan
>
>
> On Mon, Jul 29, 2013 at 5:47 AM, Jérôme Verdier <
> verdier.jerom...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks everyone for your help.
>>
>> Does anyone have a good tutorial for running Hive queries and scripts with
>> Java (over Eclipse)? I have some Java development experience but I'm pretty
>> new to using Hive with Java/Eclipse.
>>
>> Thanks.
>>
>>
>> 2013/7/25 j.barrett Strausser 
>>
>>> The advice I have always seen for your case is to transform the subquery
>>> in the WHERE clause into a LEFT OUTER JOIN.
>>>
>>>
>>>
>>>
>>> On Thu, Jul 25, 2013 at 11:04 AM, Edson Ramiro wrote:
>>>
 AFAIK,

 Hive supports subqueries only in the FROM clause.

 Maybe you have to split your query into more queries...


 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries




Edson Ramiro


 On Thu, Jul 25, 2013 at 9:31 AM, Jérôme Verdier <
 verdier.jerom...@gmail.com> wrote:

> Hi Bennie,
>
> I was trying some solutions to work around my problem, and an error occurred.
>
> Here is the error:
>
> FAILED: ParseException line 26:14 cannot recognize input near 'SELECT'
> 'cal' '.' in expression specification
>
> Is AND ... BETWEEN (SELECT ...) possible in Hive?
>
>
> 2013/7/25 Bennie Schut 
>
>>  Hi Jerome,
>>
>> Yes, it looks like you could stop using GET_SEMAINE and directly
>> join "calendrier_hebdo" with "calendrier", for example. For
>> "FCALC_IDJOUR" you will have to write a UDF, so I hope you have some Java
>> skills :)
>> The "calendrier" table suggests you have a star schema with a calendar
>> table. If on Oracle you partitioned on a date and used a subquery to get the
>> dates you want from the fact table, you can expect this to be a problem in
>> Hive. Partition pruning happens during planning; Hive will not know which
>> partitions to prune, and thus runs on all the data in the fact table and
>> filters after it's done, making partitioning useless. There are ways of
>> working around this: it seems most people use a "deterministic"
>> UDF which produces the dates, and this causes the UDF to be run during
>> planning, so partition pruning magically works again.
>> Hope this helps.
>> Hope this helps.
>>
>> Bennie.
>>
>> Op 25-7-2013 09:50, Jérôme Verdier schreef:
>>
>>Hi,
>>
>>  I need some help to translate a PL/SQL script into HiveQL.
>>
>>  Problem : my PL/SQL script is calling two functions.
>>
>>  you can see the script below :
>>
>> SELECT
>>   in_co_societe as co_societe,
>>   'SEMAINE' as co_type_periode,
>>   a.type_entite as type_entite,
>>   a.code_entite as code_entite,
>>   a.type_rgrp_produits  as type_rgrp_produits,
>>   a.co_rgrp_produits as co_rgrp_produits,
>>   SUM(a.MT_CA_NET_TTC)  as MT_CA_NET_TTC,
>>   SUM(a.MT_OBJ_CA_NET_TTC)  as MT_OBJ_CA_NET_TTC,
>>   SUM(a.NB_CLIENTS) as NB_CLIENTS,
>>   SUM(a.MT_CA_NET_TTC_COMP) as MT_CA_NET_TTC_COMP,
>>   SUM(a.MT_OBJ_CA_NET_TTC_COMP) as
>> MT_OBJ_CA_NET_TTC_COMP,
>>   SUM(a.NB_CLIENTS_COMP)as NB_CLIENTS_COMP
>> from
>>   kpi.thm_ca_rgrp_produits_jour/*@o_bi.match.eu*/ a
>> WHERE
>> a.co_societe = in_co_societe
>> AND a.dt_jour between
>>   (
>> SELECT
>>   cal.dt_jour_deb
>> FROM ods.calendrier_hebdo cal
>> WHERE cal.co_societe = in_co_societe
>> AND cal.co_an_semaine = ods.package_date.get_semaine(
>>   ods.package_date.fcalc_idjour(
>> CASE
>>   WHEN TO_CHAR(D_Dernier_Jour,'') =
>> TO_CHAR(D_Dernier_Jour-364,'') THEN
>> NEXT_DAY(D_Dernier_Jour-364,1)-7
>>   ELSE
>> D_Dernier_Jour-364
>> END
>>   )
>> )
>>   )
>>   AND D_Dernier_Jour-364
>> -- Nothing is computed if the week is complete (on ne calcule rien si la semaine est complète)
>> AND (
>>   TO_CHAR(D_Dernier_Jour,'DDMM') <> '3112'
>>   AND TO_CHAR(D_Dernier_Jour,'D'