Request for edit permission to hive wiki

2014-09-09 Thread Li, Rui
Hi,

Could anyone grant me edit permission for the Hive wiki?
My Confluence user name: lirui

Thanks a lot!

Cheers.
Rui Li



Re: doubt about locking mechanism in Hive

2014-09-09 Thread wzc
Hi,
We also encountered this in Hive 0.13. We need to enable concurrency in
daily ETL workflows (to avoid a sub-ETL starting to read a parent ETL's output
while it's still running).
We found that in Hive 0.13, sometimes when you open the Hive CLI shell it
outputs the message conflicting lock present for default mode EXCLUSIVE and
waits for some locks to be released. We haven't encountered this in Hive 0.11
and are still trying to figure it out.



2014-08-25 15:21 GMT+08:00 Sourygna Luangsay sluang...@pragsis.com:

  Many thanks Edward for this complete answer.



 So the main idea is simply to disable concurrency in Hive, if I understand you correctly.



 My doubt now is: is this something most Hive users do by default?

 Can somebody else share their own experience?



 Regards,



 *Sourygna Luangsay*



 *From:* Edward Capriolo [mailto:edlinuxg...@gmail.com]
 *Sent:* Friday, 22 August 2014 16:07
 *To:* user@hive.apache.org
 *Subject:* Re: doubt about locking mechanism in Hive



 IMHO locking support should be turned off by default. I would argue that if
 you are requiring this feature often, you may be designing your systems
 improperly.

 You really should not have many situations where you need locking in
 a write-(mostly-)once file system. The only time I have ever used it is when
 I had a process completely rewriting the contents of a table and I needed
 downstream jobs not to select from the table while it was in an
 inconsistent state. Having it on by default is a bad idea. You have pointed
 out a case where a simple select query attempts to acquire locks it
 does not need. That puts strain on more systems and creates more chances
 for issues.



 One of the big design-philosophy issues I tend to have with Hive lately is
 that we have this pool of users (like myself) who use Hive for its original
 purpose: to query write-once text files and create aggregations.

 Then there are other groups attempting to implement very complicated
 semantics around streaming, transactions, locking, whatever. Then you have
 tools like Cloudera Manager giving configuration warnings such as:

  Hive: Hive is not configured with ZooKeeper Service. As a result,
 hive-site will not contain hive.zookeeper.quorum, which can lead to
 corruption in concurrency scenarios.

 I think this statement is incorrect AND is BAD advice. Then users such as
 yourself reach a conclusion like I should turn on locking, because no one
 would ever assume that

 !!!SELECTING 1 ROW FROM A TABLE WOULD CAUSE 1100 LOCKS TO BE ACQUIRED

 ::rant over:: I am not saying that Hive locking is bad, but I am saying I
 leave it off and turn it on when I need it, on a per-query basis.
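
 Edward's per-query approach boils down to a session-level toggle. A minimal
 sketch (table and column names are hypothetical; this assumes the classic
 ZooKeeper-based lock manager, not the newer ACID transaction manager):

```sql
-- Session default: no lock manager involved.
SET hive.support.concurrency=false;

-- Turn locking on only around the query that rewrites a table
-- which downstream readers depend on:
SET hive.support.concurrency=true;
INSERT OVERWRITE TABLE daily_report
SELECT dt, count(*) FROM pageviews GROUP BY dt;

-- And back off for the rest of the session.
SET hive.support.concurrency=false;
```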











 On Fri, Aug 22, 2014 at 8:48 AM, Sourygna Luangsay sluang...@pragsis.com
 wrote:

 Hi,



 I am having some trouble with the locking/concurrency mechanism of Hive when
 doing a large select and trying to create a table at the same time.

 My version of Hive is 0.13.



 What I try to do is the following:



 1)  In a hive shell:
 use mydatabase;
 select * from competence limit 1; # this table has 1100 partitions, so
 with hive.support.concurrency=true it needs at least 90s to execute. (I
 know this is a silly query; I should rather select from a single
 partition. Its purpose is simply to reproduce the problem with a query
 that takes a long time to execute.)



 2)  In another hive shell, meanwhile the 1st query is executing:
 use mydatabase;
 create table probsourygna (foo string) ROW FORMAT DELIMITED FIELDS
 TERMINATED BY '\t'  STORED AS TEXTFILE ;

 The problem is that the “create table” does not execute until the first
 query (select) has finished.

 And we can see messages of the following type:

 conflicting lock present for mydatabase mode EXCLUSIVE

 conflicting lock present for mydatabase mode EXCLUSIVE

 …



 (1 line every 60 s)





 It seems to me that the first query puts a shared lock at the database
 (mydatabase) level.

 Then, the second query tries to acquire an exclusive lock at the database
 level (fails and retries every 60s).
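
 While the second shell is waiting, the locks themselves can be inspected
 from any Hive session; a quick check along these lines (requires the
 ZooKeeper lock manager to be enabled):

```sql
-- All locks currently held or requested:
SHOW LOCKS;

-- Per-table detail, including which query holds the lock:
SHOW LOCKS competence EXTENDED;
```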



 Am I right? (When I look at the documentation at
 https://cwiki.apache.org/confluence/display/Hive/Locking, it says
 nothing about locks at the database level.)

 Is there any solution to my problem? (Avoiding a long “select” blocking a
 “create” query, without disabling concurrency in Hive.)



 Regards,



 *Sourygna Luangsay*


 CONFIDENTIALITY NOTICE
 This email and the information contained in or attached to it is private and
 confidential and is intended exclusively for its addressee. Pragsis informs
 anyone who may have received this email in error that it contains
 confidential information whose use, copying, reproduction or distribution is
 expressly prohibited. If you are not the addressee and have received this
 email in error, please notify the sender and delete it without copying,
 printing or using it in any way.

Re: Nested types in ORC

2014-09-09 Thread Abhishek Agarwal
Thanks Prasanth. Does it also mean that a query reading the nested.k column
will invariably read nested.v as well, even if the nested.v column is not used
in the query?

On Mon, Sep 8, 2014 at 11:29 PM, Prasanth Jayachandran 
pjayachand...@hortonworks.com wrote:

 Hi

 ORC stores nested fields as separate columns. For example, the following
 table:
 create table orc_nested (key string, nested struct<k:string,v:string>, zip
 bigint) stored as orc;
 will be flattened and stored as separate columns, like below:
 key, nested, nested.k, nested.v, zip

 You can have a look at the structure of ORC files using the hive
 --orcfiledump utility.

 With regard to your next question, predicate pushdown is not supported for
 complex types at this point. There is already a JIRA for supporting it:
 https://issues.apache.org/jira/browse/HIVE-7214

 At this point, schema 2 will let you use predicate pushdown. The
 performance difference depends mainly on the data layout and whether the
 column is sorted or not.

 Thanks
 Prasanth Jayachandran

 On Sep 8, 2014, at 6:16 AM, Abhishek Agarwal abhishc...@gmail.com wrote:

 Hi all,
 I have a few questions with regard to nested columns in Hive.
  How does ORC internally store complex types such as a struct? Are
 the nested fields stored as separate columns, or is the whole struct
 serialized as one column?

  Is predicate pushdown supported for queries which access nested columns?
 In general, is there a significant performance difference between the
 following schemas with regard to query execution and storage?

 Schema1:

 {
 string a;
 struct b {
   string b1;
   string b2;
 }
 }

 Schema 2:
 {
 string a;
 string b.b1;
 string b.b2;
 }

 --
 Regards,
 Abhishek Agarwal



 CONFIDENTIALITY NOTICE
 NOTICE: This message is intended for the use of the individual or entity
 to which it is addressed and may contain information that is confidential,
 privileged and exempt from disclosure under applicable law. If the reader
 of this message is not the intended recipient, you are hereby notified that
 any printing, copying, dissemination, distribution, disclosure or
 forwarding of this communication is strictly prohibited. If you have
 received this communication in error, please contact the sender immediately
 and delete it from your system. Thank You.




-- 
Regards,
Abhishek Agarwal


Re: Nested types in ORC

2014-09-09 Thread Prasanth Jayachandran
Yes. It does now. 

Thanks
Prasanth Jayachandran





Re: UDTF KryoException: unable create/find class error in hive 0.13

2014-09-09 Thread Furcy Pin
Hi,

I think I encountered this kind of serialization problem when writing UDFs.
Usually, marking every field of the UDF as *transient* does the trick.

I guess the error means that Kryo tries to serialize the UDF class and
everything inside it; by marking the fields as transient
you ensure that it will not, and that they will be instantiated in the
default constructor or during the call to initialize().

Please keep me informed if it works or not,

Regards,

Furcy


2014-09-09 1:44 GMT+02:00 Echo Li echo...@gmail.com:

 I wrote a UDTF in Hive 0.13. The function parses a column which is a JSON
 string and returns a table. The function compiles successfully after adding
 hive-exec-0.13.0.2.1.2.1-471.jar to the classpath; however, when the jar is
 added to Hive, a function is created from the jar, and I try to run a
 query using that function, I get the error:

 org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find
 class: class_name

 I went through all the steps in a lower version of Hive (0.10) and
 everything works fine. I searched around and it seems this is caused by the
 Kryo serde, so my question is: is there a fix? And where can I find it?

 thank you.



Re: Dynamic Partitioning- Partition_Naming

2014-09-09 Thread Nitin Pawar
You cannot modify the paths of partitions being created by dynamic
partitioning, or rename them.
The default implementation puts column=value in the path for each
partition.
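
If the directory layout itself matters (for example, for non-Hive consumers),
one common workaround is an external table whose partitions are registered
explicitly at whatever location you like; a sketch, with paths hypothetical:

```sql
CREATE EXTERNAL TABLE invoice_details_ext (
  invoice_id DOUBLE, invoice_date STRING,
  invoice_amount DOUBLE, paid_date STRING)
PARTITIONED BY (pay_country STRING, pay_location STRING);

-- Each partition can point at an arbitrary directory:
ALTER TABLE invoice_details_ext
  ADD PARTITION (pay_country='INDIA', pay_location='DELHI')
  LOCATION '/data/invoices/INDIA/DELHI';
```

The trade-off is that you register each partition yourself instead of relying
on dynamic partitioning.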


On Tue, Sep 9, 2014 at 5:18 AM, anusha Mangina anusha.mang...@gmail.com
wrote:


 I need a table partitioned by country and then city. I created a table
 and INSERTed data from another table using dynamic partitioning.

 CREATE TABLE invoice_details_hive_partitioned(Invoice_Id
 double,Invoice_Date string,Invoice_Amount double,Paid_Date
 string)PARTITIONED BY(pay_country STRING,pay_location STRING);

 Everything worked fine.



 Partitions by default are named like pay_country=INDIA and
 pay_city=DELHI etc. in

 ../hive/warehouse/invoice_details_hive_partitioned/pay_country=INDIA/pay_city=DELHI


 Can I get the partition name as just the column value, INDIA and DELHI, not
 including the column name, like
 /hive/warehouse/invoice_details_hive_partitioned/INDIA/DELHI?

 Thanks in Advance





-- 
Nitin Pawar


Re: doubt about locking mechanism in Hive

2014-09-09 Thread Edward Capriolo
We use our own library: simple constructions like files in HDFS that work
like pid/lock files. A file like /flags/tablea/process1 could mean hey, I'm
working on table a, leave it alone. This accomplishes the exact same thing
with less fuss, and it is also much easier for an external
process/scheduler/shell script to integrate with this system. I doubt many
use Hive locking as flow control for a scheduling system.


Output File Path- Directory Structure

2014-09-09 Thread anusha Mangina
My table has dynamic partitions and creates the file path as

s3://some-bucket/pageviews/dt=20120311/key=ACME1234/site=example.com/Output-file-1


Is there something I can do so I can always have the path as

s3://some-bucket/pageviews/20120311/ACME1234/example.com/Output-file-1

Please help me out, guys.


Weird Error on Inserting in Table [ORC, MESOS, HIVE]

2014-09-09 Thread John Omernik
I am doing a dynamic-partition load in Hive 0.13 using ORC files. This has
always worked in the past, both with MapReduce v1 and YARN. I am working
with Mesos now, and trying to troubleshoot this weird error:



Failed with exception AlreadyExistsException(message:Partition already
exists



What's odd is that my insert is a plain insert (without Overwrite), so it's as
if two different reducers have data to go into the same partition, and then
there is a collision of some sort? Perhaps there is a situation where the
partition doesn't exist prior to the run, but when two reducers have data,
they both think they should be the one to create the partition? Shouldn't
the reducer, if a partition already exists, just copy its file into the
partition? I am struggling to see why this would be an issue with Mesos
but not on YARN or MRv1.


Any thoughts would be welcome.


John


Re: Weird Error on Inserting in Table [ORC, MESOS, HIVE]

2014-09-09 Thread John Omernik
I ran with debug logging, and this is interesting: there was a loss of
connection to the metastore client RIGHT before the partition error mentioned
above, as data was being moved around. I wonder if the timing
on that is bad?

14/09/09 12:47:37 [main]: INFO exec.MoveTask: Partition is: {day=null,
source=null}

14/09/09 12:47:38 [main]: INFO metadata.Hive: Renaming
src:maprfs:/user/hive/scratch/hive-mapr/hive_2014-09-09_12-38-30_860_3555291990145206535-1/-ext-1/day=2012-11-30/source=20121119_SWAirlines_Spam/04_0;dest:
maprfs:/user/hive/warehouse/intel_flow.db/pcaps/day=2012-11-30/source=20121119_SWAirlines_Spam/04_0;Status:true

14/09/09 12:48:02 [main]: WARN metastore.RetryingMetaStoreClient:
MetaStoreClient lost connection. Attempting to reconnect.

org.apache.thrift.transport.TTransportException:
java.net.SocketTimeoutException: Read timed out

at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)







RE: Indexes vs Partitions in hive

2014-09-09 Thread Martin, Nick
Lefty, that’s the single best description of indexes/partitions I’ve yet 
encountered. Stealing it.

Nice ☺

From: Lefty Leverenz [mailto:leftylever...@gmail.com]
Sent: Tuesday, September 09, 2014 2:28 PM
To: user@hive.apache.org
Subject: Re: Indexes vs Partitions in hive

Others can give technical explanations, but I'll give you a simple analogy:  a 
book might have an index as well as chapters.  Both help you find information 
more quickly.  The index directs you to particular information, and chapters 
partition the book into smaller pieces that are organized around a common theme.

To stretch the analogy, a book can only have one set of chapters but it can 
have multiple indexes (topic index, scientific name index, poem title index, 
poem author index, and so on).
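
In Hive DDL, the analogy maps to something like the following sketch (names
invented; index support varied by version, and this shows only the basic
COMPACT handler):

```sql
-- Chapters: a table gets exactly one partitioning scheme.
CREATE TABLE book (topic STRING, body STRING)
PARTITIONED BY (chapter STRING);

-- Indexes: several can be layered on the same table.
CREATE INDEX book_topic_idx ON TABLE book (topic)
AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX book_topic_idx ON book REBUILD;
```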


-- Lefty

On Mon, Sep 8, 2014 at 6:26 AM, Chhaya Vishwakarma
chhaya.vishwaka...@lntinfotech.com
wrote:

Hi All,

How are indexes in Hive different from partitions? Both improve query
performance as far as I know, so in what way do they differ?

What are the situations where I would use indexing or partitioning? Can I use
them together?

Kindly suggest


Regards,
Chhaya Vishwakarma



The contents of this e-mail and any attachment(s) may contain confidential or 
privileged information for the intended recipient(s). Unintended recipients are 
prohibited from taking action on the basis of information in this e-mail and 
using or disseminating the information, and must notify the sender and delete 
it from their system. LT Infotech will not accept responsibility or liability 
for the accuracy or completeness of, or the presence of any virus or disabling 
code in this e-mail



Re: Weird Error on Inserting in Table [ORC, MESOS, HIVE]

2014-09-09 Thread John Omernik
Well, here is me talking to myself, but in case someone else runs across
this: I changed the Hive metastore connect timeout to 600 seconds (per the
JIRA below, for Hive 0.14) and now my problem has gone away. It looks like
the timeout was causing some craziness.

https://issues.apache.org/jira/browse/HIVE-7140
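
For anyone looking for the knob: the property involved should be the
metastore client socket timeout; something along these lines (property name
per Hive 0.13, value in seconds):

```sql
-- or set hive.metastore.client.socket.timeout in hive-site.xml
SET hive.metastore.client.socket.timeout=600;
```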






Re: Hive Index and ORC

2014-09-09 Thread Gopal V

On 9/6/14, 9:36 AM, Alain Petrus wrote:


I am wondering whether it is possible to use Hive indexes with the ORC
format? Does it make sense?


ORC maintains its own indexes within the file: one index record every
10,000 rows (orc.row.index.stride / orc.create.index).


You can take advantage of it during scan+filter with the following option:

hive> SET hive.optimize.index.filter=true;

A recent IBM paper has some detailed analysis of ORC's indexing
performance. The index is relatively free, because there is no step
beyond just inserting into an ORC table.


The part where ORC does help a lot is if you then do an ANALYZE TABLE
to build the information required to make query plans better, because it
will read the stats off the single index record at the bottom of each
ORC file (the partial-scan mode).
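
Putting those pieces together, a sketch (table and column names invented;
the TBLPROPERTIES shown are the defaults anyway):

```sql
CREATE TABLE events_orc (id BIGINT, ts STRING)
STORED AS ORC
TBLPROPERTIES ('orc.create.index'='true',
               'orc.row.index.stride'='10000');

-- Let scans use the row-group indexes for filtering:
SET hive.optimize.index.filter=true;

-- Cheap stats read from the ORC footers (partial-scan mode):
ANALYZE TABLE events_orc COMPUTE STATISTICS PARTIALSCAN;
```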


Cheers,
Gopal


PIG heart beat freeze using hue + cdh 5.1

2014-09-09 Thread Amit Dutta
Hi, I have only 604 rows in the hive table.
While using

A = LOAD 'revenue' USING org.apache.hcatalog.pig.HCatLoader();
DUMP A;

it starts spouting heart beat repeatedly and does not leave this state. Can
someone please help? I am getting the following exception:
  2014-09-09 17:27:45,844 [JobControl] INFO  
org.apache.hadoop.mapreduce.JobSubmitter  - Kind: RM_DELEGATION_TOKEN, Service: 
10.215.204.182:8032, Ident: (owner=cloudera, renewer=oozie mr token, 
realUser=oozie, issueDate=1410301632571, maxDate=1410906432571, 
sequenceNumber=14, masterKeyId=2)
  2014-09-09 17:27:46,709 [JobControl] WARN  
org.apache.hadoop.mapreduce.v2.util.MRApps  - cache file 
(mapreduce.job.cache.files) 
hdfs://txwlcloud2:8020/user/oozie/share/lib/lib_20140820161455/pig/commons-httpclient-3.1.jar
 conflicts with cache file (mapreduce.job.cache.files) 
hdfs://txwlcloud2:8020/user/oozie/share/lib/lib_20140820161455/hcatalog/commons-httpclient-3.1.jar
 This will be an error in Hadoop 2.0
  2014-09-09 17:27:46,712 [JobControl] WARN  
org.apache.hadoop.mapreduce.v2.util.MRApps  - cache file 
(mapreduce.job.cache.files) 
hdfs://txwlcloud2:8020/user/oozie/share/lib/lib_20140820161455/pig/commons-io-2.1.jar
 conflicts with cache file (mapreduce.job.cache.files) 
hdfs://txwlcloud2:8020/user/oozie/share/lib/lib_20140820161455/hcatalog/commons-io-2.1.jar
 This will be an error in Hadoop 2.0
  2014-09-09 17:27:46,894 [JobControl] INFO  
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl  - Submitted application 
application_1410291186220_0006
  2014-09-09 17:27:46,968 [JobControl] INFO  org.apache.hadoop.mapreduce.Job  - 
The url to track the job: 
http://txwlcloud2:8088/proxy/application_1410291186220_0006/
  2014-09-09 17:27:46,969 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  
- HadoopJobId: job_1410291186220_0006
  2014-09-09 17:27:46,969 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  
- Processing aliases A
  2014-09-09 17:27:46,969 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  
- detailed locations: M: A[1,4] C:  R:
  2014-09-09 17:27:46,969 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  
- More information at: 
http://txwlcloud2:50030/jobdetails.jsp?jobid=job_1410291186220_0006
  2014-09-09 17:27:47,019 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  
- 0% complete
  Heart beat
  Heart beat
  Heart beat
  Heart beat
  Heart beat  

Re: Pig jobs run forever with PigEditor in Hue

2014-09-09 Thread Amit Dutta
Hi,
Does anyone know how to increase the MapReduce slots? I am getting an
infinite heartbeat when I run a Pig script from Hue (Cloudera CDH 5.1).
Thanks, Amit

Increase mapreduce slots

2014-09-09 Thread Amit Dutta
Hi,
Does anyone know how to increase the MapReduce slots? I am getting an
infinite heartbeat when I run a Pig script from Hue (Cloudera CDH 5.1).
Thanks, Amit
  

Re: PIG heart beat freeze using hue + cdh 5.1

2014-09-09 Thread Zenonlpc
It uses YARN now. You need to set your container resource memory and CPU,
then set the MapReduce physical memory and CPU cores. The number of mappers
and reducers is calculated based on the resources you gave to your mapper
and reducer.

Pengcheng
Sent from my iPhone
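
Concretely, on YARN the knobs Pengcheng describes look roughly like this
(values are placeholders; Hive-style per-job overrides shown, with the
cluster-side limits living in yarn-site.xml):

```sql
-- Per-job container sizes:
SET mapreduce.map.memory.mb=1024;
SET mapreduce.reduce.memory.mb=2048;
SET mapreduce.map.cpu.vcores=1;
-- Cluster side (yarn-site.xml): yarn.nodemanager.resource.memory-mb and
-- yarn.scheduler.maximum-allocation-mb bound how many containers fit per node.
```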

 On Sep 9, 2014, at 7:55 PM, Amit Dutta amitkrdu...@outlook.com wrote:
 
 I think one of the issues is the number of MapReduce slots for the cluster.
 Can anyone please let me know how I increase the MapReduce slots?
 
 From: amitkrdu...@outlook.com
 To: user@hive.apache.org
 Subject: PIG heart beat freeze using hue + cdh 5.1
 Date: Tue, 9 Sep 2014 17:55:01 -0500
 


Re: Indexes vs Partitions in hive

2014-09-09 Thread Lefty Leverenz
Thanks very much Nick, it's yours for the taking.

-- Lefty
