Avro Backed Hive tables

2013-03-06 Thread Murtaza Doctor
Folks,

Wanted to get some help or feedback from the community on this one:

We are trying to create a table backed by avro schema:
Hive Version: 0.10

This is the syntax we are using to create the table. As you can see, the Avro 
schema is defined inline.
Question: the Avro schema is a union of two records, but when we issue the 
command it creates the table with only the first record, PageView, and its 
columns searchTerms and filterCategories. The second record in the union is 
completely ignored.
It looks as though Hive does not support Avro union records as part of the 
CREATE TABLE statement. Any clues on this one would help.

CREATE TABLE foo
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='[ {
  "type" : "record",
  "name" : "PageView",
  "fields" : [ {
"name" : "searchTerms",
"type" : {
  "type" : "array",
  "items" : "string"
}
  }, {
"name" : "filterCategories",
"type" : {
  "type" : "array",
  "items" : "string"
}
  } ]
}, {
  "type" : "record",
  "name" : "CategoryView",
  "fields" : [ {
"name" : "categoryId",
"type" : "string"
  }, {
"name" : "refinements",
"type" : {
  "type" : "array",
  "items" : {
"type" : "record",
"name" : "Refinement",
"fields" : [ {
  "name" : "attribute",
  "type" : "string"
}, {
  "name" : "value",
  "type" : "string"
} ]
  }
},
"default" : [ ]
  } ]
} ]');

Thanks,
murtaza
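
One workaround sometimes used in this situation (not confirmed in this thread; the table name below is illustrative) is to give each record type its own Avro-backed table and combine them at query time, for example:

CREATE TABLE page_view
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "type" : "record",
  "name" : "PageView",
  "fields" : [
    { "name" : "searchTerms", "type" : { "type" : "array", "items" : "string" } },
    { "name" : "filterCategories", "type" : { "type" : "array", "items" : "string" } }
  ]
}');

-- a second table would be declared the same way with the CategoryView schema,
-- and the two can then be queried separately or combined with UNION ALL.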



RE: Hadoop cluster hangs on big hive job

2013-03-06 Thread Chalcy Raja
In my case, it was not a bug.  The temp data was filling up the data space and 
it looked like a hang, but the last reducer was still running, trying to move 
data.  Once there is absolutely no space left for data, the cluster goes into 
safe mode and hangs. In my case it did not get to that point.  I terminated the 
query, broke it up so that the final table is partitioned, and that worked fine.
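
A minimal sketch of that approach, assuming an illustrative date column dt and placeholder table/column names (not taken from the thread):

CREATE TABLE final_table (col1 STRING, col2 BIGINT)
PARTITIONED BY (dt STRING);

-- load one day (or a few days) per run instead of one huge job
INSERT OVERWRITE TABLE final_table PARTITION (dt='2013-03-01')
SELECT col1, col2
FROM source_table
WHERE dt = '2013-03-01';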

If you provide your hive query, along with more information about your cluster 
size and the size of the data you are running the query over, I can analyze your 
issue and maybe provide a solution.

Thanks,
Chalcy

From: Daning Wang [dan...@netseer.com]
Sent: Wednesday, March 06, 2013 4:17 PM
To: user@hive.apache.org
Subject: Re: Hadoop cluster hangs on big hive job

Thanks Chalcy! But the hadoop cluster should not hang in any way, is that a bug?

On Wed, Mar 6, 2013 at 12:33 PM, Chalcy Raja 
<chalcy.r...@careerbuilder.com> wrote:
You could try breaking up the hive query to return smaller datasets.  I have 
noticed this behavior when the hive query has ‘in’ in where clause.

Thanks,
Chalcy
From: Daning Wang [mailto:dan...@netseer.com]
Sent: Wednesday, March 06, 2013 3:08 PM
To: user@hive.apache.org
Subject: Hadoop cluster hangs on big hive job

We have a 5-node cluster (Hadoop 1.0.4). It hung a couple of times while running 
big hive jobs (hive-0.8.1). Basically all the nodes are dead; from the 
tasktracker's log it looks like it went into some kind of loop forever.

All the log entries like this when problem happened.

Any idea how to debug the issue?

Thanks in advance.


2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_28_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_36_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_16_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_32_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_00_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_08_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:24,723 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:25,336 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_04_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:25,539 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_43_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:25,545 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:25,569 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_28_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:25,855 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:26,876 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_36_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:27,159 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_16_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:27,505 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:28,464 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_32_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:28,553 INFO org.apache.hadoop.mapred.T

Re: Variable Substitution

2013-03-06 Thread Matt Tucker
I'm fine with the variable placeholder not being removed in cases where the 
variable is not defined (until I change my mind). When I define var2 and var3, 
though, their placeholders aren't swapped for their values.

My reasoning for this was that I'm moving from one execution script that 
defines var1, to a newer script that defines var2 and var3.  The goal for this 
was that I could use the same hive script with either execution script until 
we've fully migrated.

I don't disagree with the placeholders remaining if they're not defined; I'm 
just unsure why having one variable reference that isn't defined should prevent 
the others from working.
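
One stopgap while both execution scripts are in use (a sketch; the placeholder value for var1 is an assumption) is to have each script define every variable the shared Hive script references, so none of the placeholders are left undefined:

-- newer execution script: define var2/var3 as before, plus a var1 default
-- that is never expected to match a real row
set var1=0000-00-00;
set var2=2013-02-01;
set var3=2013-02-10;

SELECT clndr_dt FROM calendar
WHERE clndr_dt LIKE "${hiveconf:var1}"
   OR (clndr_dt >= "${hiveconf:var2}" AND clndr_dt <= "${hiveconf:var3}")
LIMIT 1;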


On Mar 6, 2013, at 4:41 PM, Edward Capriolo  wrote:

> It was done like this in Hive because that is what Hadoop's variable 
> substitution does: if it does not understand the variable, it does not 
> replace it.
> 
> On Wed, Mar 6, 2013 at 4:30 PM, Dean Wampler 
>  wrote:
>> Even newer versions of Hive do this. Any reason you don't want to provide a 
>> definition for all of them? You could argue that an undefined variable is a 
>> bug and leaving the literal text in place makes it easier to notice. 
>> Although, Unix shells would insert an empty string, so never mind ;)
>> 
>> On Wed, Mar 6, 2013 at 3:13 PM, Matt Tucker  wrote:
>>> Using CDH3u3 (Hive 0.7.1), it appears that variable substitution becomes 
>>> disabled when I use a variable in a query that hasn't been defined.
>>> 
>>> For instance, using the following script:
>>> 
>>> set var2=2013-02-01;
>>> set var3=2013-02-10;
>>> 
>>> SELECT clndr_dt FROM calendar WHERE clndr_dt LIKE "${hiveconf:var1}" OR 
>>> (clndr_dt >= "${hiveconf:var2}" AND clndr_dt <= "${hiveconf:var3}") LIMIT 1;
>>> 
>>> I would expect the results job configuration to list the hive.query.string 
>>> as:
>>> 
>>> SELECT clndr_dt FROM calendar WHERE clndr_dt LIKE "${hiveconf:var1}" OR 
>>> (clndr_dt >= "2013-02-01" AND clndr_dt <= "2013-02-10") LIMIT 1;
>>> 
>>> (or with ${hiveconf:var1} removed, leaving an empty string).
>>> 
>>> Instead, it prints the query as-is (with the variable placeholders).  If I 
>>> set var1=2012-01-01, it properly substitutes the variables, but it only 
>>> returns the '2012-01-01' record (not the 2013 records).
>>> 
>>> SELECT clndr_dt FROM calendar WHERE clndr_dt LIKE "2012-01-01" OR ( 
>>> clndr_dt >= "2013-02-01" AND clndr_dt <= "2013-02-10" ) LIMIT 1
>>> 
>>> I was originally planning to use this for partition pruning, but it doesn't 
>>> appear to be the cause as the calendar table is not partitioned.
>>> 
>>> Is there something that I've overlooked?
>>> 
>>> Thanks!
>> 
>> 
>> 
>> -- 
>> Dean Wampler, Ph.D.
>> thinkbiganalytics.com
>> +1-312-339-1330
> 


Re: Variable Substitution

2013-03-06 Thread Edward Capriolo
It was done like this in Hive because that is what Hadoop's variable
substitution does: if it does not understand the variable, it does
not replace it.

On Wed, Mar 6, 2013 at 4:30 PM, Dean Wampler <
dean.wamp...@thinkbiganalytics.com> wrote:

> Even newer versions of Hive do this. Any reason you don't want to provide
> a definition for all of them? You could argue that an undefined variable is
> a bug and leaving the literal text in place makes it easier to notice.
> Although, Unix shells would insert an empty string, so never mind ;)
>
> On Wed, Mar 6, 2013 at 3:13 PM, Matt Tucker  wrote:
>
>> Using CDH3u3 (Hive 0.7.1), it appears that variable substitution becomes
>> disabled when I use a variable in a query that hasn't been defined.
>>
>> For instance, using the following script:
>>
>> set var2=2013-02-01;
>> set var3=2013-02-10;
>>
>> SELECT clndr_dt FROM calendar WHERE clndr_dt LIKE "${hiveconf:var1}" OR
>> (clndr_dt >= "${hiveconf:var2}" AND clndr_dt <= "${hiveconf:var3}") LIMIT 1;
>>
>>
>> I would expect the results job configuration to list the
>> hive.query.string as:
>>
>> SELECT clndr_dt FROM calendar WHERE clndr_dt LIKE "${hiveconf:var1}" OR
>> (clndr_dt >= "2013-02-01" AND clndr_dt <= "2013-02-10") LIMIT 1;
>>
>>
>> (or with ${hiveconf:var1} removed, leaving an empty string).
>>
>> Instead, it prints the query as-is (with the variable placeholders).  If
>> I set var1=2012-01-01, it properly substitutes the variables, but it only
>> returns the '2012-01-01' record (not the 2013 records).
>>
>> SELECT clndr_dt FROM calendar WHERE clndr_dt LIKE "2012-01-01" OR (
>> clndr_dt >= "2013-02-01" AND clndr_dt <= "2013-02-10" ) LIMIT 1
>>
>> I was originally planning to use this for partition pruning, but it
>> doesn't appear to be the cause as the calendar table is not partitioned.
>>
>> Is there something that I've overlooked?
>>
>> Thanks!
>>
>>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>


Re: Variable Substitution

2013-03-06 Thread Dean Wampler
Even newer versions of Hive do this. Any reason you don't want to provide a
definition for all of them? You could argue that an undefined variable is a
bug and leaving the literal text in place makes it easier to notice.
Although, Unix shells would insert an empty string, so never mind ;)

On Wed, Mar 6, 2013 at 3:13 PM, Matt Tucker  wrote:

> Using CDH3u3 (Hive 0.7.1), it appears that variable substitution becomes
> disabled when I use a variable in a query that hasn't been defined.
>
> For instance, using the following script:
>
> set var2=2013-02-01;
> set var3=2013-02-10;
>
> SELECT clndr_dt FROM calendar WHERE clndr_dt LIKE "${hiveconf:var1}" OR
> (clndr_dt >= "${hiveconf:var2}" AND clndr_dt <= "${hiveconf:var3}") LIMIT 1;
>
>
> I would expect the results job configuration to list the hive.query.string
> as:
>
> SELECT clndr_dt FROM calendar WHERE clndr_dt LIKE "${hiveconf:var1}" OR
> (clndr_dt >= "2013-02-01" AND clndr_dt <= "2013-02-10") LIMIT 1;
>
>
> (or with ${hiveconf:var1} removed, leaving an empty string).
>
> Instead, it prints the query as-is (with the variable placeholders).  If I
> set var1=2012-01-01, it properly substitutes the variables, but it only
> returns the '2012-01-01' record (not the 2013 records).
>
> SELECT clndr_dt FROM calendar WHERE clndr_dt LIKE "2012-01-01" OR (
> clndr_dt >= "2013-02-01" AND clndr_dt <= "2013-02-10" ) LIMIT 1
>
> I was originally planning to use this for partition pruning, but it
> doesn't appear to be the cause as the calendar table is not partitioned.
>
> Is there something that I've overlooked?
>
> Thanks!
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Hadoop cluster hangs on big hive job

2013-03-06 Thread Daning Wang
Thanks Chalcy! But the hadoop cluster should not hang in any way, is that a
bug?

On Wed, Mar 6, 2013 at 12:33 PM, Chalcy Raja
wrote:

>  You could try breaking up the hive query to return smaller datasets.  I
> have noticed this behavior when the hive query has ‘in’ in where clause.
>
>
> Thanks,
>
> Chalcy
>
> *From:* Daning Wang [mailto:dan...@netseer.com]
> *Sent:* Wednesday, March 06, 2013 3:08 PM
> *To:* user@hive.apache.org
> *Subject:* Hadoop cluster hangs on big hive job
>
>
> We have a 5-node cluster (Hadoop 1.0.4). It hung a couple of times while
> running big hive jobs (hive-0.8.1). Basically all the nodes are dead; from
> the tasktracker's log it looks like it went into some kind of loop forever.
>
>
> All the log entries like this when problem happened.
>
>
> Any idea how to debug the issue?
>
>
> Thanks in advance.
>
>
>
> 2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_28_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_36_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_16_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_32_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_00_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_08_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:24,723 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:25,336 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_04_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:25,539 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_43_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:25,545 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:25,569 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_28_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:25,855 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:26,876 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_36_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:27,159 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_16_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:27,505 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:28,464 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_32_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:28,553 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_43_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:28,561 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:28,659 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201302270947_0010_r_00_0 0.131468% reduce > copy (19706 of
> 49964 at 0.00 MB/s) > 
>
> 2013-03-05 15:13:30,519 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_2013022709

Re: Error while exporting table data from hive to Oracle through Sqoop

2013-03-06 Thread Jarek Jarcec Cecho
Hi Ajit,
would you mind upgrading to Sqoop 1.4.3 RC 0 [1]? It has already been voted to 
be released as the final 1.4.3, so it should be safe to use.

One of the improvements in 1.4.3 is SQOOP-720 [2] that significantly improves 
the error message in this scenario.

Jarcec

Links:
1: http://people.apache.org/~hshreedharan/sqoop-1.4.3-rc0/
2: https://issues.apache.org/jira/browse/SQOOP-720

On Wed, Mar 06, 2013 at 05:44:44AM +, Ajit Kumar Shreevastava wrote:
> Hi Abhijeet,
> 
> The data is fine. The map-reduce job first ran to 48% and then failed. After 
> that it tried to load the same data again, and that retry caused the unique 
> constraint error.
> 
> 
> 
> Regards,
> 
> Ajit Kumar Shreevastava
> 
> abhijeet gaikwad <abygaikwa...@gmail.com> 
> wrote:
> 
> 
> + sqoop user
> 
> The answer is in your exception! Check your data; you're hitting a unique key 
> violation.
> 
> Thanks,
> Abhijeet
> On Tue, Mar 5, 2013 at 7:24 PM, Ajit Kumar Shreevastava 
> <ajit.shreevast...@hcl.com> wrote:
> Hi All,
> 
> I am facing following issue while exporting table from hive to Oracle. 
> Importing table from Oracle to Hive and HDFS is working fine. Please let me 
> know where I lag. I am pasting my screen output here.
> 
> 
> [hadoop@NHCLT-PC44-2 sqoop-oper]$ sqoop export --connect 
> jdbc:oracle:thin:@10.99.42.11:1521/clouddb
>  --username HDFSUSER  --table BTTN_BKP --export-dir  
> /home/hadoop/user/hive/warehouse/bttn  -P --verbose  -m 1  
> --input-fields-terminated-by '\001'
> Warning: /usr/lib/hbase does not exist! HBase imports will fail.
> Please set $HBASE_HOME to the root of your HBase installation.
> 13/03/05 19:20:11 DEBUG tool.BaseSqoopTool: Enabled debug logging.
> Enter password:
> 13/03/05 19:20:16 DEBUG sqoop.ConnFactory: Loaded manager factory: 
> com.cloudera.sqoop.manager.DefaultManagerFactory
> 13/03/05 19:20:16 DEBUG sqoop.ConnFactory: Trying ManagerFactory: 
> com.cloudera.sqoop.manager.DefaultManagerFactory
> 13/03/05 19:20:16 DEBUG manager.DefaultManagerFactory: Trying with scheme: 
> jdbc:oracle:thin:@10.99.42.11
> 13/03/05 19:20:16 DEBUG manager.OracleManager$ConnCache: Instantiated new 
> connection cache.
> 13/03/05 19:20:16 INFO manager.SqlManager: Using default fetchSize of 1000
> 13/03/05 19:20:16 DEBUG sqoop.ConnFactory: Instantiated ConnManager 
> org.apache.sqoop.manager.OracleManager@2abe0e27
> 13/03/05 19:20:16 INFO tool.CodeGenTool: Beginning code generation
> 13/03/05 19:20:16 DEBUG manager.OracleManager: Using column names query: 
> SELECT t.* FROM BTTN_BKP t WHERE 1=0
> 13/03/05 19:20:16 DEBUG manager.OracleManager: Creating a new connection for 
> jdbc:oracle:thin:@10.99.42.11:1521/clouddb,
>  using username: HDFSUSER
> 13/03/05 19:20:16 DEBUG manager.OracleManager: No connection paramenters 
> specified. Using regular API for making connection.
> 13/03/05 19:20:16 INFO manager.OracleManager: Time zone has been set to GMT
> 13/03/05 19:20:16 DEBUG manager.SqlManager: Using fetchSize for next query: 
> 1000
> 13/03/05 19:20:16 INFO manager.SqlManager: Executing SQL statement: SELECT 
> t.* FROM BTTN_BKP t WHERE 1=0
> 13/03/05 19:20:16 DEBUG manager.OracleManager$ConnCache: Caching released 
> connection for 
> jdbc:oracle:thin:@10.99.42.11:1521/clouddb/HDFSUSER
> 13/03/05 19:20:16 DEBUG orm.ClassWriter: selected columns:
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   BTTN_ID
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   DATA_INST_ID
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   SCR_ID
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   BTTN_NU
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   CAT
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   WDTH
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   HGHT
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   KEY_SCAN
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   KEY_SHFT
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   FRGND_CPTN_COLR
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   FRGND_CPTN_COLR_PRSD
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   BKGD_CPTN_COLR
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   BKGD_CPTN_COLR_PRSD
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   BLM_FL
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   LCLZ_FL
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   MENU_ITEM_NU
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   BTTN_ASGN_LVL_ID
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   ON_ATVT
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   ON_CLIK
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   ENBL_FL
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   BLM_SET_ID
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   BTTN_ASGN_LVL_NAME
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   MKT_ID
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   CRTE_TS
> 13/03/05 19:20:16 DEBUG orm.ClassWriter:   CRTE_USER_ID
> 13/03/05 19:20:16 DEBUG orm.

RE: Hadoop cluster hangs on big hive job

2013-03-06 Thread Chalcy Raja
You could try breaking up the hive query to return smaller datasets.  I have 
noticed this behavior when the hive query has 'in' in where clause.

Thanks,
Chalcy
From: Daning Wang [mailto:dan...@netseer.com]
Sent: Wednesday, March 06, 2013 3:08 PM
To: user@hive.apache.org
Subject: Hadoop cluster hangs on big hive job

We have a 5-node cluster (Hadoop 1.0.4). It hung a couple of times while running 
big hive jobs (hive-0.8.1). Basically all the nodes are dead; from the 
tasktracker's log it looks like it went into some kind of loop forever.

All the log entries like this when problem happened.

Any idea how to debug the issue?

Thanks in advance.


2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_28_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_36_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_16_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_32_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_00_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_08_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:24,723 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:25,336 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_04_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:25,539 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_43_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:25,545 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:25,569 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_28_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:25,855 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:26,876 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_36_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:27,159 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_16_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:27,505 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:28,464 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_32_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:28,553 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_43_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:28,561 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:28,659 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_00_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:30,519 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:30,644 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_08_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:30,741 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of 49964 at 
0.00 MB/s) >
2013-03-05 15:13:31,369 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201302270947_0010_r_04_0 0.131468% reduce > copy (19706 of 

Hadoop cluster hangs on big hive job

2013-03-06 Thread Daning Wang
We have a 5-node cluster (Hadoop 1.0.4). It hung a couple of times while
running big hive jobs (hive-0.8.1). Basically all the nodes are dead; from
the tasktracker's log it looks like it went into some kind of loop forever.

All the log entries like this when problem happened.

Any idea how to debug the issue?

Thanks in advance.


2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_28_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_36_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_16_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_32_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_00_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_08_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:24,723 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:25,336 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_04_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:25,539 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_43_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:25,545 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:25,569 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_28_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:25,855 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:26,876 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_36_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:27,159 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_16_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:27,505 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:28,464 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_32_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:28,553 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_43_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:28,561 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_12_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:28,659 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_00_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:30,519 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_19_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:30,644 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_08_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:30,741 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_39_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:31,369 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_04_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:31,675 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_00_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:31,875 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201302270947_0010_r_24_0 0.131468% reduce > copy (19706 of
49964 at 0.00 MB/s) >
2013-03-05 15:13:32,372 INFO org.

Re: Combine two overlapping schema?

2013-03-06 Thread Keith Wiley
Ah.  I was stuck on the requirement that the two schemas match, but I see your 
point.  I'll see if that works.

On Mar 6, 2013, at 10:11 , Dean Wampler wrote:

> Off the top of my head, I think UNION ALL should work if you explicitly 
> project out the missing columns with NULL or other values, e.g. using nested 
> SELECTs, something like
> 
> SELECT * FROM (
>  SELECT a,b,c, Y, NULL AS Z FROM table1
>  UNION ALL
>  SELECT a,b,c, NULL AS Y, Z FROM table2
> ) table12; 
> 
> 
> 
> On Wed, Mar 6, 2013 at 12:03 PM, Keith Wiley  wrote:
> I have two tables which have overlapping but nonidentical schemas.  I want to 
> create a new table that unions them, leaving nulls in any given row where a 
> column name doesn't occur in the other table:
> 
> SCHEMA 1: { a, b, c, Y }
> row:  { 1, 2, 3, 4 }
> 
> SCHEMA 2: { a, b, c, Z }
> row:  { 5, 6, 7, 8 }
> 
> NEW SCHEMA: { a, b, c, Y, Z }
> new row:  { a:1, b:2, c:3, Y:4, Z:null }
> new row:  { a:5, b:6, c:7, Y:null, Z:8 }
> 
> I don't think either "full outer join" or "union all" works.  I'm not sure 
> how to do this.  Any ideas?
> 
> Thanks.
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
> 
> "You can scratch an itch, but you can't itch a scratch. Furthermore, an itch 
> can
> itch but a scratch can't scratch. Finally, a scratch can itch, but an itch 
> can't
> scratch. All together this implies: He scratched the itch from the scratch 
> that
> itched but would never itch the scratch from the itch that scratched."
>--  Keith Wiley
> 
> 
> 
> 
> 
> -- 
> Dean Wampler, Ph.D.
> thinkbiganalytics.com
> +1-312-339-1330
> 



Keith Wiley kwi...@keithwiley.com keithwiley.commusic.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
   --  Galileo Galilei




Re: Combine two overlapping schema?

2013-03-06 Thread Dean Wampler
Off the top of my head, I think UNION ALL should work if you explicitly
project out the missing columns with NULL or other values, e.g. using
nested SELECTs, something like

SELECT * FROM (
 SELECT a,b,c, Y, NULL AS Z FROM table1
 UNION ALL
 SELECT a,b,c, NULL AS Y, Z FROM table2
) table12;
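
If the combined result is wanted as a persistent table rather than a one-off query, a CTAS form of the same idea should also work (a sketch; the CASTs assume Y and Z are strings, so adjust them to the real types):

CREATE TABLE table12 AS
SELECT * FROM (
  SELECT a, b, c, Y, CAST(NULL AS STRING) AS Z FROM table1
  UNION ALL
  SELECT a, b, c, CAST(NULL AS STRING) AS Y, Z FROM table2
) unioned;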



On Wed, Mar 6, 2013 at 12:03 PM, Keith Wiley  wrote:

> I have two tables which have overlapping but nonidentical schemas.  I want
> to create a new table that unions them, leaving nulls in any given row
> where a column name doesn't occur in the other table:
>
> SCHEMA 1: { a, b, c, Y }
> row:  { 1, 2, 3, 4 }
>
> SCHEMA 2: { a, b, c, Z }
> row:  { 5, 6, 7, 8 }
>
> NEW SCHEMA: { a, b, c, Y, Z }
> new row:  { a:1, b:2, c:3, Y:4, Z:null }
> new row:  { a:5, b:6, c:7, Y:null, Z:8 }
>
> I don't think either "full outer join" or "union all" works.  I'm not sure
> how to do this.  Any ideas?
>
> Thanks.
>
>
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
>
> "You can scratch an itch, but you can't itch a scratch. Furthermore, an
> itch can
> itch but a scratch can't scratch. Finally, a scratch can itch, but an itch
> can't
> scratch. All together this implies: He scratched the itch from the scratch
> that
> itched but would never itch the scratch from the itch that scratched."
>--  Keith Wiley
>
> 
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Combine two overlapping schema?

2013-03-06 Thread Keith Wiley
I have two tables which have overlapping but nonidentical schemas.  I want to 
create a new table that unions them, leaving nulls in any given row where a 
column name doesn't occur in the other table:

SCHEMA 1: { a, b, c, Y }
row:  { 1, 2, 3, 4 }

SCHEMA 2: { a, b, c, Z }
row:  { 5, 6, 7, 8 }

NEW SCHEMA: { a, b, c, Y, Z }
new row:  { a:1, b:2, c:3, Y:4, Z:null }
new row:  { a:5, b:6, c:7, Y:null, Z:8 }

I don't think either "full outer join" or "union all" works.  I'm not sure how 
to do this.  Any ideas?

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.commusic.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
   --  Keith Wiley




Re: Data mismatch when importing data from Oracle to Hive through Sqoop without an error

2013-03-06 Thread Jarek Jarcec Cecho
Hi Ajit,
I've seen a similar issue many times. Does your table have textual data? If so, 
can it happen that your textual data contains Hive delimiters such as new-line 
characters? If it does, Sqoop might write two lines for one single row in the 
table, which will consequently be seen as two rows in Hive. As Hive will 
implicitly convert any invalid values into NULL, that would also explain 
your NULL values.
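
A quick Hive-side check along those lines (a sketch; the table and column names are borrowed from elsewhere in the thread and may need adjusting) is to count rows where the key columns are all NULL, which is what a stray line break in the source data would typically produce:

SELECT count(*)
FROM bttn
WHERE bttn_id IS NULL
  AND data_inst_id IS NULL
  AND scr_id IS NULL;
-- extend the predicate with the remaining columns to confirm those rows are entirely NULL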

Sqoop offers arguments --hive-drop-import-delims and --hive-delims-replacement 
to deal with this problem. More information can be found in Sqoop user guide 
[1].

Jarcec

Links:
1: 
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_importing_data_into_hive

On Wed, Mar 06, 2013 at 10:50:26PM +0530, abhijeet gaikwad wrote:
> If you see Hadoop job counters in your logs - numbers are correct till that
> point. So normal import in HDFS is working fine. Only reasonable
> explanation is that there is an issue inserting data in Hive.
> 
> Have a look at the file on HDFS it should contain the correct number of
> records with correct data. But if you do "select * ..." from Hive console
> it will give you NULL values for invalid data, as in the data that cannot
> be converted to the respective data type of the column of Hive table. This
> is because Hive imposes data type constraints on the data in files on HDFS.
> Those extra records are a mystery for me too. I don't think the actual file on
> HDFS has extra records.
> 
> Sqoop is creating Hive table incorrectly which **may** have caused this
> issue. Create table manually with correct data types if equivalent
> available or as string and try loading data. If this works fine we know
> where the issue is!
> 
> Thanks,
> Abhijeet
> On Wed, Mar 6, 2013 at 2:05 PM, Ajit Kumar Shreevastava <
> ajit.shreevast...@hcl.com> wrote:
> 
> 
> >  Hi Abhijeet,
> >
> >
> > Thanks for your response.
> >
> > If values that don’t fit in a double were being inserted as NULL, then the
> > counts should not mismatch between the two.
> >
> > Here the NULL values inserted are extra values, on top of the values that
> > are already present in both the Oracle table and the Hive table.
> >
> >
> > Correct me if I am wrong in interpretation.
> >
> >
> > Thanks and Regards,
> >
> > Ajit Kumar Shreevastava
> >
> >
> > *From:* abhijeet gaikwad [mailto:abygaikwa...@gmail.com]
> > *Sent:* Wednesday, March 06, 2013 1:46 PM
> > *To:* user@hive.apache.org
> > *Cc:* u...@sqoop.apache.org
> > *Subject:* Re: Data mismatch when importing data from Oracle to Hive
> > through Sqoop without an error
> >
> >
> > Sqoop maps numeric and decimal types (RDBMS) to double (Hive). I think the
> > values that don't fit in double must be getting inserted as NULL.
> > You can see this warning in your logs.
> >
> > Thanks,
> > Abhijeet
> >
> > On Wed, Mar 6, 2013 at 1:32 PM, Ajit Kumar Shreevastava <
> > ajit.shreevast...@hcl.com> wrote:
> >
> > Hi all,
> >
> > I have noticed one interesting thing in the result-set below.
> >
> > I have fired one query in both Oracle and Hive shell and found the
> > following result set:
> >
> >  
> >
> > SQL> select count(1) from bttn
> >
> >   2  where bttn_id is null or data_inst_id is null or scr_id is null;
> >
> >   COUNT(1)
> >
> > --
> >
> >  0
> >
> > hive> select count(1) from bttn
> >
> > > where bttn_id is null or data_inst_id is null or scr_id is null;
> >
> > Total MapReduce jobs = 1
> >
> > Launching Job 1 out of 1
> >
> > Number of reduce tasks determined at compile time: 1
> >
> > In order to change the average load for a reducer (in bytes):
> >
> >   set hive.exec.reducers.bytes.per.reducer=
> >
> > In order to limit the maximum number of reducers:
> >
> >   set hive.exec.reducers.max=
> >
> > In order to set a constant number of reducers:
> >
> >   set mapred.reduce.tasks=
> >
> > Starting Job = job_201303051835_0020, Tracking URL =
> > http://NHCLT-PC44-2:50030/jobdetails.jsp?jobid=job_201303051835_0020
> > 
> >
> > Kill Command = /home/hadoop/hadoop-1.0.3/bin/hadoop job  -kill
> > job_201303051835_0020
> >
> > Hadoop job information for Stage-1: number of mappers: 1; number of
> > reducers: 1
> >
> > 2013-03-06 13:22:56,908 Stage-1 map = 0%,  reduce = 0%
> >
> > 2013-03-06 13:23:05,928 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
> > 5.2 sec
> >
> > 2013-03-06 13:23:06,931 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
> > 5.2 sec
> >
> > 2013-03-06 13:23:07,934 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
> > 5.2 sec
> >
> > 2013-03-06 13:23:08,938 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
> > 5.2 sec
> >
> > 2013-03-06 13:23:09,941 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
> > 5.2 sec
> >

Re: Where is the location of hive queries

2013-03-06 Thread Dean Wampler
Or use a variant of the INSERT statement to write to a directory or a table.
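
For example, something along these lines (a sketch; the path and table names are illustrative):

-- write the result set to a local directory as delimited text files
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/mytable_dump'
SELECT * FROM myTable;

-- or materialize it as another table
CREATE TABLE mytable_copy AS
SELECT * FROM myTable;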

On Wed, Mar 6, 2013 at 10:05 AM, Nitin Pawar wrote:

> The results are not stored in any file; they are only available on the
> console.
>
> If you want to save the results, then execute your query like:
> hive -e "query" > file
>
>
> On Wed, Mar 6, 2013 at 9:32 PM, Sai Sai  wrote:
>
>> After we run a query in hive shell as:
>> Select * from myTable;
>>
>> Are these results getting saved to any file apart from the
>> console/terminal display.
>> If so where is the location of the results.
>> Thanks
>> Sai
>>
>
>
>
> --
> Nitin Pawar
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Data mismatch when importing data from Oracle to Hive through Sqoop without an error

2013-03-06 Thread Venkat Ranganathan
Hi Ajit

Do you know whether the rest of the columns are also null when the three
not-null columns are null?

Venkat

On Wed, Mar 6, 2013 at 12:35 AM, Ajit Kumar Shreevastava
 wrote:
> Hi Abhijeet,
>
>
>
> Thanks for your response.
>
> If values that don’t fit in a double were being inserted as NULL, then the
> counts should not mismatch between the two.
>
> Here the NULL values inserted are extra values, on top of the values that
> are already present in both the Oracle table and the Hive table.
>
>
>
> Correct me if I am wrong in interpretation.
>
>
>
> Thanks and Regards,
>
> Ajit Kumar Shreevastava
>
>
>
> From: abhijeet gaikwad [mailto:abygaikwa...@gmail.com]
> Sent: Wednesday, March 06, 2013 1:46 PM
> To: user@hive.apache.org
> Cc: u...@sqoop.apache.org
> Subject: Re: Data mismatch when importing data from Oracle to Hive through
> Sqoop without an error
>
>
>
> Sqoop maps numeric and decimal types (RDBMS) to double (Hive). I think the
> values that don't fit in double must be getting inserted as NULL.
> You can see this warning in your logs.
>
> Thanks,
> Abhijeet
>
> On Wed, Mar 6, 2013 at 1:32 PM, Ajit Kumar Shreevastava
>  wrote:
>
> Hi all,
>
> I have noticed one interesting thing in the result-set below.
>
> I have fired one query in both Oracle and Hive shell and found the following
> result set:
>
>
>
> SQL> select count(1) from bttn
>
>   2  where bttn_id is null or data_inst_id is null or scr_id is null;
>
>
>
>   COUNT(1)
>
> --
>
>  0
>
> hive> select count(1) from bttn
>
> > where bttn_id is null or data_inst_id is null or scr_id is null;
>
> Total MapReduce jobs = 1
>
> Launching Job 1 out of 1
>
> Number of reduce tasks determined at compile time: 1
>
> In order to change the average load for a reducer (in bytes):
>
>   set hive.exec.reducers.bytes.per.reducer=
>
> In order to limit the maximum number of reducers:
>
>   set hive.exec.reducers.max=
>
> In order to set a constant number of reducers:
>
>   set mapred.reduce.tasks=
>
> Starting Job = job_201303051835_0020, Tracking URL =
> http://NHCLT-PC44-2:50030/jobdetails.jsp?jobid=job_201303051835_0020
>
> Kill Command = /home/hadoop/hadoop-1.0.3/bin/hadoop job  -kill
> job_201303051835_0020
>
> Hadoop job information for Stage-1: number of mappers: 1; number of
> reducers: 1
>
> 2013-03-06 13:22:56,908 Stage-1 map = 0%,  reduce = 0%
>
> 2013-03-06 13:23:05,928 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2
> sec
>
> 2013-03-06 13:23:06,931 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2
> sec
>
> 2013-03-06 13:23:07,934 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2
> sec
>
> 2013-03-06 13:23:08,938 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2
> sec
>
> 2013-03-06 13:23:09,941 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2
> sec
>
> 2013-03-06 13:23:10,944 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2
> sec
>
> 2013-03-06 13:23:11,947 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2
> sec
>
> 2013-03-06 13:23:12,956 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2
> sec
>
> 2013-03-06 13:23:13,959 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.2
> sec
>
> 2013-03-06 13:23:14,962 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU
> 5.2 sec
>
> 2013-03-06 13:23:15,965 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU
> 5.2 sec
>
> 2013-03-06 13:23:16,969 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU
> 5.2 sec
>
> 2013-03-06 13:23:17,974 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
> 6.95 sec
>
> 2013-03-06 13:23:18,977 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
> 6.95 sec
>
> 2013-03-06 13:23:19,981 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
> 6.95 sec
>
> 2013-03-06 13:23:20,985 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
> 6.95 sec
>
> 2013-03-06 13:23:21,988 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
> 6.95 sec
>
> 2013-03-06 13:23:22,995 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
> 6.95 sec
>
> 2013-03-06 13:23:23,998 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
> 6.95 sec
>
> MapReduce Total cumulative CPU time: 6 seconds 950 msec
>
> Ended Job = job_201303051835_0020
>
> MapReduce Jobs Launched:
>
> Job 0: Map: 1  Reduce: 1   Cumulative CPU: 6.95 sec   HDFS Read: 184270926
> HDFS Write: 4 SUCCESS
>
> Total MapReduce CPU Time Spent: 6 seconds 950 msec
>
> OK
>
> 986
>
> Time taken: 35.983 seconds
>
> hive>
>
>
>
> and  739169 – 738183=986
>
>
>
> Can anyone tell me why this happened, as BTTN_ID, DATA_INST_ID and SCR_ID
> have NOT NULL constraints in the BTTN table and also form its composite
> primary key?
>
> Also, how can I prevent this unnecessary data from being generated in the
> Hive table?
>
>
>
> Regards
>
> Ajit Kumar Shreevastava
>
>
>
> From: Ajit Kumar Shreevastava
> Sent: Wednesday, March 06, 2013 12:40 PM
> To: 'u...@sqoop.apache.org'
> Cc: user@hive.apache.org
> Subject: Data mismatch when importing data from Oracle to Hive through Sqoop
> without an error
>
>
>
> HI,
>
>
>
> I have a table BTTN in Oracle 

Re: Where is the location of hive queries

2013-03-06 Thread Nitin Pawar
The results are not stored in any file; they are only available on the console.

If you want to save the results, then execute your query like:
hive -e "query" > file


On Wed, Mar 6, 2013 at 9:32 PM, Sai Sai  wrote:

> After we run a query in hive shell as:
> Select * from myTable;
>
> Are these results getting saved to any file apart from the
> console/terminal display.
> If so where is the location of the results.
> Thanks
> Sai
>



-- 
Nitin Pawar


Re: Where is the location of hive queries

2013-03-06 Thread Sai Sai
After we run a query in hive shell as:
Select * from myTable;

Are these results saved to any file apart from the console/terminal display?
If so, where is the location of the results?
Thanks
Sai


Re: Best table storage for analytical use case

2013-03-06 Thread Dean Wampler
MapReduce is very coarse-grained. It might seem that more cores is better,
but once the data sizes get well below the block threshold in size, the
overhead of starting JVM processes and all the other background work becomes a
significant percentage of the overall runtime. So, you quickly reach the
point of diminishing returns. Hadoop wasn't designed for small data, so
this case was never optimized. For big data, you amortize that overhead
over the actual processing time.

It sounds like 1 day is a good granularity. Your production cluster might
use a larger block size, a multiple of 64MB, which is increasingly common
these days.
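
A sketch of the day-partitioned layout being discussed, reusing the logs columns from the original post (the partition column dt and the source filter are assumptions; the real source table would need some date field to filter on):

CREATE TABLE logs_by_day (url STRING, orig_country STRING, http_rt INT)
PARTITIONED BY (dt STRING)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE logs_by_day PARTITION (dt='2013-03-04')
SELECT url, orig_country, http_rt
FROM logs;   -- add a WHERE clause selecting that day's rows

-- a query that filters on dt then reads only the matching directory
SELECT count(*) FROM logs_by_day WHERE dt = '2013-03-04';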

On Wed, Mar 6, 2013 at 9:44 AM, Sékine Coulibaly wrote:

> Hi Dean,
>
> Indeed, switching from RCFiles to SequenceFiles yield a query duration
> down 35% (82secs down to 53secs) ! I added Snappy/Gzip block compression
> altogether. Things are getting better, down to 30secs (sequenceFile+snappy).
>
> Yes, most request have a WHERE clause with a time range, will have
> partitionning a try. For now, my tests span over 1 day long log data. I
> will ingest more of them and partition and see how it goes.
>
> However, it's not clear to me why I should minimize the mappers ? Having
> 16 cores, would it make sens to use as many of them as possible to
> parallelize ? So far, 1 day worth log is 256 MB. In my understanding,
> provided that HDFS has 64MB blocks, I should use 4 mappers right ?
>
> If this is the case, since I'm in pseudo distrib for the moment, my number
> of mappers =1, so I could try to configure my setup with additional mappers.
>
>
> Does this make sense ?
>
> Thank you for your help !
>
> Sekine
>
>
>
>
> 2013/3/4 Dean Wampler 
>
>> RCFile won't help much (and apparently not at all in this case ;) unless you
>> have a lot of columns and you always query just a few of them. However, you
>> should get better results with Sequence Files (binary format) and usually
>> with a compression scheme like BZip that supports block-level (as opposed
>> to file-level) compression.  Why? compressed files and also using sequence
>> files reduces the amount of disk IO and hence improves IO performance (a
>> bottleneck).
>>
>> Do you almost always query with a WHERE clause with a time range? If so,
>> consider partitioning your data by time ranges, e.g., year/month/day. Your
>> actual timestamp granularity would be chosen so that each folder (and yes,
>> they'll be individual folders) has data files at least 64MB or whatever
>> multiple of 64MB you're using in your cluster. It could be that per-day is
>> the finest granularity or even per hour or minute, if you really have a lot
>> of data. Briefly, you want to minimize the number of mapper processes used
>> to process the data, and this is the granularity per mapper. Why partition,
>> because when you do SELECT * FROM mytable WHERE year = 2012 AND month = 3
>> AND day = 4, Hive knows it only has to read the contents of that single
>> directory, not all the directories...
>>
>> You might also consider clustering by URL. This feature (and the others)
>> is described on the Hive wiki. It can also speed up sampling of large data
>> sets and joins.
>>
>> I assume you're just using the virtual machine for experimenting. Lots of
>> overhead there, too!
>>
>> Hope this helps.
>> dean
>>
>> On Mon, Mar 4, 2013 at 4:33 PM, Sékine Coulibaly wrote:
>>
>>> Hi there,
>>>
>>> I've setup a virtual machine hosting Hive.
>>> My use case is a Web traffic analytics, hence most of requests are :
>>>
>>> - how many requests today ?
>>> - how many request today, grouped by country ?
>>> - most requested urls ?
>>> - average http server response time (5 minutes slots) ?
>>>
>>> In other words, lets consider :
>>> CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )
>>> and
>>>
>>> SELECT COUNT(*) FROM logs;
>>> SELECT COUNT(*),orig_country FROM logs GROUP BY orig_country;
>>> SELECT COUNT(*),url FROM logs BROUP BY url;
>>> SELECT AVG(http_rt) FROM logs ...
>>>
>>> 2 questions here :
>>> - How to generate 5 minutes slots to make my averages (in Postgresql, I
>>> used to generate_series() and JOIN) ? I wish I could avoid doing multiple
>>> requests each with a 'WHERE date>... AND date <...'. Maybe a mapper,
>>> mapping the date string to a aslot number ?
>>>
>>> - What is the best storage method pour this table ? Since it's purpose
>>> is analytical, I thought columnar format was the way to go. So I tried
>>> RCFILE buy the results are as follow for around 1 million rows (quite
>>> small, I know) and are quite the opposite I was expecting :
>>>
>>> Storage / query duration / disk table size
>>> TEXTFILE / 22 seconds / 250MB
>>> RCFILE / 31 seconds / 320 MB
>>>
>>>  I thought getting values in columns would speed up the aggregate
>>> process. Maybe the dataset is too small to tell, or I missed something ?
>>> Will adding Snappy compression help (not sure whether RCFiles are
>>> compressed or not) ?
>>>
>>> Thank you !
>>>
>>>
>>>
>>>
>>
>>
>> --
>> *Dean Wample

Re: Best table storage for analytical use case

2013-03-06 Thread Sékine Coulibaly
Hi Dean,

Indeed, switching from RCFiles to SequenceFiles yielded a query duration down
35% (82 secs down to 53 secs)! I also added Snappy/Gzip block compression.
Things are getting better, down to 30 secs (SequenceFile + Snappy).

Yes, most requests have a WHERE clause with a time range, so I will give
partitioning a try. For now, my tests span 1 day of log data. I will ingest
more of it, partition, and see how it goes.

However, it's not clear to me why I should minimize the mappers. Having 16
cores, wouldn't it make sense to use as many of them as possible to
parallelize? So far, 1 day's worth of logs is 256 MB. In my understanding,
provided that HDFS has 64MB blocks, I should use 4 mappers, right?

If this is the case, since I'm in pseudo-distributed mode for the moment, my
number of mappers is 1, so I could try to configure my setup with additional
mappers.


Does this make sense ?

Thank you for your help !

Sekine
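
On the 5-minute-slot question from the original post, one approach (a sketch; it assumes a request timestamp column, here called ts, stored as 'yyyy-MM-dd HH:mm:ss' text) is to derive the slot inside the query instead of joining against a generated series:

SELECT floor(unix_timestamp(ts) / 300) * 300 AS slot_start_epoch,
       avg(http_rt)                          AS avg_http_rt
FROM logs
GROUP BY floor(unix_timestamp(ts) / 300) * 300;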




2013/3/4 Dean Wampler 

> RCFile won't help much (and apparently not at all in this case ;) unless you
> have a lot of columns and you always query just a few of them. However, you
> should get better results with Sequence Files (binary format) and usually
> with a compression scheme like BZip that supports block-level (as opposed
> to file-level) compression.  Why? compressed files and also using sequence
> files reduces the amount of disk IO and hence improves IO performance (a
> bottleneck).
>
> Do you almost always query with a WHERE clause with a time range? If so,
> consider partitioning your data by time ranges, e.g., year/month/day. Your
> actual timestamp granularity would be chosen so that each folder (and yes,
> they'll be individual folders) has data files at least 64MB or whatever
> multiple of 64MB you're using in your cluster. It could be that per-day is
> the finest granularity or even per hour or minute, if you really have a lot
> of data. Briefly, you want to minimize the number of mapper processes used
> to process the data, and this is the granularity per mapper. Why partition,
> because when you do SELECT * FROM mytable WHERE year = 2012 AND month = 3
> AND day = 4, Hive knows it only has to read the contents of that single
> directory, not all the directories...
>
> You might also consider clustering by URL. This feature (and the others)
> is described on the Hive wiki. It can also speed up sampling of large data
> sets and joins.
>
> I assume you're just using the virtual machine for experimenting. Lots of
> overhead there, too!
>
> Hope this helps.
> dean
>
> On Mon, Mar 4, 2013 at 4:33 PM, Sékine Coulibaly wrote:
>
>> Hi there,
>>
>> I've setup a virtual machine hosting Hive.
>> My use case is a Web traffic analytics, hence most of requests are :
>>
>> - how many requests today ?
>> - how many request today, grouped by country ?
>> - most requested urls ?
>> - average http server response time (5 minutes slots) ?
>>
>> In other words, lets consider :
>> CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )
>> and
>>
>> SELECT COUNT(*) FROM logs;
>> SELECT COUNT(*),orig_country FROM logs GROUP BY orig_country;
>> SELECT COUNT(*),url FROM logs GROUP BY url;
>> SELECT AVG(http_rt) FROM logs ...
>>
>> 2 questions here:
>> - How to generate 5-minute slots for my averages (in PostgreSQL, I used
>> generate_series() and JOIN)? I wish I could avoid doing multiple requests,
>> each with a 'WHERE date > ... AND date < ...'. Maybe a mapper, mapping the
>> date string to a slot number? (See the sketch at the end of this thread.)
>>
>> - What is the best storage method for this table? Since its purpose is
>> analytical, I thought a columnar format was the way to go. So I tried RCFILE,
>> but the results are as follows for around 1 million rows (quite small, I
>> know) and are quite the opposite of what I was expecting:
>>
>> Storage / query duration / disk table size
>> TEXTFILE / 22 seconds / 250MB
>> RCFILE / 31 seconds / 320 MB
>>
>>  I thought getting values in columns would speed up the aggregate
>> process. Maybe the dataset is too small to tell, or I missed something ?
>> Will adding Snappy compression help (not sure whether RCFiles are
>> compressed or not) ?
>>
>> Thank you !
>>
>>
>>
>>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>
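On the 5-minute-slot question quoted above, a minimal sketch of one approach,
assuming the logs table also carries a request timestamp column (called ts
here, a placeholder) stored as a 'yyyy-MM-dd HH:mm:ss' string; unix_timestamp,
floor and from_unixtime are built-in Hive UDFs:

-- each row falls into its 300-second bucket, so no generate_series()-style
-- driving table is needed
SELECT from_unixtime(floor(unix_timestamp(ts) / 300) * 300) AS slot_start,
       AVG(http_rt) AS avg_rt
FROM logs
GROUP BY from_unixtime(floor(unix_timestamp(ts) / 300) * 300);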


Re: Oozie - using with datastax hadoop - cassandra file system

2013-03-06 Thread Viral Bajaria
Though I would love to hear the rationale behind using the DataStax Hadoop
distribution, we can do that off the list (I will email you separately). But this
list is for Hive-related questions, and since the error is in Oozie you will
be better off asking this question on the Oozie mailing list.

-Viral

On Wed, Mar 6, 2013 at 2:39 AM, shreehari padaki
wrote:

>
> Hi All,
>
> We are using DataStax Hadoop with Cassandra. Now we are trying to run
> a job through Oozie, but while running the workflow job we are getting
> the error below:
>
> java.io.IOException: No FileSystem for scheme: cfs
>
> We have added the property
>
> <property>
>   <name>oozie.filesystems.supported</name>
>   <value>hdfs,cfs</value>
>   <description>
>     Enlist the different filesystems supported for federation.
>   </description>
> </property>
>
> in the oozie-default.xml file.
>
> We are using Oozie version 2.3.2.
>
> Please can someone help me fix this?
>
> Thanks
> Shreehari
>
>
>
>


Oozie - using with datastax hadoop - cassandra file system

2013-03-06 Thread shreehari padaki


Hi All,


We are using DataStax Hadoop with Cassandra. Now we are trying to run a job
through Oozie, but while running the workflow job we are getting the error below:


java.io.IOException: No FileSystem for scheme: cfs

We have added the property

<property>
  <name>oozie.filesystems.supported</name>
  <value>hdfs,cfs</value>
  <description>
    Enlist the different filesystems supported for federation.
  </description>
</property>

in the oozie-default.xml file.

We are using Oozie version 2.3.2.

Please can someone help me fix this?

Thanks
Shreehari

Re: Read map value from a table

2013-03-06 Thread Sai Sai
Here is my data, from a file which I have successfully loaded into a table
called test. The following query returns the data successfully:

Select * from test;

Name    ph    category


Name1    ph1    {"type":1000,"color":200,"shape":610}
Name2    ph2    {"type":2000,"color":200,"shape":150}
Name3    ph3    {"type":3000,"color":700,"shape":167}

But when I execute this query:

select category["type"] from test;

I get null values;

Please help.
Thanks
Sai
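Without seeing the DDL it is hard to say, but for reference a minimal sketch of
how a MAP column is typically declared and queried (delimiters and types here
are placeholders, not Sai's actual table). A common cause of NULL lookups is
that the stored keys do not exactly match the lookup key, for example when they
carry surrounding quote characters or braces from a JSON-style source file:

CREATE TABLE test (name STRING, ph STRING, category MAP<STRING, INT>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';

-- with that layout a data line looks like:
--   Name1<TAB>ph1<TAB>type:1000,color:200,shape:610
SELECT category['type'] FROM test;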


Re: Hive insert into RCFILE issue with timestamp columns

2013-03-06 Thread Sékine Coulibaly
Prasad,

Isn't the fractional part of the TIMESTAMP type supposed to be optional, as
per the error message:

Failed with exception
java.io.IOException:java.lang.IllegalArgumentException: Timestamp format
must be yyyy-mm-dd hh:mm:ss[.fffffffff]

Should we understand that the 9 digits of the fractional part are mandatory?
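For what it's worth, a quick way to probe this — a minimal sketch, assuming
some existing one-row table t to select from (t is a placeholder; a failed cast
should come back as NULL rather than throwing):

-- if the first cast is non-NULL, the fractional part is indeed optional
SELECT CAST('1900-01-02 02:00:21'   AS TIMESTAMP),
       CAST('1900-01-02 02:00:21.0' AS TIMESTAMP)
FROM t LIMIT 1;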

Thanks




2013/3/6 Prasad Mujumdar 

> Dilip,
>
>    Looks like you are using the data file from the original schema for this new
> table that has a single timestamp column. When I tried with just the
> timestamp from your data, the query ran fine. I guess the original issue
> you hit was on data that didn't have a fractional part (1969-12-31 19:00:00, no
> .f).
>
> thanks
> Prasad
>
>
> On Tue, Mar 5, 2013 at 2:56 PM, Dileep Kumar wrote:
>
>> --hdfs dfs -mkdir /hive/tpcds/date_ts
>>
>> create external table date_ts
>> (
>> d_date timestamp
>> )
>> row format delimited fields terminated by '|'
>> location '/hive/tpcds/date_ts';
>>
>> [cloudera@localhost tmp-work]$ hive -e "select * from date_ts"
>> Logging initialized using configuration in
>> file:/etc/hive/conf.dist/hive-log4j.properties
>> Hive history
>> file=/tmp/cloudera/hive_job_log_cloudera_201303052251_950655265.txt
>> OK
>> Failed with exception
>> java.io.IOException:java.lang.IllegalArgumentException: Timestamp format
>> must be yyyy-mm-dd hh:mm:ss[.fffffffff]
>> Time taken: 3.556 seconds
>> [cloudera@localhost tmp-work]$ hdfs dfs -cat /hive/tpcds/date_ts/*
>> 2415022|OKJNECAA|1900-01-02
>> 02:00:21.0|0|1|1|1900|1|1|2|1|1900|1|1|Monday|1900Q1|N|N|Y|2415021|2415020|2414657|2414930|N|N|N|N|N|
>>
>>
>>
>>
>>
>> On Mon, Mar 4, 2013 at 6:00 PM, Dileep Kumar wrote:
>>
>>> No.
>>> Here are the errors:
>>> Task with the most failures(4):
>>> -
>>> Task ID:
>>>   task_1361599885844_0013_m_00
>>>
>>> URL:
>>>
>>> http://localhost.localdomain:50030/taskdetails.jsp?jobid=job_1361599885844_0013&tipid=task_1361599885844_0013_m_00
>>> -
>>> Diagnostic Messages for this Task:
>>> Error: java.lang.RuntimeException:
>>> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
>>> processing row
>>> {"d_date_sk":2415022,"d_date_id":"OKJNECAA","d_date":"1969-12-31
>>> 19:00:00","d_month_seq":0,"d_week_seq":1,"d_quarter_seq":1,"d_year":1900,"d_dow":1,"d_moy":1,"d_dom":2,"d_qoy":1,"d_fy_year":1900,"d_fy_quarter_seq":1,"d_fy_week_seq":1,"d_day_name":"Monday","d_quarter_name":"1900Q1","d_holiday":"N","d_weekend":"N","d_following_holiday":"Y","d_first_dom":2415021,"d_last_dom":2415020,"d_same_day_ly":2414657,"d_same_day_lq":2414930,"d_current_day":"N","d_current_week":"N","d_current_month":"N","d_current_quarter":"N","d_current_year":"N"}
>>> at
>>> org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161)
>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>>> at
>>> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:399)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:396)
>>> at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
>>> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive
>>> Runtime Error while processing row
>>> {"d_date_sk":2415022,"d_date_id":"OKJNECAA","d_date":"1969-12-31
>>> 19:00:00","d_month_seq":0,"d_week_seq":1,"d_quarter_seq":1,"d_year":1900,"d_dow":1,"d_moy":1,"d_dom":2,"d_qoy":1,"d_fy_year":1900,"d_fy_quarter_seq":1,"d_fy_week_seq":1,"d_day_name":"Monday","d_quarter_name":"1900Q1","d_holiday":"N","d_weekend":"N","d_following_holiday":"Y","d_first_dom":2415021,"d_last_dom":2415020,"d_same_day_ly":2414657,"d_same_day_lq":2414930,"d_current_day":"N","d_current_week":"N","d_current_month":"N","d_current_quarter":"N","d_current_year":"N"}
>>> at
>>> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:548)
>>> at
>>> org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
>>> ... 8 more
>>> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error
>>> evaluating d_date
>>> at
>>> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:80)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>>> at
>>> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>>> at
>>> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperato

Re: Hive insert into RCFILE issue with timestamp columns

2013-03-06 Thread Prasad Mujumdar
Dilip,

   Looks like you are using the data file from the original schema for this new
table that has a single timestamp column. When I tried with just the
timestamp from your data, the query ran fine. I guess the original issue
you hit was on data that didn't have a fractional part (1969-12-31 19:00:00, no
.f).
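One way to look at the raw data without the SerDe throwing — a minimal sketch,
not from this thread, with date_ts_raw a placeholder name: declare the column
as STRING so nothing is parsed at read time, then cast on the fly; rows whose
first '|'-delimited field is not a valid timestamp should simply show NULL for
the cast:

CREATE EXTERNAL TABLE date_ts_raw (d_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/hive/tpcds/date_ts';

SELECT d_date, CAST(d_date AS TIMESTAMP) FROM date_ts_raw LIMIT 10;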

thanks
Prasad

On Tue, Mar 5, 2013 at 2:56 PM, Dileep Kumar wrote:

> --hdfs dfs -mkdir /hive/tpcds/date_ts
>
> create external table date_ts
> (
> d_date timestamp
> )
> row format delimited fields terminated by '|'
> location '/hive/tpcds/date_ts';
>
> [cloudera@localhost tmp-work]$ hive -e "select * from date_ts"
> Logging initialized using configuration in
> file:/etc/hive/conf.dist/hive-log4j.properties
> Hive history
> file=/tmp/cloudera/hive_job_log_cloudera_201303052251_950655265.txt
> OK
> Failed with exception
> java.io.IOException:java.lang.IllegalArgumentException: Timestamp format
> must be yyyy-mm-dd hh:mm:ss[.fffffffff]
> Time taken: 3.556 seconds
> [cloudera@localhost tmp-work]$ hdfs dfs -cat /hive/tpcds/date_ts/*
> 2415022|OKJNECAA|1900-01-02
> 02:00:21.0|0|1|1|1900|1|1|2|1|1900|1|1|Monday|1900Q1|N|N|Y|2415021|2415020|2414657|2414930|N|N|N|N|N|
>
>
>
>
>
> On Mon, Mar 4, 2013 at 6:00 PM, Dileep Kumar wrote:
>
>> No.
>> Here are the errors:
>> Task with the most failures(4):
>> -
>> Task ID:
>>   task_1361599885844_0013_m_00
>>
>> URL:
>>
>> http://localhost.localdomain:50030/taskdetails.jsp?jobid=job_1361599885844_0013&tipid=task_1361599885844_0013_m_00
>> -
>> Diagnostic Messages for this Task:
>> Error: java.lang.RuntimeException:
>> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
>> processing row
>> {"d_date_sk":2415022,"d_date_id":"OKJNECAA","d_date":"1969-12-31
>> 19:00:00","d_month_seq":0,"d_week_seq":1,"d_quarter_seq":1,"d_year":1900,"d_dow":1,"d_moy":1,"d_dom":2,"d_qoy":1,"d_fy_year":1900,"d_fy_quarter_seq":1,"d_fy_week_seq":1,"d_day_name":"Monday","d_quarter_name":"1900Q1","d_holiday":"N","d_weekend":"N","d_following_holiday":"Y","d_first_dom":2415021,"d_last_dom":2415020,"d_same_day_ly":2414657,"d_same_day_lq":2414930,"d_current_day":"N","d_current_week":"N","d_current_month":"N","d_current_quarter":"N","d_current_year":"N"}
>> at
>> org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161)
>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:399)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:396)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
>> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime
>> Error while processing row
>> {"d_date_sk":2415022,"d_date_id":"OKJNECAA","d_date":"1969-12-31
>> 19:00:00","d_month_seq":0,"d_week_seq":1,"d_quarter_seq":1,"d_year":1900,"d_dow":1,"d_moy":1,"d_dom":2,"d_qoy":1,"d_fy_year":1900,"d_fy_quarter_seq":1,"d_fy_week_seq":1,"d_day_name":"Monday","d_quarter_name":"1900Q1","d_holiday":"N","d_weekend":"N","d_following_holiday":"Y","d_first_dom":2415021,"d_last_dom":2415020,"d_same_day_ly":2414657,"d_same_day_lq":2414930,"d_current_day":"N","d_current_week":"N","d_current_month":"N","d_current_quarter":"N","d_current_year":"N"}
>> at
>> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:548)
>> at
>> org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
>> ... 8 more
>> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error
>> evaluating d_date
>> at
>> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:80)
>> at
>> org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>> at
>> org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>> at
>> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
>> at
>> org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>> at
>> org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>> at
>> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:529)
>> ... 9 more
>> Caused by: java.lang.IllegalArgumentException: Timestamp format must be
>> yyyy-mm-dd hh:mm:ss[.fffffffff]
>> at java.sql.Timestamp.valueOf(Timestamp.java:185)
>> at
>> org.apache.hadoop.hive.serde2.lazy.LazyTimestamp.init(LazyTimestamp.java:74)
>> at
>> org.apache.hadoop.hive.serde2.lazy.LazyStruct.uncheckedGetField(LazyStruct.java:219)
>> at
>> org.apache.hadoop.hive.serde2.lazy.LazyStr