subquery syntax error

2012-06-05 Thread Avdeev V . M .
Hello!

I can't understand what's wrong.

 

drop table if exists t0;
create table t0 (short_descr string) STORED AS SEQUENCEFILE;
from (
 select cast(2 as int) identifier2, short_descr area_name from t0) asdf
select *--area_name
order by identifier2 desc, area_name asc
;

The query works fine, but if I remove the * and select only area_name, then

drop table if exists t0;
create table t0 (short_descr string) STORED AS SEQUENCEFILE;
from (
 select cast(2 as int) identifier2, short_descr area_name from t0) asdf
select area_name
order by identifier2 desc, area_name asc
;

fails with error "FAILED: Error in semantic analysis: Line 6:9 Invalid table 
alias or column reference identifier2"
The query

drop table if exists t0;
create table t0 (short_descr string) STORED AS SEQUENCEFILE;
from (
 select short_descr area_name from t0) asdf
select area_name
order by area_name asc
;

also works fine. What is wrong with columns that are not mapped into the outer select list?
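
One possible workaround sketch, assuming this Hive version requires every column in the
outer ORDER BY to also appear in the outer select list (same table and aliases as above):

from (
 select cast(2 as int) identifier2, short_descr area_name from t0) asdf
select area_name, identifier2
order by identifier2 desc, area_name asc;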

Thanks!


Re[2]: table design and performance questions

2012-06-02 Thread Avdeev V . M .
Thanks for the information, Ruben.

1. I found the issue https://issues.apache.org/jira/browse/HIVE-1642. Does it mean 
that the MAPJOIN hint has been obsolete since 2010 and that I can avoid the hint 
entirely?

2. Sorry for the naive questions, but I still can't understand bucketing. 
Partitioning is fine: it maps to HDFS folders and I can see how it improves 
query execution. But what is bucketing in terms of how the data is stored?

3. I'm embarrassed to ask such basic questions, but is there a 'how Hive works' 
manual or something similar?

And again, sorry for my bad English.

Vyacheslav

Tue, 29 May 2012 10:02:14 +0200 from Ruben de Vries:
> Partitioning can greatly increase performance for WHERE clauses, since Hive 
> can skip the data in partitions that do not meet the predicate.
> 
> For example, if you partition by date (I do it with an INT dateint column, which I 
> set to YYYYMMDD) and you do WHERE dateint >= 20120101, then Hive 
> won't even have to touch any of the data from before 2012-01-01. In my 
> case that means I skip the previous two years of data, reducing the time the 
> query takes by about 70% :-)
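
A minimal sketch of the layout described above; the table and its columns are
hypothetical, and dateint follows the YYYYMMDD convention mentioned:

-- Hypothetical fact table partitioned by an integer date key (YYYYMMDD).
create table page_views (
    user_id bigint,
    url string)
PARTITIONED BY (dateint int)
stored as SEQUENCEFILE;

-- Only partitions with dateint >= 20120101 are scanned; the HDFS directories
-- for older partitions are never read.
select count(*)
from page_views
where dateint >= 20120101;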
> 
> 
> 
> Buckets are the second great way to get a big optimization in, specifically 
> for joins! If you have two tables that you're joining to each other and 
> they're both bucketed on their join column, it will also greatly 
> increase speed.
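
A hedged sketch of what that could look like; table names, column names, and the
bucket count are made up, and the bucket-map-join setting should be verified against
the Hive version in use:

-- Two hypothetical tables bucketed on the join column, same bucket count.
create table orders (
    customer_id bigint,
    amount double)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
stored as SEQUENCEFILE;

create table customers (
    customer_id bigint,
    name string)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
stored as SEQUENCEFILE;

-- With matching buckets Hive can join bucket against bucket instead of
-- shuffling both tables in full.
set hive.optimize.bucketmapjoin = true;
select o.customer_id, c.name, o.amount
from orders o
join customers c on o.customer_id = c.customer_id;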
> 
> Another good join optimization is MAPJOIN: if one of the tables you're 
> joining is rather small (below 30 MB), you can force a MAPJOIN, or you 
> can enable automatic mapjoin. I personally prefer explicit behavior instead 
> of automagic, so I use a hint:
> 
> SELECT /*+ MAPJOIN(the_small_table) */ fields FROM table JOIN 
> the_small_table, etc.
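
The automatic alternative mentioned above is a pair of settings rather than a hint;
a sketch follows, with property names worth double-checking for the Hive version in
use (the size threshold is in bytes; big_table and the column names are made up):

set hive.auto.convert.join = true;
set hive.mapjoin.smalltable.filesize = 30000000;

-- Hive converts this to a map-side join on its own when the_small_table
-- is below the threshold.
select b.key, s.value
from big_table b
join the_small_table s on b.key = s.key;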
> 
> SORTED BY is for sorting within buckets; it's only relevant if you're doing a lot 
> of ordering, I think.
> 
> 
> 
> I'm assuming sequence files are faster, but I wouldn't really know :( We need 
> someone else to tell us more about that ;)
> 
> 
> 
> 
> 
> -Original Message-
> 
> From: Avdeev V. M. [mailto:ls...@list.ru] 
> 
> Sent: Monday, May 28, 2012 7:17 AM
> 
> To: user@hive.apache.org
> 
> Subject: table design and performance questions
> 
> 
> 
> A question from a novice.
> 
> 
> 
> Where can I read about table design best practices? I have a measure table with 
> millions of rows and many dimension tables with fewer than 1,000 rows each. I 
> can't figure out the optimal design for both kinds of tables. Are there 
> performance tuning guides or a performance FAQ?
> 
> 
> 
> Specifically
> 
> 1) The PARTITIONED BY, CLUSTERED BY, and SORTED BY clauses. In which cases does 
> using these clauses make sense?
> 
> 2) The DDL language manual says about the CLUSTERED BY clause: 'This can improve 
> performance on certain kinds of queries.' What kinds of queries can be improved?
> 
> 3) What is preferable - SEQUENCEFILE, RCFILE or TEXTFILE - in terms of 
> performance? What aspects should be taken into account when choosing a file 
> format?
> 
> 4) The Compressed Storage article says 'Keeping data compressed in Hive tables 
> has, in some cases, been known to give better performance than uncompressed 
> storage;' and again: what are these cases?
> 
> 
> 
> Thanks!
> 
> Vyacheslav
> 
> 


table design and performance questions

2012-05-27 Thread Avdeev V . M .
A question from a novice.

Where can I read about table design best practices? I have a measure table with 
millions of rows and many dimension tables with fewer than 1,000 rows each. I 
can't figure out the optimal design for both kinds of tables. Are there 
performance tuning guides or a performance FAQ?

Specifically
1) The PARTITIONED BY, CLUSTERED BY, and SORTED BY clauses. In which cases does 
using these clauses make sense?
2) The DDL language manual says about the CLUSTERED BY clause: 'This can improve 
performance on certain kinds of queries.' What kinds of queries can be improved?
3) What is preferable - SEQUENCEFILE, RCFILE or TEXTFILE - in terms of 
performance? What aspects should be taken into account when choosing a file 
format?
4) The Compressed Storage article says 'Keeping data compressed in Hive tables has, 
in some cases, been known to give better performance than uncompressed storage;' and 
again: what are these cases?

Thanks!
Vyacheslav

Re[2]: from-insert-select trouble

2012-05-22 Thread Avdeev V . M .
Found.

2012-05-22 17:52:47,117 FATAL org.apache.hadoop.mapred.Child: Error running 
child : java.lang.OutOfMemoryError: Java heap space
 at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$Packet.<init>(DFSClient.java:2790)
 at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3733)
 at 
org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
 at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:100)
 at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
 at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
 at java.io.DataOutputStream.write(DataOutputStream.java:90)
 at org.apache.hadoop.hive.ql.io.RCFile$ValueBuffer.write(RCFile.java:450)
 at org.apache.hadoop.hive.ql.io.RCFile$Writer.flushRecords(RCFile.java:867)
 at org.apache.hadoop.hive.ql.io.RCFile$Writer.close(RCFile.java:884)
 at 
org.apache.hadoop.hive.ql.io.RCFileOutputFormat$2.close(RCFileOutputFormat.java:147)
 at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.abortWriters(FileSinkOperator.java:196)
 at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:653)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:557)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
 at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
 at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:193)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
 at org.apache.hadoop.mapred.Child.main(Child.java:264)

I will experiment with a smaller data set!
Thank you, Bejoy! 

Tue, 22 May 2012 03:40:20 -0700 (PDT) from Bejoy Ks:
 
 
 
Hi Vyacheslav
       Can you post the error log from your failed MapReduce tasks? You can 
get the error logs from the JobTracker web UI by drilling down to the task level. 
Those errors will give you a better understanding of what could be 
going wrong here.

Regards
Bejoy

 
 
 
  
--
 From: Avdeev V. M. 
 To: user@hive.apache.org 
 Sent: Tuesday, May 22, 2012 3:50 PM
 Subject: from-insert-select trouble
   
Hello!

I'm very new to the world of Hadoop and Hive, so I cannot solve a problem that I 
have encountered.

Hadoop has been deployed on a single-node in a pseudo-distributed mode.
I'm trying to copy data from one table to another. The source table was created by 
Sqoop; the destination table was created by this query:



create table if not exists rev0.operation_list (
    id bigint,
    id_paper bigint,
    
    lgot_code int,
    id_region int,
    id_tarif_type int,
    id_annulate int,
    id_from int,
    id_to int,
    id_train int,
    id_emitent int,
    id_carriage int,
    id_place int,
    id_ticket_type int,

    sell_date string,
    trip_date string,

    amount int,
    cash int,
    ticket_count int,
    price_tarif_place int,
    price_tarif_transfer int,
    km float,
    passengers int,
    pkm float)
PARTITIONED BY(id_sell_date string)
stored as RCFILE;



The source table contains about 23,000,000 rows. When I try to execute 



set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

from rev0.operation_list_temp
insert overwrite table rev0.operation_list PARTITION(id_sell_date)
select
    id,
    id_paper,
    lgot_code,
    id_region,
    id_tarif_type,
    id_annulate,
    id_from,
    id_to,
    id_train,
    id_emitent,
    id_carriage,
    id_place,
    id_ticket_type,

    sell_date,
    trip_date,

    amount,
    cash,
    ticket_count,
    price_tarif_place,
    price_tarif_transfer,
    km,
    passengers,
    pkm,

    to_date(sell_date) id_sell_date;



I see a strange progress report:



Hive history file=/tmp/user/hive_job_log_user_201205221419_1856534995.txt
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201205191141_0110, Tracking URL = 
http://localhost:50030/jobdetails.jsp?jobid=job_201205191141_0110
Kill Command = /usr/lib/hadoop/bin/hadoop job  
-Dmapred.job.tracker=localhost:8021 -kill job_201205191141_0110
2012-05-22 14:19:59,092 Stage-1 map = 0%,  reduce = 0%
2012-05-22 14:21:00,000 Stage-1 map = 0%,  reduce = 0%
2012-05-22 14:21:46,527 Stage-1 map = 13%,  reduce = 0%
2012-05-22 14:21:52,664 Stage-1 map = 41%,  reduce = 0%
2012-05-22 14:22:53,357 Stage-1 map = 41%,  reduce = 0%
2012-05-22 14:23:06,747 Stage-1 map = 63%,  reduce = 0%

from-insert-select trouble

2012-05-22 Thread Avdeev V . M .
Hello!

I'm very new to the world of Hadoop and Hive, so I cannot solve a problem that I 
have encountered.

Hadoop has been deployed on a single-node in a pseudo-distributed mode.
I'm trying to copy data from one table to another. The source table was created by 
Sqoop; the destination table was created by this query:



create table if not exists rev0.operation_list (
    id bigint,
    id_paper bigint,

    lgot_code int,
    id_region int,
    id_tarif_type int,
    id_annulate int,
    id_from int,
    id_to int,
    id_train int,
    id_emitent int,
    id_carriage int,
    id_place int,
    id_ticket_type int,

    sell_date string,
    trip_date string,

    amount int,
    cash int,
    ticket_count int,
    price_tarif_place int,
    price_tarif_transfer int,
    km float,
    passengers int,
    pkm float)
PARTITIONED BY(id_sell_date string)
stored as RCFILE;



The source table contains about 23,000,000 rows. When I try to execute 



set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

from rev0.operation_list_temp
insert overwrite table rev0.operation_list PARTITION(id_sell_date)
select
    id,
    id_paper,
    lgot_code,
    id_region,
    id_tarif_type,
    id_annulate,
    id_from,
    id_to,
    id_train,
    id_emitent,
    id_carriage,
    id_place,
    id_ticket_type,

    sell_date,
    trip_date,

    amount,
    cash,
    ticket_count,
    price_tarif_place,
    price_tarif_transfer,
    km,
    passengers,
    pkm,

    to_date(sell_date) id_sell_date;



I see a strange progress report:



Hive history file=/tmp/user/hive_job_log_user_201205221419_1856534995.txt
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201205191141_0110, Tracking URL = 
http://localhost:50030/jobdetails.jsp?jobid=job_201205191141_0110
Kill Command = /usr/lib/hadoop/bin/hadoop job  
-Dmapred.job.tracker=localhost:8021 -kill job_201205191141_0110
2012-05-22 14:19:59,092 Stage-1 map = 0%,  reduce = 0%
2012-05-22 14:21:00,000 Stage-1 map = 0%,  reduce = 0%
2012-05-22 14:21:46,527 Stage-1 map = 13%,  reduce = 0%
2012-05-22 14:21:52,664 Stage-1 map = 41%,  reduce = 0%
2012-05-22 14:22:53,357 Stage-1 map = 41%,  reduce = 0%
2012-05-22 14:23:06,747 Stage-1 map = 63%,  reduce = 0%
2012-05-22 14:23:28,409 Stage-1 map = 75%,  reduce = 0%
2012-05-22 14:24:29,322 Stage-1 map = 75%,  reduce = 0%
2012-05-22 14:25:28,276 Stage-1 map = 88%,  reduce = 0%
2012-05-22 14:25:31,397 Stage-1 map = 50%,  reduce = 0% <-- my comment: 88% 
drops to 50%!
2012-05-22 14:26:32,332 Stage-1 map = 50%,  reduce = 0%
2012-05-22 14:27:02,701 Stage-1 map = 63%,  reduce = 0%
2012-05-22 14:28:03,314 Stage-1 map = 63%,  reduce = 0%
2012-05-22 14:28:21,919 Stage-1 map = 75%,  reduce = 0%
2012-05-22 14:29:22,023 Stage-1 map = 75%,  reduce = 0%
2012-05-22 14:30:22,081 Stage-1 map = 75%,  reduce = 0%
2012-05-22 14:30:32,182 Stage-1 map = 88%,  reduce = 0%
2012-05-22 14:30:34,227 Stage-1 map = 50%,  reduce = 0% <-- my comment: again!
2012-05-22 14:31:34,948 Stage-1 map = 50%,  reduce = 0%
2012-05-22 14:32:01,198 Stage-1 map = 63%,  reduce = 0%
2012-05-22 14:33:01,904 Stage-1 map = 63%,  reduce = 0%
2012-05-22 14:33:20,150 Stage-1 map = 75%,  reduce = 0%
2012-05-22 14:34:21,127 Stage-1 map = 75%,  reduce = 0%
2012-05-22 14:35:22,018 Stage-1 map = 75%,  reduce = 0%
2012-05-22 14:35:33,295 Stage-1 map = 88%,  reduce = 0%
2012-05-22 14:35:43,137 Stage-1 map = 50%,  reduce = 0% <-- my comment: and 
again!
2012-05-22 14:36:44,057 Stage-1 map = 50%,  reduce = 0%
2012-05-22 14:37:17,486 Stage-1 map = 63%,  reduce = 0%
2012-05-22 14:38:18,116 Stage-1 map = 63%,  reduce = 0%
2012-05-22 14:38:36,327 Stage-1 map = 75%,  reduce = 0%
2012-05-22 14:39:36,936 Stage-1 map = 75%,  reduce = 0%
2012-05-22 14:40:37,660 Stage-1 map = 75%,  reduce = 0%
2012-05-22 14:40:41,731 Stage-1 map = 88%,  reduce = 0%
2012-05-22 14:40:43,759 Stage-1 map = 50%,  reduce = 0% <-- my comment: last 
one!
2012-05-22 14:40:47,815 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201205191141_0110 with errors
FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.MapRedTask



I cannot understand why the process completed with an error, or why the 
progress of the map phase is so strange.

I have found two 'workarounds':
1) Divide the original query into two by adding 'WHERE to_date(sell_date) < 
to_date(border_date)' and 'WHERE to_date(sell_date) >= to_date(border_date)'. 
As a result each query handles about 11,500,000 rows and the copying process 
completes without errors (see the sketch after this list).
2) Alternatively, change 'stored as RCFILE' to 'stored as SEQUENCEFILE' and keep the 
query without a WHERE predicate. Again, the query completes without errors.
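
A sketch of workaround 1, where '2012-01-01' is only a placeholder for whichever 
border_date is chosen; the second run would be the same statement with the 
complementary predicate (to_date(sell_date) >= '2012-01-01'):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- First half of the data only; overwritten partitions are just the ones
-- receiving rows from this run.
from rev0.operation_list_temp
insert overwrite table rev0.operation_list PARTITION(id_sell_date)
select
    id, id_paper,
    lgot_code, id_region, id_tarif_type, id_annulate, id_from, id_to,
    id_train, id_emitent, id_carriage, id_place, id_ticket_type,
    sell_date, trip_date,
    amount, cash, ticket_count, price_tarif_place, price_tarif_transfer,
    km, passengers, pkm,
    to_date(sell_date) id_sell_date
where to_date(sell_date) < '2012-01-01';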
I have no idea about this behavior. Maybe I have not enough knowledge to 
unde