[jira] [Commented] (HIVE-16666) Set hive.exec.stagingdir a relative directory or a sub directory of distination data directory will cause Hive to delete the intermediate query results

yangfang (JIRA) Sun, 21 May 2017 19:49:56 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-16666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019097#comment-16019097
 ]


yangfang commented on HIVE-16666:
---------------------------------

[~aihuaxu], [~pvary], thanks for your advice.
 In my opinion, the staging directory is just a temporary directory. Users may 
not care where it is located; they only care about the final result. From the 
user's point of view any staging directory name should be allowed, so throwing 
an exception seems a little harsh.
 Adding a validation against the configuration would not fully solve the 
problem either. For example, suppose /tmp/hive/.hive-staging passes validation 
because it is an empty directory that no one has used. Someone may later create 
a table like this:
 create table test(a int, b string) location '/tmp'
Now the staging directory is a subdirectory of the table data directory, so 
Hive will still delete the intermediate query results during execution.
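
 To illustrate the point above, here is a minimal sketch in Python (not Hive 
code; the helper name is_under is made up for this example). It shows why a 
check done once at configuration time cannot be enough: whether the staging 
directory falls under the table data directory depends on the table location, 
which is only known per query.

```python
from pathlib import PurePosixPath


def is_under(candidate: str, parent: str) -> bool:
    """Return True if candidate equals parent or lies somewhere below it.

    Hypothetical helper for illustration only; it compares plain POSIX-style
    paths and knows nothing about HDFS or Hive.
    """
    c = PurePosixPath(candidate)
    p = PurePosixPath(parent)
    return c == p or p in c.parents


staging = "/tmp/hive/.hive-staging"

# Against the table from the issue description the staging path looks safe.
print(is_under(staging, "/hive/test2"))  # False

# Against a table created later with location '/tmp' it is not.
print(is_under(staging, "/tmp"))  # True
```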
 Looking forward to your comments.

> Set hive.exec.stagingdir a relative directory or a sub directory of 
> distination data directory will cause Hive to delete the intermediate query 
> results
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16666
>                 URL: https://issues.apache.org/jira/browse/HIVE-16666
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 3.0.0
>            Reporter: yangfang
>            Assignee: yangfang
>            Priority: Critical
>         Attachments: HIVE-16666.1.patch
>
>
> Set hive.exec.stagingdir to a relative path (./*), for example 
> set hive.exec.stagingdir=./opq8.
> Then execute a query like this:
> insert overwrite table test2 select * from test3; 
> You will get an error like this:
> hive> set hive.exec.stagingdir=./opq8;
> hive> insert overwrite table test2 select * from test3;
> Query ID = mr_20170515134831_28ee392d-0d5a-4e47-b80c-dfcd31691b02
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1494818119523_0008, Tracking URL = 
> http://zdh77:8088/proxy/application_1494818119523_0008/
> Kill Command = /opt/ZDH/parcels/lib/hadoop/bin/hadoop job  -kill 
> job_1494818119523_0008
> Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
> 2017-05-15 13:48:51,487 Stage-1 map = 0%,  reduce = 0%
> Ended Job = job_1494818119523_0008
> Stage-3 is selected by condition resolver.
> Stage-2 is filtered out by condition resolver.
> Stage-4 is filtered out by condition resolver.
> Moving data to directory 
> hdfs://nameservice/hive/test2/opqt8_hive_2017-05-15_13-48-31_558_6151032330134038151-1/-ext-10000
> Loading data to table default.test2
> Moved: 
> 'hdfs://nameservice/hive/test2/opqt8_hive_2017-05-15_13-48-31_558_6151032330134038151-1'
>  to trash at: hdfs://nameservice/user/mr/.Trash/Current
> Failed with exception Unable to move source 
> hdfs://nameservice/hive/test2/opqt8_hive_2017-05-15_13-48-31_558_6151032330134038151-1/-ext-10000
>  to destination hdfs://nameservice/hive/test2
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.MoveTask. Unable to move source 
> hdfs://nameservice/hive/test2/opqt8_hive_2017-05-15_13-48-31_558_6151032330134038151-1/-ext-10000
>  to destination hdfs://nameservice/hive/test2
> MapReduce Jobs Launched: 
> Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
> Total MapReduce CPU Time Spent: 0 msec
> hive>
> hive.exec.stagingdir=./opq8 is a relative path, resolved against the 
> destination write directory /hive/test2, so Hive creates a temporary 
> directory /hive/test2/opq8_hive* for intermediate query results. Later, in 
> the move stage, Hive deletes or trashes every subdirectory under /hive/test2 
> whose name does not begin with "_" or "." in order to move data into this 
> directory. You can see the processing logic in 
> org.apache.hadoop.hive.ql.metadata.trashFilesUnderDir.
> My fix is: if stagingdir is a subdirectory of the destination write 
> directory, I add a "." in front of stagingdir. The temporary directory then 
> becomes /hive/test2/.opq8_hive*, and because the subdirectory .opq8_hive* 
> starts with ".", Hive will not delete it.
> hive> set hive.exec.stagingdir=./opq8;
> hive>  insert overwrite table test2 select * from test3;
> Query ID = mr_20170515143940_ae48a65e-42be-4f50-b974-b713ca902867
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1494818119523_0012, Tracking URL = 
> http://zdh77:8088/proxy/application_1494818119523_0012/
> Kill Command = /opt/ZDH/parcels/lib/hadoop/bin/hadoop job  -kill 
> job_1494818119523_0012
> Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
> 2017-05-15 14:40:04,547 Stage-1 map = 0%,  reduce = 0%
> Ended Job = job_1494818119523_0012
> Stage-3 is selected by condition resolver.
> Stage-2 is filtered out by condition resolver.
> Stage-4 is filtered out by condition resolver.
> Moving data to directory 
> hdfs://nameservice/hive/test2/.opqt8_hive_2017-05-15_14-39-40_751_1221840798987515724-1/-ext-10000
> Loading data to table default.test2
> MapReduce Jobs Launched: 
> Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
> Total MapReduce CPU Time Spent: 0 msec
> OK
> Time taken: 26.751 seconds
> hive> 
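
The rule described in the quoted description (entries whose names do not begin 
with "_" or "." are trashed before the move) and the proposed dot-prefix fix 
can be sketched as follows. This is a simplified Python model under stated 
assumptions, not the actual Hive implementation: the function names and the 
sample directory listing are hypothetical and exist only for this example.

```python
def is_hidden(name: str) -> bool:
    """Model of the rule from the description: names starting with "_" or "."
    are treated as hidden and are left alone."""
    return name.startswith("_") or name.startswith(".")


def names_to_trash(children: list[str]) -> list[str]:
    """Return the child names that would be trashed before the move
    (hypothetical model, not Hive's code)."""
    return [name for name in children if not is_hidden(name)]


def protect_staging_name(name: str) -> str:
    """Model of the proposed fix: prefix the staging name with "." so that
    it is skipped by the cleanup."""
    return name if is_hidden(name) else "." + name


# Hypothetical contents of the destination directory during the move.
listing = ["opq8_hive_example", "000000_0", "_tmp.example"]
print(names_to_trash(listing))
# ['opq8_hive_example', '000000_0']  (the staging directory is trashed)

fixed = [protect_staging_name(listing[0])] + listing[1:]
print(names_to_trash(fixed))
# ['000000_0']  (the staging directory now survives)
```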



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
