We're trying to insert into a table using dynamic partitions, but the
query runs for a while and then dies with a LeaseExpiredException.  The
Hadoop details and some discussion are at
https://issues.apache.org/jira/browse/HDFS-198  Is there a way to configure
Hive, or our query, to work around this?  If we adjust the query to handle
less data at once (e.g. only a slice of projects, as sketched below the
full query), it can complete in under 10 minutes, but then we have to run
it many more times to get all the data processed.

The query is:
FROM (
    FROM (
        SELECT file, os, country, dt, project
        FROM downloads WHERE dt='2010-10-01'
        DISTRIBUTE BY project
        SORT BY project asc, file asc
    ) a
    SELECT TRANSFORM(file, os, country, dt, project)
    USING 'transformwrap reduce.py'
    AS (file, downloads, os, country, project)
) b
INSERT OVERWRITE TABLE dl_day PARTITION (dt='2010-10-01', project)
SELECT file, downloads, os, country, FALSE, project
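
By "handle less data at once" we mean restricting the inner query to a
slice of projects and running the job repeatedly with different slices;
the project predicate below is only illustrative:

FROM (
    FROM (
        SELECT file, os, country, dt, project
        FROM downloads
        WHERE dt='2010-10-01'
          AND project >= 'a' AND project < 'b'  -- illustrative slice
        DISTRIBUTE BY project
        SORT BY project asc, file asc
    ) a
    SELECT TRANSFORM(file, os, country, dt, project)
    USING 'transformwrap reduce.py'
    AS (file, downloads, os, country, project)
) b
INSERT OVERWRITE TABLE dl_day PARTITION (dt='2010-10-01', project)
SELECT file, downloads, os, country, FALSE, project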

The project partition has roughly 100000 values.
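
For reference, a load like this also needs the dynamic-partition limits
raised; we run with settings roughly like the following (the numbers are
illustrative, not our exact values):

SET hive.exec.dynamic.partition=true;
-- nonstrict isn't strictly required here since dt is static; shown for completeness
SET hive.exec.dynamic.partition.mode=nonstrict;
-- must cover the ~100000 distinct project values
SET hive.exec.max.dynamic.partitions=200000;
SET hive.exec.max.dynamic.partitions.pernode=200000;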

We're using Hive trunk from about a month ago, with Hadoop
0.18.3-14.cloudera.CH0_3.

-- 
Dave Brondsema
Software Engineer
Geeknet

www.geek.net
