This looks like a bug in Hive. See below:

hive> set hive.merge.mapfiles=true;
hive> explain from netflix insert overwrite table t1 select movie_id insert overwrite table t2 select user_id;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF netflix)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB t1)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL movie_id)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB t2)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL user_id)))))
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-5 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-5
  Stage-8 depends on stages: Stage-2
  Stage-1 depends on stages: Stage-8

STAGE PLANS:
  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        netflix
          TableScan
            alias: netflix
            Select Operator
              expressions:
                    expr: movie_id
                    type: string
              outputColumnNames: _col0
              File Output Operator
                compressed: true
                GlobalTableId: 1
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: t1
            Select Operator
              expressions:
                    expr: user_id
                    type: string
              outputColumnNames: _col0
              File Output Operator
                compressed: true
                GlobalTableId: 2
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: t2

  Stage: Stage-5
    Conditional Operator
      list of dependent Tasks:
          Move Operator
            files:
                hdfs directory: true
                destination: hdfs://hdpnn.cm3:9000/group/tbdev/zhoumin/hive-tmp/794320195/10000
          Map Reduce
            Alias -> Map Operator Tree:
              hdfs://hdpnn.cm3:9000/group/tbdev/zhoumin/hive-tmp/570535800/10004
                  Reduce Output Operator
                    sort order:
                    Map-reduce partition columns:
                          expr: rand()
                          type: double
                    tag: -1
                    value expressions:
                          expr: foo
                          type: string
            Reduce Operator Tree:
              Extract
                File Output Operator
                  compressed: true
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      name: t1

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: t1

  Stage: Stage-8
    Conditional Operator
      list of dependent Tasks:
          Move Operator
            files:
                hdfs directory: true
                destination: hdfs://hdpnn.cm3:9000/group/tbdev/zhoumin/hive-tmp/794320195/10002
          Map Reduce
            Alias -> Map Operator Tree:
              hdfs://hdpnn.cm3:9000/group/tbdev/zhoumin/hive-tmp/570535800/10005
                  Reduce Output Operator
                    sort order:
                    Map-reduce partition columns:
                          expr: rand()
                          type: double
                    tag: -1
                    value expressions:
                          expr: bar
                          type: string
            Reduce Operator Tree:
              Extract
                File Output Operator
                  compressed: true
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      name: t2

  Stage: Stage-1
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: t2

-----------------------------------------------------------------------------

hive> set hive.merge.mapfiles=false;
hive> explain from netflix insert overwrite table t1 select movie_id insert overwrite table t2 select user_id;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF netflix)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB t1)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL movie_id)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB t2)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL user_id)))))

STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        netflix
          TableScan
            alias: netflix
            Select Operator
              expressions:
                    expr: movie_id
                    type: string
              outputColumnNames: _col0
              File Output Operator
                compressed: true
                GlobalTableId: 1
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: t1
            Select Operator
              expressions:
                    expr: user_id
                    type: string
              outputColumnNames:
                  _col0
              File Output Operator
                compressed: true
                GlobalTableId: 2
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: t2

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: t1

From the explain output above, we can see that in the second case (hive.merge.mapfiles=false) the plan generates only one Move Operator, for t1. That is why t2 ends up empty.

Regards,
Min

On Wed, Jan 13, 2010 at 5:01 PM, Zheng Shao <zsh...@gmail.com> wrote:
> https://issues.apache.org/jira/browse/HIVE-634
> As far as I know there is nobody working on that right now. If you are
> interested, we can work together on that.
> Let's move the discussion to the JIRA.
>
> Zheng
>
> On Tue, Jan 12, 2010 at 3:27 AM, Anty <anty....@gmail.com> wrote:
>>
>> Thanks Zheng.
>> We have used RegexSerDe in some use cases, but the speed is indeed slower,
>> so we don't want to use regular expressions if not necessary.
>>
>> I found HIVE-634 is what I need: it allows the user to specify a field
>> delimiter in any format.
>>
>> INSERT OVERWRITE LOCAL DIRECTORY '/mnt/daily_timelines'
>> [ ROW FORMAT DELIMITED | SERDE ... ]
>> [ FILE FORMAT ...]
>> SELECT * FROM daily_timelines;
>>
>> Is somebody still working on this feature?
>>
>> On Tue, Jan 12, 2010 at 2:28 PM, Zheng Shao <zsh...@gmail.com> wrote:
>> > Yes, we only support one-byte delimiters for performance reasons.
>> >
>> > You can use the RegexSerDe in the contrib package for any row format
>> > that allows a regular expression (including your case "<>"), but the
>> > speed will be slower.
>> >
>> > Zheng
>> >
>> > On Mon, Jan 11, 2010 at 5:54 PM, Anty <anty....@gmail.com> wrote:
>> >>
>> >> Thanks Zheng.
>> >> It does work.
>> >> I have another question: if the field delimiter is a string, e.g.
>> >> "<>", it looks like LazySimpleSerDe can't handle it. Does
>> >> LazySimpleSerDe not support string field delimiters, only one-byte
>> >> control characters?
>> >>
>> >> On Tue, Jan 12, 2010 at 3:05 AM, Zheng Shao <zsh...@gmail.com> wrote:
>> >> > For your second question, currently we can do it with a little extra
>> >> > work:
>> >> > 1. Create an external table on the target directory with the field
>> >> > delimiter you want;
>> >> > 2. Run the query and insert overwrite the target external table.
>> >> >
>> >> > For the first question we can also do a similar thing (create a
>> >> > bunch of external tables and then insert), but I think we should fix
>> >> > the problem.
>> >> >
>> >> > Zheng
>> >> >
>> >> > On Mon, Jan 11, 2010 at 8:31 AM, Anty <anty....@gmail.com> wrote:
>> >> >> HI:
>> >> >> I came across the same problem: there is no data. I have one
>> >> >> more question: can I specify the field delimiter for the output
>> >> >> file, not just the default ctrl-a field delimiter?
>> >> >>
>> >> >> On Fri, Jan 8, 2010 at 2:23 PM, wd <w...@wdicc.com> wrote:
>> >> >>> hi,
>> >> >>>
>> >> >>> I've tried the hive svn version; it seems this bug still exists.
>> >> >>>
>> >> >>> svn st -v
>> >> >>>
>> >> >>> 896805 896744 namit .
>> >> >>> 896805 894292 namit eclipse-templates
>> >> >>> 896805 894292 namit eclipse-templates/.classpath
>> >> >>> 896805 765509 zshao eclipse-templates/TestHive.launchtemplate
>> >> >>> 896805 765509 zshao eclipse-templates/TestMTQueries.l
>> >> >>> ..........
>> >> >>>
>> >> >>> svn revision 896805?
>> >> >>>
>> >> >>> Following is the execution log.
>> >> >>>
>> >> >>> hive> from test
>> >> >>>     > INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/0' select * where a = 1
>> >> >>>     > INSERT OVERWRITE LOCAL DIRECTORY '/home/stefdong/tmp/1' select * where a = 3;
>> >> >>> Total MapReduce jobs = 1
>> >> >>> Launching Job 1 out of 1
>> >> >>> Number of reduce tasks is set to 0 since there's no reduce operator
>> >> >>> Starting Job = job_201001071716_4691, Tracking URL =
>> >> >>> http://abc.com:50030/jobdetails.jsp?jobid=job_201001071716_4691
>> >> >>> Kill Command = hadoop job -Dmapred.job.tracker=abc.com:9001 -kill job_201001071716_4691
>> >> >>> 2010-01-08 14:14:55,442 Stage-2 map = 0%, reduce = 0%
>> >> >>> 2010-01-08 14:15:00,643 Stage-2 map = 100%, reduce = 0%
>> >> >>> Ended Job = job_201001071716_4691
>> >> >>> Copying data to local directory /home/stefdong/tmp/0
>> >> >>> Copying data to local directory /home/stefdong/tmp/0
>> >> >>> 13 Rows loaded to /home/stefdong/tmp/0
>> >> >>> 9 Rows loaded to /home/stefdong/tmp/1
>> >> >>> OK
>> >> >>> Time taken: 9.409 seconds
>> >> >>>
>> >> >>> thx.
>> >> >>>
>> >> >>> 2010/1/6 wd <w...@wdicc.com>
>> >> >>>>
>> >> >>>> hi,
>> >> >>>>
>> >> >>>> A single insert can extract data into '/tmp/out/1'. I can even see
>> >> >>>> "xxx rows loaded to '/tmp/out/0', xxx rows loaded to '/tmp/out/1'" etc.
>> >> >>>> in multi inserts, but there is no data in fact.
>> >> >>>>
>> >> >>>> Haven't tried an svn revision; will try it today. thx.
>> >> >>>>
>> >> >>>> 2010/1/5 Zheng Shao <zsh...@gmail.com>
>> >> >>>>>
>> >> >>>>> Looks like a bug.
>> >> >>>>> What is the svn revision of Hive?
>> >> >>>>>
>> >> >>>>> Did you verify that single insert into '/tmp/out/1' produces
>> >> >>>>> non-empty files?
>> >> >>>>>
>> >> >>>>> Zheng
>> >> >>>>>
>> >> >>>>> On Tue, Jan 5, 2010 at 12:51 AM, wd <w...@wdicc.com> wrote:
>> >> >>>>> > In the hive wiki:
>> >> >>>>> >
>> >> >>>>> > Hive extension (multiple inserts):
>> >> >>>>> > FROM from_statement
>> >> >>>>> > INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
>> >> >>>>> > [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
>> >> >>>>> >
>> >> >>>>> > I'm trying to use hive multi inserts to extract data from hive to
>> >> >>>>> > local disk. Following is the hql:
>> >> >>>>> >
>> >> >>>>> > from test_tbl
>> >> >>>>> > INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/0' select * where id%10=0
>> >> >>>>> > INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/1' select * where id%10=1
>> >> >>>>> > INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out/2' select * where id%10=2
>> >> >>>>> >
>> >> >>>>> > This hql can execute, but only /tmp/out/0 has a datafile in it;
>> >> >>>>> > the other directories are empty. Why does this happen? A bug?
>> >> >>>>>
>> >> >>>>> --
>> >> >>>>> Yours,
>> >> >>>>> Zheng
>> >> >>
>> >> >> --
>> >> >> Best Regards
>> >> >> Anty Rao

--
My research interests are distributed systems, parallel computing and bytecode-based virtual machines.
My profile: http://www.linkedin.com/in/coderplay
My blog: http://coderplay.javaeye.com
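[Editorial sketch] Zheng's two-step workaround quoted above (create an external table over the target directory with the row format you want, then insert overwrite it instead of using INSERT OVERWRITE DIRECTORY) can be written in HiveQL roughly as follows. This is a sketch, not a tested recipe: the table name `export_out`, the columns, the `'|'` delimiter, and the paths are hypothetical, and exact syntax may vary by Hive version.

```sql
-- 1. Create an external table over the target directory with a
--    one-byte field delimiter of your choosing (here '|').
--    LazySimpleSerDe only supports single-byte delimiters; for a
--    multi-character delimiter like "<>" you would need RegexSerDe
--    or the HIVE-634 syntax discussed in this thread.
CREATE EXTERNAL TABLE export_out (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/tmp/out/0';

-- 2. Run the query and overwrite the external table; the files under
--    /tmp/out/0 then contain plain text rows delimited by '|'
--    rather than the default ctrl-A.
INSERT OVERWRITE TABLE export_out
SELECT id, name FROM test_tbl WHERE id % 10 = 0;
```

For the multi-insert case, one such external table per output directory would be needed, which is the "bunch of external tables" Zheng mentions; it sidesteps the empty-directory bug because each INSERT targets a real table with its own Move task.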