RE: Add partition data to an external ORC table.

2016-02-11 Thread no jihun
Actually, the original source is a Flume stream of Avro-formatted rows.
The Flume sink streams them into HDFS partition directories.

Current data flow:
flume > avro > hdfs sink > daily partition dir

My ideal flow would be:
flume > orc > hdfs sink > partition dir

Another option:
flume > hdfs sink
then a Hive 'LOAD DATA' command,
so that Hive loads the text and stores it as ORC.

Because a large amount of data has to be processed, the HDFS sink distributes
the load. If I used Flume's Hive sink, the Hive daemon might become a
bottleneck, I think.

There seem to be many cases where people convert Avro to ORC.
If their previous data flow was based on Flume + HDFS sink, I am curious how
they did it in detail.
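
So far the only approach I can think of is a Hive staging table over the day's
Avro directory, letting Hive rewrite the rows as ORC, roughly like this (not
tested; the staging location '/message_avro/20160212' and the single 'message'
column are just my assumptions about the Avro schema):

-- Rough sketch only: convert one day's Avro files into the ORC table 'test'.

-- Staging external table over the Avro files Flume wrote for that day.
CREATE EXTERNAL TABLE message_avro_staging (
  message STRING)
STORED AS AVRO
LOCATION '/message_avro/20160212';

-- Let Hive rewrite the day's rows as ORC into the target partition.
INSERT OVERWRITE TABLE test PARTITION (date_string='20160212')
SELECT message FROM message_avro_staging;
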
On 2016. 2. 12. at 4:34 AM, "Ryan Harris" wrote:

> If your original source is text, why don't you make your ORC-based table a
> Hive managed table instead of an external table?
>
> Then you can load/partition your text data into the external table, query
> from that and insert into your ORC-backed Hive managed table.
>
>
>
> Theoretically, if you had your data in ORC files, you could just copy them
> to the external table/partition like you do with the text data, but the
> challenge is, how are you going to create the ORC source data?  You can
> create it with Hive, Pig, custom Java, etc, but **somehow** you are going
> to have to get your data into ORC format.  Hive is probably the easiest
> tool to use to do that.  You could load the data into a hive managed table,
> and then copy the ORC files back to an external table, but why?
>
>
>
> *From:* no jihun [mailto:jees...@gmail.com]
> *Sent:* Thursday, February 11, 2016 11:48 AM
> *To:* user@hive.apache.org
> *Subject:* Add partition data to an external ORC table.
>
>
>
> Hello.
>
> I want to know whether this is possible or not.
>
> There would be a table created by:
>
> CREATE EXTERNAL TABLE test (
>   message STRING)
> PARTITIONED BY (date_string STRING)
> STORED AS ORC
> LOCATION '/message';
>
> With this table I will never add rows with an 'INSERT' statement.
> Instead I want to:
> #1. Add each day's data to the partition location on HDFS directly,
>   e.g. /message/20160212
>   (by $ hadoop fs -put)
> #2. Then add the partition every morning:
> ALTER TABLE test
> ADD PARTITION (date_string='20160212')
> LOCATION '/message/20160212';
> #3. Query the added data.
>
> With this scenario, what can I do, or how can I prepare the ORC-formatted
> data in step #1? When the stored format is TEXTFILE I just need to copy the
> raw file to the partition directory, but with an ORC table I don't think
> this is possible so easily.
>
> The raw application log is JSON formatted and each day may have 1M JSON rows.
>
> Actually I already do this job on my cluster with a TEXTFILE table, not
> ORC. Now I am trying to change the table format.
>
> Any advice would be great.
> Thanks


RE: Add partition data to an external ORC table.

2016-02-11 Thread Ryan Harris
If your original source is text, why don't you make your ORC-based table a Hive
managed table instead of an external table?
Then you can load/partition your text data into the external table, query from 
that and insert into your ORC-backed Hive managed table.

Theoretically, if you had your data in ORC files, you could just copy them to 
the external table/partition like you do with the text data, but the challenge 
is, how are you going to create the ORC source data?  You can create it with 
Hive, Pig, custom Java, etc, but *somehow* you are going to have to get your 
data into ORC format.  Hive is probably the easiest tool to use to do that.  
You could load the data into a hive managed table, and then copy the ORC files 
back to an external table, but why?
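
Roughly something like this (untested sketch; the table and column names are
only placeholders):

-- Untested sketch of the text -> ORC path; names are placeholders.

-- External table over the raw text/JSON files, one directory per day.
CREATE EXTERNAL TABLE messages_text (
  message STRING)
PARTITIONED BY (date_string STRING)
STORED AS TEXTFILE
LOCATION '/message_raw';

ALTER TABLE messages_text ADD PARTITION (date_string='20160212')
LOCATION '/message_raw/20160212';

-- Hive-managed ORC table that queries actually run against.
CREATE TABLE messages_orc (
  message STRING)
PARTITIONED BY (date_string STRING)
STORED AS ORC;

-- Hive does the ORC conversion during the insert.
INSERT OVERWRITE TABLE messages_orc PARTITION (date_string='20160212')
SELECT message FROM messages_text WHERE date_string='20160212';
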

From: no jihun [mailto:jees...@gmail.com]
Sent: Thursday, February 11, 2016 11:48 AM
To: user@hive.apache.org
Subject: Add partition data to an external ORC table.


Hello.

I want to know whether this is possible or not.

There would be a table created by:

CREATE EXTERNAL TABLE test (
  message STRING)
PARTITIONED BY (date_string STRING)
STORED AS ORC
LOCATION '/message';

With this table I will never add rows with an 'INSERT' statement.
Instead I want to:
#1. Add each day's data to the partition location on HDFS directly,
  e.g. /message/20160212
  (by $ hadoop fs -put)
#2. Then add the partition every morning:
ALTER TABLE test
ADD PARTITION (date_string='20160212')
LOCATION '/message/20160212';
#3. Query the added data (a rough end-to-end sketch of this daily job follows below).
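
The daily job I have in mind looks roughly like this (just a sketch, not tested;
messages_20160212.orc is a placeholder, and producing that ORC file is exactly
the part I am asking about):

#!/bin/bash
# Sketch of the intended daily job (not tested).
# Producing the ORC file itself is the open question in step #1.
DAY=20160212

# 1. Copy the day's data into its partition directory on HDFS.
hadoop fs -mkdir -p /message/${DAY}
hadoop fs -put messages_${DAY}.orc /message/${DAY}/

# 2. Register the partition with Hive.
hive -e "ALTER TABLE test ADD IF NOT EXISTS PARTITION (date_string='${DAY}') LOCATION '/message/${DAY}';"

# 3. The data should then be queryable, e.g.:
hive -e "SELECT COUNT(*) FROM test WHERE date_string='${DAY}';"
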

With this scenario, what can I do, or how can I prepare the ORC-formatted data in
step #1? When the stored format is TEXTFILE I just need to copy the raw file to the
partition directory, but with an ORC table I don't think this is possible so easily.

The raw application log is JSON formatted and each day may have 1M JSON rows.

Actually I already do this job on my cluster with a TEXTFILE table, not ORC. Now I
am trying to change the table format.

Any advice would be great.
Thanks
