Actually, the original source is a Flume stream of Avro-formatted rows.
The Flume sink streams into an HDFS partition directory.

Current data flow:
Flume > Avro > HDFS sink > daily partition dir

My preferred flow:
Flume > ORC > HDFS sink > partition dir

Another option:
Flume > HDFS sink
then Hive's LOAD DATA command,
letting Hive load the text into an ORC-formatted table.

Because a large amount of data has to be processed, the HDFS sink distributes the
load.
If I used Flume's Hive sink, the Hive daemon might become a bottleneck, I think.
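As a rough sketch, the current HDFS-sink setup could look like this (the agent and sink names and the path are assumptions; the point is the date escape in hdfs.path, which produces the daily partition directory):

```properties
# Hypothetical Flume agent config: Avro events land in a daily HDFS partition dir.
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = /message/%Y%m%d
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.serializer = avro_event
```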
There seem to be many cases of converting Avro to ORC.
If their previous data flow was based on Flume + HDFS sink, I am curious how
they did it in detail.
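The second option (text in HDFS, then conversion inside Hive) could be sketched roughly like this; the staging table name and the JSON field path are assumptions, and the target is the ORC table from the original question below:

```sql
-- Hypothetical staging table over the raw text files that Flume wrote.
CREATE EXTERNAL TABLE message_raw (json_line STRING)
LOCATION '/message_raw/20160212';

-- Let Hive do the text-to-ORC conversion on insert.
INSERT OVERWRITE TABLE test PARTITION (date_string='20160212')
SELECT get_json_object(json_line, '$.message')
FROM message_raw;
```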
On Feb 12, 2016, 4:34 AM, "Ryan Harris" wrote:
> If your original source is text, why don't you make your ORC-based table a
> Hive managed table instead of an external table?
>
> Then you can load/partition your text data into the external table, query
> from that and insert into your ORC-backed Hive managed table.
>
>
>
> Theoretically, if you had your data in ORC files, you could just copy them
> to the external table/partition like you do with the text data, but the
> challenge is, how are you going to create the ORC source data? You can
> create it with Hive, Pig, custom Java, etc, but **somehow** you are going
> to have to get your data into ORC format. Hive is probably the easiest
> tool to use to do that. You could load the data into a hive managed table,
> and then copy the ORC files back to an external table, but why?
>
>
>
> *From:* no jihun [mailto:jees...@gmail.com]
> *Sent:* Thursday, February 11, 2016 11:48 AM
> *To:* user@hive.apache.org
> *Subject:* Add partition data to an external ORC table.
>
>
>
> hello.
>
> I want to know whether this is possible or not.
>
> There is a table created by:
>
> create external table test (
> message String)
> PARTITIONED BY (date_string STRING)
> STORED AS ORC
> LOCATION '/message';
>
> With this table,
> I will never add rows with an 'insert' statement,
> but instead want to:
> #1. Add each day's data directly to the HDFS partition location,
> e.g. /message/20160212
> (via $ hadoop fs -put)
> #2. Then add the partition every morning:
> ALTER TABLE test
> ADD PARTITION (date_string='20160212')
> LOCATION '/message/20160212';
> #3. Query the added data.
>
> With this scenario, how can I prepare the ORC-formatted data in
> step #1? When the stored format is TEXTFILE I just need to copy the raw file
> to the partition directory, but with an ORC table I don't think this is
> possible so easily.
>
> The raw application log is JSON-formatted, and each day may have 1M JSON rows.
>
> Actually, I already do this job on my cluster with a TEXTFILE table, not ORC.
> Now I am trying to change the table format.
>
> Any advice would be great.
> Thanks.
>