Re: clustered bucket and tablesample
| Storage Desc Params:  | NULL                  | NULL  |
|                       | serialization.format  | 1     |
+-----------------------+-----------------------+-------+

select count(1) from bucket_x;
INFO  : OK
+-------+
|  c0   |
+-------+
| 1000  |
+-------+

Now let us iterate through all those 256 buckets with the following. Create the script ${IN_FILE} and run it against Hive. In my case it is pretty fast, as I use Hive on the Spark engine.

function genrandomnumber {
  integer BUCKETNUMBER=1
  integer BUCKETS=256
  while ((BUCKETNUMBER <= BUCKETS))
  do
    echo "SELECT ${BUCKETNUMBER} AS BucketNumber, COUNT(1) AS Occurance FROM bucket_x TABLESAMPLE(BUCKET ${BUCKETNUMBER} OUT OF ${BUCKETS} ON classifier='cl_900');" >> ${IN_FILE}
    ((BUCKETNUMBER = BUCKETNUMBER + 1))
  done
}

And the results don't make sense!

+---------------+------------+
| bucketnumber  | occurance  |
+---------------+------------+
| 1             | 999        |
| 2             | 1          |
| 3             | 0          |
| 4             | 0          |
+---------------+------------+

So apparently bucket 1 has 999 entries and bucket 2 has 1.

Let us try it for classifier='cl_103':

+---------------+------------+
| bucketnumber  | occurance  |
+---------------+------------+
| 1             | 999        |
| 2             | 1          |
| 3             | 0          |
+---------------+------------+

The same result. To me, hash partitioning on a string column is unpredictable; with an integer it is fine. I believe there is an underlying bug here. The alternative is to use an integer as a surrogate column for hash partitioning, like a sequence in Oracle or an identity in Sybase/MSSQL.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 15 May 2016 at 12:29, no jihun <jees...@gmail.com> wrote:
>> OK Talebzadeh, thanks.
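For what it's worth, Hive's default bucketing for a string column is commonly described as `(hashCode & Integer.MAX_VALUE) % numBuckets`, where the string hash matches Java's `String.hashCode` for ASCII data. A quick sketch under that assumption (the helper names here are mine, not Hive APIs) to check which bucket a given classifier value should land in:

```python
def java_string_hashcode(s):
    """Java String.hashCode: h = 31*h + c over chars, as a signed 32-bit int.
    Assumes ASCII input, where ord(ch) equals the Java char value."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret as signed 32-bit, as Java does.
    return h - 0x100000000 if h >= 0x80000000 else h

def hive_bucket(value, num_buckets):
    """0-indexed bucket under the assumed (hash & Integer.MAX_VALUE) % n rule."""
    return (java_string_hashcode(value) & 0x7FFFFFFF) % num_buckets

print(hive_bucket("cl_900", 256))
```

If a single classifier value hashes to bucket N here but TABLESAMPLE only finds rows in bucket 1, the mismatch is in the sampling expression rather than in the stored data.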
>> Have you ever tried tablesample with a string value hash? Something like
>> tablesample(bucket 1 out of 256 on some_field='somedata')?
>>
>> I wrote a full scenario.
>>
>> # table creation
>> hive> CREATE TABLE `bucket_x` (
>>     >   `classifier` string)
>>     > CLUSTERED BY (classifier)
>>     > INTO 256 BUCKETS
>>     > STORED AS ORC;
>> OK
>>
>> # check the option before inserting data
>> hive> set hive.enforce.bucketing;
>> hive.enforce.bucketing=true
>>
>> # insert 1,000 rows
>> (the number of distinct values is also 1,000)
>>
>> hive> insert into bucket_x values
>> ('cl_0'),('cl_1'),('cl_2'),('cl_3'),('cl_4'),('cl_5'),('cl_6'),('cl_7'),('cl_8'),('cl_9'),('cl_10'),('cl_11'),('cl_12'),('cl_13'),('cl_14'),('cl_15'),('cl_16'),('cl_17'),('cl_18'),('cl_19'),('cl_20'),('cl_21'),('cl_22'),('cl_23'),('cl_24'),('cl_25'),('cl_26'),('cl_27'),('cl_28'),('cl_29'),('cl_30'),('cl_31'),('cl_32'),('cl_33'),('cl_34'),('cl_35'),('cl_36'),('cl_37'),('cl_38'),('cl_39'),('cl_40'),('cl_41'),('cl_42'),('cl_43'),('cl_44'),('cl_45'),('cl_46'),('cl_47'),('cl_48'),('cl_49'),('cl_50'),('cl_51'),('cl_52'),('cl_53'),('cl_54'),('cl_55'),('cl_56'),('cl_57'),('cl_58'),('cl_59'),('cl_60'),('cl_61'),('cl_62'),('cl_63'),('cl_64'),('cl_65'),('cl_66'),('cl_67'),('cl_68'),('cl_69'),('cl_70'),('cl_71'),('cl_72'),('cl_73'),('cl_74'),('cl_75'),('cl_76'),('cl_77'),('cl_78'),('cl_79'),('cl_80'),('cl_81'),('cl_82'),('cl_83'),('cl_84'),('cl_85'),('cl_86'),('cl_87'),('cl_88'),('cl_89'),('cl_90'),('cl_91'),('cl_92'),('cl_93'),('cl_94'),('cl_95'),('cl_96'),('cl_97'),('cl_98'),('cl_99'),('cl_100'),('cl_101'),('cl_102'),('cl_103'),('cl_104'),('cl_105'),('cl_106'),('cl_107'),('cl_108'),
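As a sanity check on the scenario above, one can simulate how the 1,000 distinct classifier values would spread over 256 buckets under the usual string-hash rule (assumed here to match Java's `String.hashCode` for ASCII, with `(hash & Integer.MAX_VALUE) % 256` picking the bucket):

```python
from collections import Counter

def java_string_hashcode(s):
    """Java String.hashCode as a signed 32-bit int (ASCII input assumed)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

# Bucket assignment for cl_0 .. cl_999 over 256 buckets.
counts = Counter(
    (java_string_hashcode(f"cl_{i}") & 0x7FFFFFFF) % 256 for i in range(1000)
)
print(len(counts), "buckets used; largest bucket holds", max(counts.values()), "rows")
```

The simulated spread is close to uniform, so a real table whose TABLESAMPLE reports 999 rows in one bucket is not behaving like this hash over the column values.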
Re: clustered bucket and tablesample
... Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        sort order:
                        Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Otherwise I don't see much happening:

hive> EXPLAIN SELECT COUNT(1) FROM dummy
    > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID)
    > WHERE ID = 10;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:13
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: dummy
                  Statistics: Num rows: 1 Data size: 5662644736 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (false and (id = 10)) (type: boolean)
                    Statistics: Num rows: 2500 Data size: 1415661184 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 2500 Data size: 1415661184 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count(1)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                        Reduce Output Operator
                          sort order:
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

In general, in my experience bucketing is the only area where ORC tables come in handy.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 14 May 2016 at 13:38, no jihun <jees...@gmail.com> wrote:
>> Ah, as I mentioned, both action_id and classifier are of type STRING, and I cannot change the types.
>>
>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>> STORED AS ORC
>>
>> I use two fields for the hash and bucketing because each field on its own is not well distributed.
>>
>> My concern is not about the strength of the hash source but about how
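Related to the bucket pruning the plan above hints at: Hive typically writes one file per bucket, named with a zero-padded task id (e.g. 000000_0 for bucket 1), assuming the default naming. A tiny sketch mapping TABLESAMPLE's 1-indexed bucket number to that file name:

```python
def bucket_file_name(bucket_number, copy=0):
    """File name Hive writes for a given 1-indexed bucket, under default
    00000N_0 naming (an assumption; layouts can differ across versions)."""
    return f"{bucket_number - 1:06d}_{copy}"

# TABLESAMPLE(BUCKET 1 OUT OF 256 ...) should only need the first bucket file:
print(bucket_file_name(1))
```

This is why an effective TABLESAMPLE over 256 buckets should scan roughly 1/256 of the files rather than all of them.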
Re: clustered bucket and tablesample
Ah, as I mentioned, both action_id and classifier are of type STRING, and I cannot change the types.

CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC

I use two fields for the hash and bucketing because each field on its own is not well distributed.

My concern is not about the strength of the hash source, but about how I can tablesample down to the one bucket that holds the field values given in the WHERE clause, when the table is clustered by string fields.

Which one is right for tablesample?

1. provide fields: TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
2. provide values: TABLESAMPLE(BUCKET 1 OUT OF 256 ON 'aaa', 'bbb')

On 14 May 2016 at 8:48 PM, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
> Can action_id be created as a numeric column?
>
> CREATE TABLE X ( action_id bigint, ..)
>
> Bucketing or hash partitioning works best on numeric columns with high cardinality (say a primary key).
>
> From my old notes:
>
> Bucketing in Hive refers to hash partitioning, where a hashing function is applied. Like an RDBMS such as Oracle, Hive will apply a linear hashing algorithm to prevent data from clustering within specific partitions. Hashing is very effective if the column selected for bucketing has very high selectivity, like an ID column where selectivity (select count(distinct(column))/count(column)) = 1. In this case, the created partitions/files will be as evenly sized as possible. In a nutshell, bucketing is a method to get data evenly distributed over many partitions/files. One should define the number of buckets as a power of two -- 2^n, like 2, 4, 8, 16, etc. -- to achieve best results. Again, bucketing will help concurrency in Hive. It may even allow a partition-wise join, i.e. a join between two tables that are bucketed on the same column with the same number of buckets (has anyone tried this?).
>
> One more thing. When one defines the number of buckets at table creation level in Hive, the number of partitions/files will be fixed. In contrast, with partitioning you do not have this limitation.
>
> Can you do
>
>   show create table X
>
> and send the output, please?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 14 May 2016 at 12:23, no jihun <jees...@gmail.com> wrote:
>
>> Hello.
>>
>> I want to ask about the correct way to bucket and tablesample.
>>
>> There is a table X which I created by
>>
>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>> STORED AS ORC
>>
>> Then I inserted 500M rows into X by
>>
>> set hive.enforce.bucketing=true;
>> INSERT OVERWRITE TABLE X SELECT * FROM X_RAW;
>>
>> Then I want to count or search some rows with a condition. Roughly,
>>
>> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>>
>> But I should do better with tablesample, as I clustered X on (action_id, classifier). So the better query would be
>>
>> SELECT COUNT(*) FROM X
>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
>> WHERE action_id='aaa' AND classifier='bbb'
>>
>> Is anything wrong with the above? I can't find any performance gain between these two queries.
>>
>> Query 1 and RESULT (with no tablesample):
>>
>> SELECT COUNT(*) FROM X
>> WHERE action_id='aaa' AND classifier='bbb'
>>
>> VERTICES      STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>> Map 1 .....   SUCCEEDED    256        256        0        0       0       0
>> Reducer 2 ..  SUCCEEDED      1          1        0        0       0       0
>> VERTICES: 02/02  [==========>>] 100%  ELAPSED TIME: 15.35 s
>>
>> It scans the full data.
>>
>> Query 2 and RESULT:
>>
>> SELECT COUNT(*) FROM X
>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
>> WHERE action_id='aaa' AND classifier='bbb'
>>
>> --
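The selectivity rule of thumb quoted earlier in this thread (count(distinct col)/count(col) close to 1 makes a good bucketing key) is easy to check outside Hive as well; a sketch, with sample lists standing in for the real columns:

```python
def selectivity(values):
    """Ratio of distinct values to total values; 1.0 means a perfect bucketing key."""
    return len(set(values)) / len(values)

ids = list(range(10000))               # unique id column: selectivity 1.0
flags = [v % 2 for v in range(10000)]  # two-valued column: selectivity 0.0002
print(selectivity(ids), selectivity(flags))
```

A column pair like (action_id, classifier) only helps bucketing if the combined selectivity is high; two individually skewed columns can still combine into a well-distributed key, which is worth measuring before fixing the bucket count.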
Re: ORC file sort order ..
You can force sorted inserts into a *SORTED BY* table by setting

  set hive.enforce.sorting=true;

https://github.com/apache/hive/blob/branch-1.2/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1131

but this configuration seems to have been removed in 2.0:
https://issues.apache.org/jira/browse/HIVE-12331

2016-04-10 1:41 GMT+09:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
> Have you tried bucketing by the column plus setting orc.create.index and
> orc.bloom.filter.columns?
>
> CREATE TABLE dummy (
>     ID INT
>   , CLUSTERED INT
>   , SCATTERED INT
>   , RANDOMISED INT
>   , RANDOM_STRING VARCHAR(50)
>   , SMALL_VC VARCHAR(10)
>   , PADDING VARCHAR(10)
> )
> CLUSTERED BY (ID) INTO 256 BUCKETS
> STORED AS ORC
> TBLPROPERTIES (
>   "orc.create.index"="true",
>   "orc.bloom.filter.columns"="ID",
>   "orc.bloom.filter.fpp"="0.05",
>   "orc.compress"="SNAPPY",
>   "orc.stripe.size"="16777216",
>   "orc.row.index.stride"="1" );
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 9 April 2016 at 01:53, Gautam <gautamkows...@gmail.com> wrote:
>
>> Hey,
>>
>> This might be too obvious a question, but I haven't found a way to validate ordering in an ORC file. I need each file to be ordered by a column. Is there a sure-shot way of ensuring the sort order in an ORC file is as I expect?
>>
>> The closest I've come is using hive --orcfiledump --rowindex, which prints that column's min/max values in the index. But that still doesn't say whether the data within the stripes is sorted.
>>
>> Cheers,
>> -Gautam.

--
Jihun No ( 노지훈 )
Twitter : @nozisim
Facebook : nozisim
Website : http://jeesim2.godohosting.com
Market Apps : android market products. <https://market.android.com/developer?pub=%EB%85%B8%EC%A7%80%ED%9B%88>
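Since hive.enforce.sorting only controls the insert path, the files themselves still need verifying. `hive --orcfiledump -d <file>` prints one JSON document per row, so a small script can check that a column is non-decreasing across the dump; a sketch (the column name `id` and the dump file path are assumptions for illustration):

```python
import json
import sys

def is_sorted_on(lines, column):
    """Check that `column` is non-decreasing over rows printed as JSON lines
    (e.g. the output of orcfiledump -d)."""
    prev = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        value = row[column]
        if prev is not None and value < prev:
            return False
        prev = value
    return True

if __name__ == "__main__":
    # e.g.  hive --orcfiledump -d /path/to/000000_0 > rows.jsonl
    with open(sys.argv[1]) as f:
        print(is_sorted_on(f, "id"))
```

This checks ordering within one file; per-file ordering is all the SORTED BY clause promises anyway, since buckets are written independently.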
Re: ORC file sort order ..
I think there is no explicit sort info in the ORC metadata, because sorting is just a rule applied at insert time: a table may be created with SORTED BY, and then it must be inserted with sorted data. So we can check both the table spec and the ORC stats.

On 9 April 2016 at 9:57 AM, "Owen O'Malley" wrote:

Use orcfiledump with the -d parameter. It will print the contents of the ORC file. You could also use the file-contents executable from the C++ ORC reader.

.. Owen

On Fri, Apr 8, 2016 at 5:53 PM, Gautam wrote:
> Hey,
>
> This might be too obvious a question, but I haven't found a way to validate ordering in an ORC file. I need each file to be ordered by a column. Is there a sure-shot way of ensuring the sort order in an ORC file is as I expect?
>
> The closest I've come is using hive --orcfiledump --rowindex, which prints that column's min/max values in the index. But that still doesn't say whether the data within the stripes is sorted.
>
> Cheers,
> -Gautam.
DAG Kill while insert into A select * from B
Hi.

I have trouble running the query

  insert into table mylog_orc_bucketing select * from mylog_orc

Job received Kill while in RUNNING state.
Vertex killed, vertexName=Map 1, vertexId=vertex_1459773554571_0009_10_00, diagnostics=
Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_KILL, failedTasks:0 killedTasks:145, Vertex vertex_1459773554571_0009_10_00 [Map 1] killed/failed due to:DAG_KILL
DAG did not succeed due to DAG_KILL. failedVertices:0 killedVertices:1

mylog_orc actually has 300 million rows, and I think my cluster may not have enough resources (RAM, CPU, ...).

Then I tried with a small table:

  insert into table mylog_orc_bucketing select * from mylog_orc_mini

mylog_orc_mini has 300K rows, and this works well with no failure.

So I wonder how I can find the log for the underlying cause of the DAG_KILL of the first query, something like an OOM, which would tell me what I should change or scale out/up.

I tested this on hive 1.2 + tez 0.7. Thanks!
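In my experience the DAG_KILL diagnostic itself rarely carries the root cause; the per-container logs do. After pulling them with `yarn logs -applicationId <appId>`, one can grep for memory errors. A sketch of the scanning part (the patterns listed are the usual suspects, not an exhaustive list):

```python
import re

# Typical signatures of container-memory problems in YARN/Tez logs.
PATTERNS = [
    r"java\.lang\.OutOfMemoryError",
    r"GC overhead limit exceeded",
    r"Container .* is running beyond (physical|virtual) memory limits",
]

def find_memory_errors(log_text):
    """Return the log lines that match any known memory-failure pattern."""
    hits = []
    for line in log_text.splitlines():
        if any(re.search(p, line) for p in PATTERNS):
            hits.append(line)
    return hits

# Usage sketch:
#   yarn logs -applicationId application_1459773554571_0009 > app.log
#   then inspect find_memory_errors(open("app.log").read())
```

If nothing memory-related turns up, the DAG_KILL may instead come from a client timeout or a manual/queue-level kill, which the ResourceManager logs would show.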
hive on tez hadoop-common version problem.
Hi all.

I have a problem with hive-on-tez. The email thread below is forwarded; it was originally written to the tez users list. Could someone please look into this problem? Any advice will be appreciated. Thanks.

---------- Forwarded message ----------
From: no jihun <jees...@gmail.com>
Date: 2016-02-15 17:04 GMT+09:00
Subject: Re: Failing attempt at org.apache.tez.client.TezClient.waitTillReady
To: u...@tez.apache.org

Thanks Hitesh Shah.

By the way, I am sorry, but it is very weird: I can't find any files from a previous Hadoop version that could cause a version conflict on any node. I've tried minimal.tar.gz and tez.use.cluster.hadoop-libs, but the result was the same.

With my setup, the examples from the tez wiki seem to work well:

$ ~/apps/hadoop/bin/hadoop jar tez-dist/target/tez-0.8.2/tez-tests-0.8.2.jar testorderedwordcount -DUSE_TEZ_SESSION=true /tmp/wc1 /tmp/wc1.r /tmp/wc2 /tmp/wc2.r
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home1/irteam/apps/tez-0.8.2/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home1/irteam/apps/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
.
Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
16/02/15 16:27:51 INFO examples.TestOrderedWordCount: DAG 2 completed. FinalState=SUCCEEDED
16/02/15 16:27:51 INFO examples.TestOrderedWordCount: Shutting down session
16/02/15 16:27:51 INFO client.TezClient: Shutting down Tez Session, sessionName=OrderedWordCountSession, applicationId=application_1452243782005_0323

The method with the problem seems to have been added in hadoop 2.4:

org.apache.hadoop.conf.Configuration.addDeprecations([Lorg/apache/hadoop/conf/Configuration$DeprecationDelta;)

hadoop-common jars:

$ find ./ -name hadoop-common*jar
./apps/tez-0.8.2/lib/hadoop-common-2.6.0.jar
./apps/hadoop-2.6.0/share/hadoop/common/hadoop-common-2.6.0.jar
./apps/hadoop-2.6.0/share/hadoop/common/hadoop-common-2.6.0-tests.jar
./apps/hadoop-2.6.0/share/hadoop/common/sources/hadoop-common-2.6.0-sources.jar
./apps/hadoop-2.6.0/share/hadoop/common/sources/hadoop-common-2.6.0-test-sources.jar
./apps/hadoop-2.6.0/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-common-2.6.0.jar
./apps/hadoop-2.6.0/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/hadoop-common-2.6.0.jar

When I print jinfo of the hive cli process, there are no other versions of hadoop jars:

$ jinfo 52974
Attaching to process ID 52974, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.60-b23
Java System Properties:

java.runtime.name = Java(TM) SE Runtime Environment
java.vm.version = 25.60-b23
sun.boot.library.path = /home1/apps/jdk1.8.0_60/jre/lib/amd64
hadoop.root.logger = INFO,console
java.vendor.url = http://java.oracle.com/
java.vm.vendor = Oracle Corporation
path.separator =
file.encoding.pkg = sun.io
java.vm.name = Java HotSpot(TM) 64-Bit Server VM
sun.os.patch.level = unknown
sun.java.launcher = SUN_STANDARD
user.country = KR
user.dir = /home1/apps/apache-hive-1.2.1-bin/bin
java.vm.specification.name = Java Virtual Machine Specification
java.runtime.version = 1.8.0_60-b27
java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
os.arch = amd64
java.endorsed.dirs = /home1/apps/jdk1.8.0_60/jre/lib/endorsed
line.separator =
java.io.tmpdir = /tmp
hadoop.log.file = hadoop.log
java.vm.specification.vendor = Oracle Corporation
os.name = Linux
hadoop.id.str = irteam
sun.jnu.encoding = UTF-8
java.library.path = /home1/apps/hadoop/lib/native
hadoop.home.dir = /home1/apps/hadoop
java.specification.name = Java Platform API Specification
java.class.version = 52.0
java.net.preferIPv4Stack = true
sun.management.compiler = HotSpot 64-Bit Tiered Compilers
os.version = 2.6.32-504.3.3.el6.x86_64
user.home = /home1
user.timezone = ROK
java.awt.printerjob = sun.print.PSPrinterJob
file.encoding = UTF-8
java.specification.version = 1.8
user.name = irteam
java.class.path = /home1/apps/tez-0.8.2/conf
/home1/apps/tez-0.8.2/tez-history-parser-0.8.2.jar
/home1/apps/tez-0.8.2/tez-yarn-timeline-history-with-acls-0.8.2.jar
/home1/apps/tez-0.8.2/tez-examples-0.8.2.jar
/home1/apps/tez-0.8.2/tez-javadoc-tools-0.8.2.jar
/home1/apps/tez-0.8.2/tez-runtime-internals-0.8.2.jar
/home1/apps/tez-0.8.2/hadoop-shim-2.6-0.8.2.jar
/home1/apps/tez-0.8.2/tez-runtime-library-0.8.2.jar
/home1/apps/tez-0.8.2/tez-job-analyzer-0.8.2.jar
/home1/apps/tez-0.8.2/tez-yarn-timeline-h
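One way to hunt for the stale hadoop-common that a NoSuchMethodError on Configuration.addDeprecations implies is to scan every jar on the classpath (e.g. the java.class.path entries from jinfo, split on ':') and group the versions found. A sketch over jar file names (the version-parsing regex assumes the usual `artifact-x.y.z.jar` naming):

```python
import re
from collections import defaultdict

def hadoop_common_versions(classpath_entries):
    """Map each hadoop-common version found on the classpath to its jar paths."""
    versions = defaultdict(list)
    pattern = re.compile(r"hadoop-common-(\d+(?:\.\d+)*)(?:-[a-z]+)?\.jar$")
    for entry in classpath_entries:
        m = pattern.search(entry)
        if m:
            versions[m.group(1)].append(entry)
    return dict(versions)

# Usage sketch: feed it jinfo's java.class.path split on the path separator,
# e.g. hadoop_common_versions(classpath_string.split(":"))
```

More than one version key in the result (or a pre-2.4 version anywhere) would explain the missing addDeprecations method, since that API first appeared in hadoop 2.4.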
RE: Add partition data to an external ORC table.
Actually, the original source is a Flume stream of Avro-formatted rows. The Flume sink streams to an HDFS partition directory.

Current data flow:
  flume > avro > hdfs sink > daily partition dir

My expected best flow:
  flume > orc > hdfs sink > partition dir

Another option:
  flume > hdfs sink, then the Hive 'load data' command

This lets Hive load the text into Hive as ORC formatted. Because a large amount of data has to be processed, the HDFS sink distributes the load; if I used Flume's Hive sink, the Hive daemon could become a bottleneck, I think.

There seem to be many cases of converting Avro to ORC. If their previous data flow was based on Flume + HDFS sink, I am curious how they did it in detail.

On 12 Feb 2016 at 4:34 AM, "Ryan Harris" <ryan.har...@zionsbancorp.com> wrote:
> If your original source is text, why don't you make your ORC-based table a Hive managed table instead of an external table?
>
> Then you can load/partition your text data into the external table, query from that, and insert into your ORC-backed Hive managed table.
>
> Theoretically, if you had your data in ORC files, you could just copy them to the external table/partition as you do with the text data, but the challenge is: how are you going to create the ORC source data? You can create it with Hive, Pig, custom Java, etc., but *somehow* you are going to have to get your data into ORC format. Hive is probably the easiest tool to use to do that. You could load the data into a Hive managed table and then copy the ORC files back to an external table, but why?
>
> *From:* no jihun [mailto:jees...@gmail.com]
> *Sent:* Thursday, February 11, 2016 11:48 AM
> *To:* user@hive.apache.org
> *Subject:* Add partition data to an external ORC table.
>
> hello.
>
> I want to know whether this is possible or not.
> There would be a table created by
>
> create external table test (
>   message String)
> PARTITIONED BY (date_string STRING)
> STORED AS ORC
> LOCATION '/message';
>
> With this table I will never add rows by an 'insert' statement, but want to:
>
> #1. add each day's data to the HDFS partition location directly,
>     e.g. /message/20160212 (by $ hadoop fs -put)
> #2. then add the partition every morning:
>     ALTER TABLE test
>     ADD PARTITION (date_string='20160212')
>     LOCATION '/message/20160212';
> #3. query the added data.
>
> With this scenario, what or how can I prepare the ORC-formatted data in step #1? When the stored format is textfile I just need to copy the raw file to the partition directory, but with an ORC table I don't think this is so easy.
>
> The raw application log is JSON formatted, and each day may have 1M JSON rows.
>
> Actually, I do these jobs on my cluster with a textfile table, not ORC. Now I am trying the ORC table format.
>
> Any advice would be great.
> Thanks
Add partition data to an external ORC table.
Hello.

I want to know whether this is possible or not.

There would be a table created by

create external table test (
  message String)
PARTITIONED BY (date_string STRING)
STORED AS ORC
LOCATION '/message';

With this table I will never add rows by an 'insert' statement, but want to:

#1. add each day's data to the HDFS partition location directly,
    e.g. /message/20160212 (by $ hadoop fs -put)
#2. then add the partition every morning:
    ALTER TABLE test
    ADD PARTITION (date_string='20160212')
    LOCATION '/message/20160212';
#3. query the added data.

With this scenario, what or how can I prepare the ORC-formatted data in step #1? When the stored format is textfile I just need to copy the raw file to the partition directory, but with an ORC table I don't think this is so easy.

The raw application log is JSON formatted, and each day may have 1M JSON rows.

Actually, I do these jobs on my cluster with a textfile table, not ORC. Now I am trying the ORC table format.

Any advice would be great.
Thanks
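The daily step #2 above is easy to script. A sketch that builds the DDL for a given date (the table name, partition column, and base path follow the example above, so adjust as needed; IF NOT EXISTS makes the morning job safe to re-run):

```python
from datetime import date, timedelta

def add_partition_ddl(day, table="test", base="/message"):
    """Build the ALTER TABLE ... ADD PARTITION statement for one day's directory."""
    ds = day.strftime("%Y%m%d")
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (date_string='{ds}') LOCATION '{base}/{ds}'"
    )

# Every morning, register yesterday's directory:
print(add_partition_ddl(date.today() - timedelta(days=1)))
```

The generated statement can then be passed to hive -e (or beeline -e) from cron after the fs -put in step #1 completes.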