Re: clustered bucket and tablesample

2016-05-15 Thread no jihun
> | Storage Desc Params:  | NULL                  | NULL  |
> |                       | serialization.format  | 1     |
> +---+---+-+--+
>
>
> select count(1) from bucket_x;
> INFO  : OK
> +---+--+
> |  c0   |
> +---+--+
> | 1000  |
> +---+--+
>
> Now let us iterate through all those 256 buckets with the following.
> Create the script ${IN_FILE} with the function below and run it against Hive.
> In my case it is pretty fast as I use Hive on the Spark engine.
>
> function genrandomnumber {
>   integer BUCKETNUMBER=1
>   integer BUCKETS=256
>   while ((BUCKETNUMBER <= BUCKETS))
>   do
>     echo "SELECT ${BUCKETNUMBER} AS BucketNumber, COUNT(1) AS Occurance
> FROM bucket_x TABLESAMPLE(BUCKET ${BUCKETNUMBER} OUT OF ${BUCKETS} ON
> classifier='cl_900');" >> ${IN_FILE}
>     ((BUCKETNUMBER = BUCKETNUMBER + 1))
>   done
> }
>
> And the results don't make sense!
>
> | bucketnumber  | occurance  |
> +---++--+
> | 1 | 999|
> +---++--+
> | bucketnumber  | occurance  |
> +---++--+
> | 2 | 1  |
> +---++--+
> | bucketnumber  | occurance  |
> +---++--+
> | 3 | 0  |
> +---++--+
> | bucketnumber  | occurance  |
> +---++--+
> | 4 | 0  |
>
> So apparently bucket 1 has 999 entries and bucket 2 has 1.
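>
> As a sanity check, the per-bucket distribution of the classifier column
> itself can also be read off in a single query. A rough sketch, assuming the
> hash() UDF matches the hash Hive uses for bucketing on this version (which
> I believe holds for the pre-3.x bucketing scheme):
>
> SELECT pmod(hash(classifier), 256) + 1 AS bucketnumber,
>        COUNT(1) AS cnt
> FROM bucket_x
> GROUP BY pmod(hash(classifier), 256) + 1
> ORDER BY bucketnumber;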
>
> Let us try it for classifier='cl_103'
>
> | bucketnumber  | occurance  |
> +---++--+
> | 1 | 999|
> +---++--+
> | bucketnumber  | occurance  |
> +---++--+
> | 2 | 1  |
> +---++--+
> | bucketnumber  | occurance  |
> +---++--+
> | 3 | 0  |
> +---++--+
> The same crap. To me hash partitioning on a string column is
> unpredictable; with an integer it is fine. I believe there is an underlying
> bug in here. The other alternative is to use an integer as a surrogate column
> for hash partitioning, like a sequence in Oracle or an identity in Sybase/MSSQL.
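>
> As a rough illustration of the surrogate-column idea (the table and column
> names below are assumed, not taken from the thread):
>
> SET hive.enforce.bucketing=true;
>
> CREATE TABLE bucket_y (
>   id          BIGINT,
>   classifier  STRING
> )
> CLUSTERED BY (id) INTO 256 BUCKETS
> STORED AS ORC;
>
> -- generate the surrogate key at load time
> INSERT INTO TABLE bucket_y
> SELECT ROW_NUMBER() OVER (ORDER BY classifier) AS id, classifier
> FROM bucket_x;
>
> With a high-cardinality integer key the 256 buckets should come out close
> to evenly sized.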
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 May 2016 at 12:29, no jihun <jees...@gmail.com> wrote:
>
>> OK Talebzadeh thanks.
>>
>> Have you ever tried TABLESAMPLE with a hash of a string value?
>> Something like 'tablesample(bucket 1 out of 256 on
>> some_field='somedata')'?
>>
>>
>> I wrote a full scenario.
>>
>> # table creation
>> Time taken: 0.155 seconds, Fetched: 36 row(s)
>> hive> CREATE TABLE `bucket_x` (
>> >   `classifier` string)
>> > CLUSTERED BY ( classifier)
>> > INTO 256 BUCKETS
>> > STORED AS ORC;
>> OK
>>
>>
>> # check option before data insert
>> hive> set hive.enforce.bucketing;
>> hive.enforce.bucketing=true
>>
>>
>> # insert 1,000 rows
>> (the number of distinct values is also 1,000)
>>
>> hive > insert into bucket_x
>> values 
>> ('cl_0'),('cl_1'),('cl_2'),('cl_3'),('cl_4'),('cl_5'),('cl_6'),('cl_7'),('cl_8'),('cl_9'),('cl_10'),('cl_11'),('cl_12'),('cl_13'),('cl_14'),('cl_15'),('cl_16'),('cl_17'),('cl_18'),('cl_19'),('cl_20'),('cl_21'),('cl_22'),('cl_23'),('cl_24'),('cl_25'),('cl_26'),('cl_27'),('cl_28'),('cl_29'),('cl_30'),('cl_31'),('cl_32'),('cl_33'),('cl_34'),('cl_35'),('cl_36'),('cl_37'),('cl_38'),('cl_39'),('cl_40'),('cl_41'),('cl_42'),('cl_43'),('cl_44'),('cl_45'),('cl_46'),('cl_47'),('cl_48'),('cl_49'),('cl_50'),('cl_51'),('cl_52'),('cl_53'),('cl_54'),('cl_55'),('cl_56'),('cl_57'),('cl_58'),('cl_59'),('cl_60'),('cl_61'),('cl_62'),('cl_63'),('cl_64'),('cl_65'),('cl_66'),('cl_67'),('cl_68'),('cl_69'),('cl_70'),('cl_71'),('cl_72'),('cl_73'),('cl_74'),('cl_75'),('cl_76'),('cl_77'),('cl_78'),('cl_79'),('cl_80'),('cl_81'),('cl_82'),('cl_83'),('cl_84'),('cl_85'),('cl_86'),('cl_87'),('cl_88'),('cl_89'),('cl_90'),('cl_91'),('cl_92'),('cl_93'),('cl_94'),('cl_95'),('cl_96'),('cl_97'),('cl_98'),('cl_99'),('cl_100'),('cl_101'),('cl_102'),('cl_103'),('cl_104'),('cl_105'),('cl_106'),('cl_107'),('cl_108'),

Re: clustered bucket and tablesample

2016-05-15 Thread no jihun
stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
>   sort order:
>   Statistics: Num rows: 1 Data size: 8 Basic
> stats: COMPLETE Column stats: NONE
>   value expressions: _col0 (type: bigint)
> Reducer 2
> Reduce Operator Tree:
>   Group By Operator
> aggregations: count(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: NONE
> File Output Operator
>   compressed: false
>   Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>   table:
>   input format:
> org.apache.hadoop.mapred.TextInputFormat
>   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
> Fetch Operator
>   limit: -1
>   Processor Tree:
> ListSink
>
>
> Otherwise I don't see much happening
>
> hive>  EXPLAIN SELECT COUNT(1) FROM dummy
> > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID)
> > WHERE ID = 10;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
> Spark
>   Edges:
> Reducer 2 <- Map 1 (GROUP, 1)
>   DagName:
> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:13
>   Vertices:
> Map 1
> Map Operator Tree:
> TableScan
>   alias: dummy
>   Statistics: Num rows: 1 Data size: 5662644736
> Basic stats: COMPLETE Column stats: NONE
>   Filter Operator
> predicate: (false and (id = 10)) (type: boolean)
> Statistics: Num rows: 2500 Data size: 1415661184
> Basic stats: COMPLETE Column stats: NONE
> Select Operator
>   Statistics: Num rows: 2500 Data size: 1415661184
> Basic stats: COMPLETE Column stats: NONE
>   Group By Operator
> aggregations: count(1)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
>   sort order:
>   Statistics: Num rows: 1 Data size: 8 Basic
> stats: COMPLETE Column stats: NONE
>   value expressions: _col0 (type: bigint)
> Reducer 2
> Reduce Operator Tree:
>   Group By Operator
> aggregations: count(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: NONE
> File Output Operator
>   compressed: false
>   Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>   table:
>   input format:
> org.apache.hadoop.mapred.TextInputFormat
>   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
> Fetch Operator
>   limit: -1
>   Processor Tree:
> ListSink
>
> In general, in my experience, bucketing is the only area where ORC
> tables come in handy.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 May 2016 at 13:38, no jihun <jees...@gmail.com> wrote:
>
>> Ah, as I mentioned, both action_id and classifier are STRING fields, and
>> I cannot change the type.
>>
>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>> STORED AS ORC
>>
>> I use two fields for hashing and bucketing because neither field on its own is
>> well distributed.
>>
>> My concern is not about how strong the hash source is, but about how

Re: clustered bucket and tablesample

2016-05-14 Thread no jihun
Ah, as I mentioned, both action_id and classifier are STRING fields, and I
cannot change the type.

CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC

I use two fields for hashing and bucketing because neither field on its own is
well distributed.

My concern is not about how strong the hash source is, but about how I can
TABLESAMPLE down to one bucket using the field values provided in the WHERE clause.

When I cluster by string fields, which one is right for TABLESAMPLE?
1. provide fields
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)

2. provide values
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  'aaa', 'bbb')
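
For reference, the kind of thing I am attempting with option 1 looks like the
sketch below. The bucket number has to be worked out first and then written in
as a literal, and this assumes the hash() UDF matches the hash Hive uses for
bucketing; the names, values and the bucket number 7 are only placeholders.

-- step 1: find which bucket the pair ('aaa', 'bbb') would land in (1-based)
SELECT pmod(hash('aaa', 'bbb'), 256) + 1;

-- step 2: plug that literal bucket number into TABLESAMPLE, keeping the WHERE clause
SELECT COUNT(*)
FROM X TABLESAMPLE(BUCKET 7 OUT OF 256 ON action_id, classifier)
WHERE action_id = 'aaa' AND classifier = 'bbb';
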
On 14 May 2016 at 20:48, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Can action_id be created as a numeric column:
>
> CREATE TABLE X ( action_id bigint,  ..)
>
> Bucketing or hash partitioning works best on numeric columns with high
> cardinality (say a primary key).
>
> From my old notes:
>
> Bucketing in Hive refers to hash partitioning, where a hashing function is
> applied. Like an RDBMS such as Oracle, Hive will apply a linear hashing
> algorithm to prevent data from clustering within specific partitions.
> Hashing is very effective if the column selected for bucketing has very
> high selectivity, like an ID column where selectivity (select
> count(distinct(column))/count(column)) = 1. In this case, the created
> partitions/files will be as evenly sized as possible. In a nutshell,
> bucketing is a method to get data evenly distributed over many
> partitions/files. One should define the number of buckets as a power of
> two (2^n, like 2, 4, 8, 16, etc.) to achieve the best results. Again, bucketing
> will help concurrency in Hive. It may even allow a partition-wise join, i.e.
> a join between two tables that are bucketed on the same column with the
> same number of buckets (has anyone tried this?).
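>
> For the bucketed-join case, the settings usually quoted are along these
> lines; a sketch only (the table names are made up, and the exact behaviour
> is worth verifying on your Hive version):
>
> SET hive.optimize.bucketmapjoin=true;
> -- and, when both tables are also SORTED BY the join key:
> SET hive.optimize.bucketmapjoin.sortedmerge=true;
> SET hive.auto.convert.sortmerge.join=true;
>
> SELECT a.*
> FROM t1 a JOIN t2 b ON a.id = b.id;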
>
>
>
> One more thing: when one defines the number of buckets at table creation
> time in Hive, the number of partitions/files will be fixed. In contrast,
> with partitioning you do not have this limitation.
>
> Can you do
>
> show create table X
>
> and send the output, please?
>
>
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 May 2016 at 12:23, no jihun <jees...@gmail.com> wrote:
>
>> Hello.
>>
>> I want to ask about the correct way to do bucketing and TABLESAMPLE.
>>
>> There is a table X which I created by
>>
>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>> STORED AS ORC
>>
>> Then I inserted 500M rows into X by
>>
>> set hive.enforce.bucketing=true;
>> INSERT OVERWRITE TABLE X SELECT * FROM X_RAW
>>
>> Then I want to count or search some rows with a condition; roughly,
>>
>> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>>
>> But I had better use TABLESAMPLE since I clustered X by (action_id,
>> classifier). So the better query should be
>>
>> SELECT COUNT(*) FROM X
>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>> WHERE action_id='aaa' AND classifier='bbb'
>>
>> Is there anything wrong above? I cannot find any performance gain
>> between these two queries.
>>
>> Query 1 and RESULT (with no tablesample):
>>
>> SELECT COUNT(*) FROM X
>> WHERE action_id='aaa' and classifier='bbb'
>>
>>
>> --------------------------------------------------------------------------------
>>        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
>> --------------------------------------------------------------------------------
>> Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
>> Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
>> --------------------------------------------------------------------------------
>> VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 15.35 s
>> --------------------------------------------------------------------------------
>> It scans the full data.
>>
>> Query 2 and RESULT:
>>
>> SELECT COUNT(*) FROM X
>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>> WHERE action_id='aaa' and classifier='bbb'
>>
>>
>> --

Re: ORC file sort order ..

2016-04-10 Thread no jihun
You can enforce inserting sorted data into a SORTED BY table by setting
hive.enforce.sorting=true:
https://github.com/apache/hive/blob/branch-1.2/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1131

but this configuration seems to have been removed in 2.0:
https://issues.apache.org/jira/browse/HIVE-12331
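
A minimal sketch of that approach on 1.x (the table and column names are
assumed):

set hive.enforce.sorting=true;  -- removed in 2.0 (HIVE-12331), where it is always on

CREATE TABLE dummy_sorted (
  id      INT,
  payload STRING
)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 256 BUCKETS
STORED AS ORC;

INSERT OVERWRITE TABLE dummy_sorted
SELECT id, payload FROM dummy;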

2016-04-10 1:41 GMT+09:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> Have you tried bucketing by the column plus setting orc.create.index and
> orc.bloom.filter.columns?
>
> CREATE TABLE dummy (
>      ID INT
>    , CLUSTERED INT
>    , SCATTERED INT
>    , RANDOMISED INT
>    , RANDOM_STRING VARCHAR(50)
>    , SMALL_VC VARCHAR(10)
>    , PADDING  VARCHAR(10)
> )
> CLUSTERED BY (ID) INTO 256 BUCKETS
> STORED AS ORC
> TBLPROPERTIES (
>   "orc.create.index"="true",
>   "orc.bloom.filter.columns"="ID",
>   "orc.bloom.filter.fpp"="0.05",
>   "orc.compress"="SNAPPY",
>   "orc.stripe.size"="16777216",
>   "orc.row.index.stride"="1" )
> ;
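>
> With the index and bloom filter in place, the benefit should show up on a
> point lookup against the indexed column; as a sketch (the setting name is
> worth double-checking on your version):
>
> SET hive.optimize.index.filter=true;  -- let the ORC reader use its indexes to skip row groups
> SELECT COUNT(1) FROM dummy WHERE ID = 10;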
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 9 April 2016 at 01:53, Gautam <gautamkows...@gmail.com> wrote:
>
>> Hey,
>>
>> This might be too obvious a question, but I haven't found a way
>> to validate ordering in an ORC file. I need each file to be ordered by a
>> column. Is there a sure-shot way of ensuring the sort order in an ORC file
>> is as I expect it?
>>
>> The closest I've come is using the hive --orcfiledump --rowindex
>> option, which prints that column's min/max values in the index. But that
>> still does not say whether the data within the stripes is sorted.
>>
>> Cheers,
>> -Gautam.
>>
>
>


-- 
--
Jihun No ( 노지훈 )
--
Twitter  : @nozisim
Facebook   : nozisim
Website : http://jeesim2.godohosting.com
-
Market Apps   : android market products.
<https://market.android.com/developer?pub=%EB%85%B8%EC%A7%80%ED%9B%88>


Re: ORC file sort order ..

2016-04-08 Thread no jihun
I think there is no explicit sort info in the ORC metadata, because sorting is
just a rule applied at insert time.

The table may be created with SORTED BY; then it must be inserted with sorted
data. Given that, we can check both the table spec and the ORC stats.
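
For the table-spec half, something like this is probably enough (the table
name is assumed):

DESCRIBE FORMATTED my_orc_table;  -- look at the "Bucket Columns" and "Sort Columns" fields
SHOW CREATE TABLE my_orc_table;   -- shows the SORTED BY clause, if any
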
On 9 April 2016 at 09:57, Owen O'Malley wrote:

Use orcfiledump with the -d parameter. It will print the contents of the
orc file. You could also use the file-contents executable from the C++ ORC
reader.

.. Owen

On Fri, Apr 8, 2016 at 5:53 PM, Gautam  wrote:

> Hey,
>
> This might be too obvious a question, but I haven't found a way
> to validate ordering in an ORC file. I need each file to be ordered by a
> column. Is there a sure-shot way of ensuring the sort order in an ORC file
> is as I expect it?
>
> The closest I've come is using the hive --orcfiledump --rowindex
> option, which prints that column's min/max values in the index. But that
> still does not say whether the data within the stripes is sorted.
>
> Cheers,
> -Gautam.
>


DAG Kill while insert into A select * from B

2016-04-06 Thread no jihun
Hi.

I have trouble running the query

"insert into table mylog_orc_bucketing select * from mylog_orc"


Job received Kill while in RUNNING state.
Vertex killed, vertexName=Map 1, vertexId=vertex_1459773554571_0009_10_00, diagnostics=
  Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_KILL,
  failedTasks:0 killedTasks:145, Vertex vertex_1459773554571_0009_10_00 [Map 1]
  killed/failed due to:DAG_KILL

DAG did not succeed due to DAG_KILL. failedVertices:0 killedVertices:1

Actually mylog_orc has 300 million rows, and I think my cluster may not have
enough resources (RAM, CPU, ...).

Then I tried with a small table:
"insert into table mylog_orc_bucketing select * from mylog_orc_mini"
mylog_orc_mini has 300K rows and this works well, with no failures.

So I wonder how I can find the log describing the underlying cause of the
DAG KILL for the first query (something like an OOM), which would tell me
what I should change or scale out/up.

I tested this on Hive 1.2 + Tez 0.7.

Thanks!


hive on tez hadoop-common version problem.

2016-02-15 Thread no jihun
Hi all.

I have a problem with Hive on Tez.
The email thread below is forwarded from what I originally wrote to the Tez users list.

Could someone please check this problem out?
Any advice will be appreciated.

thanks.

-- Forwarded message --
From: no jihun <jees...@gmail.com>
Date: 2016-02-15 17:04 GMT+09:00
Subject: Re: Failing attemption at
org.apache.tez.client.TezClient.waitTillReady
To: u...@tez.apache.org


Thanks Hitesh Shah.

By the way, I am sorry but it is very weird:
I can't find any previous-version Hadoop files that would cause a version
conflict on any node.


I've tried minimal.tar.gz and tez.use.cluster.hadoop-libs, but the result was
the same.

With my setup, the examples from the Tez wiki seem to work well.
$ ~/apps/hadoop/bin/hadoop jar tez-dist/target/tez-0.8.2/tez-tests-0.8.2.jar testorderedwordcount -DUSE_TEZ_SESSION=true /tmp/wc1 /tmp/wc1.r /tmp/wc2 /tmp/wc2.r
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home1/irteam/apps/tez-0.8.2/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home1/irteam/apps/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
.
Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
16/02/15 16:27:51 INFO examples.TestOrderedWordCount: DAG 2 completed. FinalState=SUCCEEDED
16/02/15 16:27:51 INFO examples.TestOrderedWordCount: Shutting down session
16/02/15 16:27:51 INFO client.TezClient: Shutting down Tez Session, sessionName=OrderedWordCountSession, applicationId=application_1452243782005_0323


The method in question seems to have been added in Hadoop 2.4:
org.apache.hadoop.conf.Configuration.addDeprecations([Lorg/apache/hadoop/conf/Configuration$DeprecationDelta;)


The hadoop-common jars:
$ find ./ -name hadoop-common*jar
./apps/tez-0.8.2/lib/hadoop-common-2.6.0.jar
./apps/hadoop-2.6.0/share/hadoop/common/hadoop-common-2.6.0.jar
./apps/hadoop-2.6.0/share/hadoop/common/hadoop-common-2.6.0-tests.jar
./apps/hadoop-2.6.0/share/hadoop/common/sources/hadoop-common-2.6.0-sources.jar
./apps/hadoop-2.6.0/share/hadoop/common/sources/hadoop-common-2.6.0-test-sources.jar
./apps/hadoop-2.6.0/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-common-2.6.0.jar
./apps/hadoop-2.6.0/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/hadoop-common-2.6.0.jar


When I print jinfo of the Hive CLI process, there are no other versions of
Hadoop jars:

$ jinfo 52974
Attaching to process ID 52974, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.60-b23
Java System Properties

java.runtime.name = Java(TM) SE Runtime Environment
java.vm.version = 25.60-b23
sun.boot.library.path = /home1/apps/jdk1.8.0_60/jre/lib/amd64
hadoop.root.logger = INFO,console
java.vendor.url = http://java.oracle.com/
java.vm.vendor = Oracle Corporation
path.separator =
file.encoding.pkg = sun.io
java.vm.name = Java HotSpot(TM) 64-Bit Server VM
sun.os.patch.level = unknown
sun.java.launcher = SUN_STANDARD
user.country = KR
user.dir = /home1/apps/apache-hive-1.2.1-bin/bin
java.vm.specification.name = Java Virtual Machine Specification
java.runtime.version = 1.8.0_60-b27
java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
os.arch = amd64
java.endorsed.dirs = /home1/apps/jdk1.8.0_60/jre/lib/endorsed
line.separator =
java.io.tmpdir = /tmp
hadoop.log.file = hadoop.log
java.vm.specification.vendor = Oracle Corporation
os.name = Linux
hadoop.id.str = irteam
sun.jnu.encoding = UTF-8
java.library.path = /home1/apps/hadoop/lib/native
hadoop.home.dir = /home1/apps/hadoop
java.specification.name = Java Platform API Specification
java.class.version = 52.0
java.net.preferIPv4Stack = true
sun.management.compiler = HotSpot 64-Bit Tiered Compilers
os.version = 2.6.32-504.3.3.el6.x86_64
user.home = /home1
user.timezone = ROK
java.awt.printerjob = sun.print.PSPrinterJob
file.encoding = UTF-8
java.specification.version = 1.8
user.name = irteam
java.class.path = /home1/apps/tez-0.8.2/conf
/home1/apps/tez-0.8.2/tez-history-parser-0.8.2.jar
/home1/apps/tez-0.8.2/tez-yarn-timeline-history-with-acls-0.8.2.jar
/home1/apps/tez-0.8.2/tez-examples-0.8.2.jar
/home1/apps/tez-0.8.2/tez-javadoc-tools-0.8.2.jar
/home1/apps/tez-0.8.2/tez-runtime-internals-0.8.2.jar
/home1/apps/tez-0.8.2/hadoop-shim-2.6-0.8.2.jar
/home1/apps/tez-0.8.2/tez-runtime-library-0.8.2.jar
/home1/apps/tez-0.8.2/tez-job-analyzer-0.8.2.jar
/home1/apps/tez-0.8.2/tez-yarn-timeline-h

RE: Add partition data to an external ORC table.

2016-02-11 Thread no jihun
Actually, the original source is a Flume stream of Avro-formatted rows.
The Flume sink streams to HDFS partition directories.

Data flow:
flume > avro > hdfs sink > daily partition dir

My expected best flow:
flume > orc > hdfs sink > partition dir

Another option:
flume > hdfs sink
then the Hive 'LOAD DATA' command;
this lets Hive load the text into Hive as ORC formatted.

Because a large amount of data has to be processed, the HDFS sink distributes
the load. If I used Flume's Hive sink, the Hive daemon might become a
bottleneck, I think.

There seem to be many cases of changing Avro to ORC. If their previous data
flow was based on Flume + an HDFS sink, I am curious how they did it in detail.
On 12 February 2016 at 04:34, Ryan Harris <ryan.har...@zionsbancorp.com> wrote:

> If your original source is text, why don't you make your ORC-based table a
> hive managed table instead of an external table.
>
> Then you can load/partition your text data into the external table, query
> from that and insert into your ORC-backed Hive managed table.
>
>
>
> Theoretically, if you had your data in ORC files, you could just copy them
> to the external table/partition like you do with the text data, but the
> challenge is, how are you going to create the ORC source data?  You can
> create it with Hive, Pig, custom Java, etc, but **somehow** you are going
> to have to get your data into ORC format.  Hive is probably the easiest
> tool to use to do that.  You could load the data into a hive managed table,
> and then copy the ORC files back to an external table, but why?
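>
> In Hive terms the load-then-insert pattern would look roughly like this; a
> sketch only, with table, column and path names assumed from the earlier
> mail:
>
> -- external staging table over the raw text dropped in by hadoop fs -put
> CREATE EXTERNAL TABLE message_staging (message STRING)
> PARTITIONED BY (date_string STRING)
> STORED AS TEXTFILE
> LOCATION '/message_raw';
>
> ALTER TABLE message_staging ADD PARTITION (date_string='20160212')
> LOCATION '/message_raw/20160212';
>
> -- Hive-managed ORC table that queries actually run against
> CREATE TABLE message_orc (message STRING)
> PARTITIONED BY (date_string STRING)
> STORED AS ORC;
>
> INSERT OVERWRITE TABLE message_orc PARTITION (date_string='20160212')
> SELECT message FROM message_staging WHERE date_string = '20160212';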
>
>
>
> *From:* no jihun [mailto:jees...@gmail.com]
> *Sent:* Thursday, February 11, 2016 11:48 AM
> *To:* user@hive.apache.org
> *Subject:* Add partition data to an external ORC table.
>
>
>
> Hello.
>
> I want to know whether this is possible or not.
>
> There would be a table created by
>
> create external table test (
>   message String)
> PARTITIONED BY (date_string STRING)
> STORED AS ORC
> LOCATION '/message';
>
> With this table I will never add rows with an 'INSERT' statement,
> but instead want to:
> #1. add each day's data to the HDFS partition location directly,
>   e.g. /message/20160212
>   (by $ hadoop fs -put)
> #2. then add the partition every morning:
> ALTER TABLE test
> ADD PARTITION (date_string='20160212')
> LOCATION '/message/20160212';
> #3. query the added data.
>
> With this scenario, how can I prepare the ORC-formatted data in step #1?
> When the stored format is TEXTFILE I just need to copy the raw file to the
> partition directory, but with an ORC table I don't think this is possible
> so easily.
>
> The raw application log is JSON formatted and each day may have 1M JSON rows.
>
> Actually I already do this job on my cluster with a TEXTFILE table, not ORC;
> now I am trying to change the table format.
>
> Any advice would be great.
> Thanks
> --
>


Add partition data to an external ORC table.

2016-02-11 Thread no jihun
Hello.

I want to know whether this is possible or not.

There would be a table created by

create external table test (
  message String)
PARTITIONED BY (date_string STRING)
STORED AS ORC
LOCATION '/message';

With this table I will never add rows with an 'INSERT' statement,
but instead want to:
#1. add each day's data to the HDFS partition location directly,
  e.g. /message/20160212
  (by $ hadoop fs -put)
#2. then add the partition every morning:
ALTER TABLE test
ADD PARTITION (date_string='20160212')
LOCATION '/message/20160212';
#3. query the added data.

With this scenario, how can I prepare the ORC-formatted data in step #1?
When the stored format is TEXTFILE I just need to copy the raw file to the
partition directory, but with an ORC table I don't think this is possible
so easily.

The raw application log is JSON formatted and each day may have 1M JSON rows.

Actually I already do this job on my cluster with a TEXTFILE table, not ORC;
now I am trying to change the table format.

Any advice would be great.
Thanks