Re: clustered bucket and tablesample

no jihun Sun, 15 May 2016 04:30:08 -0700

OK Talebzadeh thanks.

Have you ever tried tablesample with string value hash?
something like 'tablesample( bucket 1 out of 256 on
some_field='somedata....')'



I wrote a full scenario.

# table creating
Time taken: 0.155 seconds, Fetched: 36 row(s)
hive> CREATE TABLE `bucket_x` (
    >   `classifier` string)
    > CLUSTERED BY ( classifier)
    > INTO 256 BUCKETS
    > STORED AS ORC;
OK


*# check option before data insert;*
hive> set hive.enforce.bucketing;
hive.enforce.bucketing=true


*# insert 1,000 data*
(also number of distinct value is 1,000)

hive > insert into bucket_x
values 
('cl_0'),('cl_1'),('cl_2'),('cl_3'),('cl_4'),('cl_5'),('cl_6'),('cl_7'),('cl_8'),('cl_9'),('cl_10'),('cl_11'),('cl_12'),('cl_13'),('cl_14'),('cl_15'),('cl_16'),('cl_17'),('cl_18'),('cl_19'),('cl_20'),('cl_21'),('cl_22'),('cl_23'),('cl_24'),('cl_25'),('cl_26'),('cl_27'),('cl_28'),('cl_29'),('cl_30'),('cl_31'),('cl_32'),('cl_33'),('cl_34'),('cl_35'),('cl_36'),('cl_37'),('cl_38'),('cl_39'),('cl_40'),('cl_41'),('cl_42'),('cl_43'),('cl_44'),('cl_45'),('cl_46'),('cl_47'),('cl_48'),('cl_49'),('cl_50'),('cl_51'),('cl_52'),('cl_53'),('cl_54'),('cl_55'),('cl_56'),('cl_57'),('cl_58'),('cl_59'),('cl_60'),('cl_61'),('cl_62'),('cl_63'),('cl_64'),('cl_65'),('cl_66'),('cl_67'),('cl_68'),('cl_69'),('cl_70'),('cl_71'),('cl_72'),('cl_73'),('cl_74'),('cl_75'),('cl_76'),('cl_77'),('cl_78'),('cl_79'),('cl_80'),('cl_81'),('cl_82'),('cl_83'),('cl_84'),('cl_85'),('cl_86'),('cl_87'),('cl_88'),('cl_89'),('cl_90'),('cl_91'),('cl_92'),('cl_93'),('cl_94'),('cl_95'),('cl_96'),('cl_97'),('cl_98'),('cl_99'),('cl_100'),('cl_101'),('cl_102'),('cl_103'),('cl_104'),('cl_105'),('cl_106'),('cl_107'),('cl_108'),('cl_109'),('cl_110'),('cl_111'),('cl_112'),('cl_113'),('cl_114'),('cl_115'),('cl_116'),('cl_117'),('cl_118'),('cl_119'),('cl_120'),('cl_121'),('cl_122'),('cl_123'),('cl_124'),('cl_125'),('cl_126'),('cl_127'),('cl_128'),('cl_129'),('cl_130'),('cl_131'),('cl_132'),('cl_133'),('cl_134'),('cl_135'),('cl_136'),('cl_137'),('cl_138'),('cl_139'),('cl_140'),('cl_141'),('cl_142'),('cl_143'),('cl_144'),('cl_145'),('cl_146'),('cl_147'),('cl_148'),('cl_149'),('cl_150'),('cl_151'),('cl_152'),('cl_153'),('cl_154'),('cl_155'),('cl_156'),('cl_157'),('cl_158'),('cl_159'),('cl_160'),('cl_161'),('cl_162'),('cl_163'),('cl_164'),('cl_165'),('cl_166'),('cl_167'),('cl_168'),('cl_169'),('cl_170'),('cl_171'),('cl_172'),('cl_173'),('cl_174'),('cl_175'),('cl_176'),('cl_177'),('cl_178'),('cl_179'),('cl_180'),('cl_181'),('cl_182'),('cl_183'),('cl_184'),('cl_185'),('cl_186'),('cl_187'),('cl_188'),('cl_189'),('cl_190'),('cl_191'),('cl_192'),('cl_193'),('cl_194'),('cl_195'),('cl_196'),('cl_197'),('cl_198'),('cl_199'),('cl_200'),('cl_201'),('cl_202'),('cl_203'),('cl_204'),('cl_205'),('cl_206'),('cl_207'),('cl_208'),('cl_209'),('cl_210'),('cl_211'),('cl_212'),('cl_213'),('cl_214'),('cl_215'),('cl_216'),('cl_217'),('cl_218'),('cl_219'),('cl_220'),('cl_221'),('cl_222'),('cl_223'),('cl_224'),('cl_225'),('cl_226'),('cl_227'),('cl_228'),('cl_229'),('cl_230'),('cl_231'),('cl_232'),('cl_233'),('cl_234'),('cl_235'),('cl_236'),('cl_237'),('cl_238'),('cl_239'),('cl_240'),('cl_241'),('cl_242'),('cl_243'),('cl_244'),('cl_245'),('cl_246'),('cl_247'),('cl_248'),('cl_249'),('cl_250'),('cl_251'),('cl_252'),('cl_253'),('cl_254'),('cl_255'),('cl_256'),('cl_257'),('cl_258'),('cl_259'),('cl_260'),('cl_261'),('cl_262'),('cl_263'),('cl_264'),('cl_265'),('cl_266'),('cl_267'),('cl_268'),('cl_269'),('cl_270'),('cl_271'),('cl_272'),('cl_273'),('cl_274'),('cl_275'),('cl_276'),('cl_277'),('cl_278'),('cl_279'),('cl_280'),('cl_281'),('cl_282'),('cl_283'),('cl_284'),('cl_285'),('cl_286'),('cl_287'),('cl_288'),('cl_289'),('cl_290'),('cl_291'),('cl_292'),('cl_293'),('cl_294'),('cl_295'),('cl_296'),('cl_297'),('cl_298'),('cl_299'),('cl_300'),('cl_301'),('cl_302'),('cl_303'),('cl_304'),('cl_305'),('cl_306'),('cl_307'),('cl_308'),('cl_309'),('cl_310'),('cl_311'),('cl_312'),('cl_313'),('cl_314'),('cl_315'),('cl_316'),('cl_317'),('cl_318'),('cl_319'),('cl_320'),('cl_321'),('cl_322'),('cl_323'),('cl_324'),('cl_325'),('cl_326'),('cl_327'),('cl_328'),('cl_329'),('cl_330'),('cl_331'),('cl_332'),('cl_333'),('cl_334'),('cl_335'),('cl_336'),('cl_337'),('cl_338'),('cl_339'),('cl_340'),('cl_341'),('cl_342'),('cl_343'),('cl_344'),('cl_345'),('cl_346'),('cl_347'),('cl_348'),('cl_349'),('cl_350'),('cl_351'),('cl_352'),('cl_353'),('cl_354'),('cl_355'),('cl_356'),('cl_357'),('cl_358'),('cl_359'),('cl_360'),('cl_361'),('cl_362'),('cl_363'),('cl_364'),('cl_365'),('cl_366'),('cl_367'),('cl_368'),('cl_369'),('cl_370'),('cl_371'),('cl_372'),('cl_373'),('cl_374'),('cl_375'),('cl_376'),('cl_377'),('cl_378'),('cl_379'),('cl_380'),('cl_381'),('cl_382'),('cl_383'),('cl_384'),('cl_385'),('cl_386'),('cl_387'),('cl_388'),('cl_389'),('cl_390'),('cl_391'),('cl_392'),('cl_393'),('cl_394'),('cl_395'),('cl_396'),('cl_397'),('cl_398'),('cl_399'),('cl_400'),('cl_401'),('cl_402'),('cl_403'),('cl_404'),('cl_405'),('cl_406'),('cl_407'),('cl_408'),('cl_409'),('cl_410'),('cl_411'),('cl_412'),('cl_413'),('cl_414'),('cl_415'),('cl_416'),('cl_417'),('cl_418'),('cl_419'),('cl_420'),('cl_421'),('cl_422'),('cl_423'),('cl_424'),('cl_425'),('cl_426'),('cl_427'),('cl_428'),('cl_429'),('cl_430'),('cl_431'),('cl_432'),('cl_433'),('cl_434'),('cl_435'),('cl_436'),('cl_437'),('cl_438'),('cl_439'),('cl_440'),('cl_441'),('cl_442'),('cl_443'),('cl_444'),('cl_445'),('cl_446'),('cl_447'),('cl_448'),('cl_449'),('cl_450'),('cl_451'),('cl_452'),('cl_453'),('cl_454'),('cl_455'),('cl_456'),('cl_457'),('cl_458'),('cl_459'),('cl_460'),('cl_461'),('cl_462'),('cl_463'),('cl_464'),('cl_465'),('cl_466'),('cl_467'),('cl_468'),('cl_469'),('cl_470'),('cl_471'),('cl_472'),('cl_473'),('cl_474'),('cl_475'),('cl_476'),('cl_477'),('cl_478'),('cl_479'),('cl_480'),('cl_481'),('cl_482'),('cl_483'),('cl_484'),('cl_485'),('cl_486'),('cl_487'),('cl_488'),('cl_489'),('cl_490'),('cl_491'),('cl_492'),('cl_493'),('cl_494'),('cl_495'),('cl_496'),('cl_497'),('cl_498'),('cl_499'),('cl_500'),('cl_501'),('cl_502'),('cl_503'),('cl_504'),('cl_505'),('cl_506'),('cl_507'),('cl_508'),('cl_509'),('cl_510'),('cl_511'),('cl_512'),('cl_513'),('cl_514'),('cl_515'),('cl_516'),('cl_517'),('cl_518'),('cl_519'),('cl_520'),('cl_521'),('cl_522'),('cl_523'),('cl_524'),('cl_525'),('cl_526'),('cl_527'),('cl_528'),('cl_529'),('cl_530'),('cl_531'),('cl_532'),('cl_533'),('cl_534'),('cl_535'),('cl_536'),('cl_537'),('cl_538'),('cl_539'),('cl_540'),('cl_541'),('cl_542'),('cl_543'),('cl_544'),('cl_545'),('cl_546'),('cl_547'),('cl_548'),('cl_549'),('cl_550'),('cl_551'),('cl_552'),('cl_553'),('cl_554'),('cl_555'),('cl_556'),('cl_557'),('cl_558'),('cl_559'),('cl_560'),('cl_561'),('cl_562'),('cl_563'),('cl_564'),('cl_565'),('cl_566'),('cl_567'),('cl_568'),('cl_569'),('cl_570'),('cl_571'),('cl_572'),('cl_573'),('cl_574'),('cl_575'),('cl_576'),('cl_577'),('cl_578'),('cl_579'),('cl_580'),('cl_581'),('cl_582'),('cl_583'),('cl_584'),('cl_585'),('cl_586'),('cl_587'),('cl_588'),('cl_589'),('cl_590'),('cl_591'),('cl_592'),('cl_593'),('cl_594'),('cl_595'),('cl_596'),('cl_597'),('cl_598'),('cl_599'),('cl_600'),('cl_601'),('cl_602'),('cl_603'),('cl_604'),('cl_605'),('cl_606'),('cl_607'),('cl_608'),('cl_609'),('cl_610'),('cl_611'),('cl_612'),('cl_613'),('cl_614'),('cl_615'),('cl_616'),('cl_617'),('cl_618'),('cl_619'),('cl_620'),('cl_621'),('cl_622'),('cl_623'),('cl_624'),('cl_625'),('cl_626'),('cl_627'),('cl_628'),('cl_629'),('cl_630'),('cl_631'),('cl_632'),('cl_633'),('cl_634'),('cl_635'),('cl_636'),('cl_637'),('cl_638'),('cl_639'),('cl_640'),('cl_641'),('cl_642'),('cl_643'),('cl_644'),('cl_645'),('cl_646'),('cl_647'),('cl_648'),('cl_649'),('cl_650'),('cl_651'),('cl_652'),('cl_653'),('cl_654'),('cl_655'),('cl_656'),('cl_657'),('cl_658'),('cl_659'),('cl_660'),('cl_661'),('cl_662'),('cl_663'),('cl_664'),('cl_665'),('cl_666'),('cl_667'),('cl_668'),('cl_669'),('cl_670'),('cl_671'),('cl_672'),('cl_673'),('cl_674'),('cl_675'),('cl_676'),('cl_677'),('cl_678'),('cl_679'),('cl_680'),('cl_681'),('cl_682'),('cl_683'),('cl_684'),('cl_685'),('cl_686'),('cl_687'),('cl_688'),('cl_689'),('cl_690'),('cl_691'),('cl_692'),('cl_693'),('cl_694'),('cl_695'),('cl_696'),('cl_697'),('cl_698'),('cl_699'),('cl_700'),('cl_701'),('cl_702'),('cl_703'),('cl_704'),('cl_705'),('cl_706'),('cl_707'),('cl_708'),('cl_709'),('cl_710'),('cl_711'),('cl_712'),('cl_713'),('cl_714'),('cl_715'),('cl_716'),('cl_717'),('cl_718'),('cl_719'),('cl_720'),('cl_721'),('cl_722'),('cl_723'),('cl_724'),('cl_725'),('cl_726'),('cl_727'),('cl_728'),('cl_729'),('cl_730'),('cl_731'),('cl_732'),('cl_733'),('cl_734'),('cl_735'),('cl_736'),('cl_737'),('cl_738'),('cl_739'),('cl_740'),('cl_741'),('cl_742'),('cl_743'),('cl_744'),('cl_745'),('cl_746'),('cl_747'),('cl_748'),('cl_749'),('cl_750'),('cl_751'),('cl_752'),('cl_753'),('cl_754'),('cl_755'),('cl_756'),('cl_757'),('cl_758'),('cl_759'),('cl_760'),('cl_761'),('cl_762'),('cl_763'),('cl_764'),('cl_765'),('cl_766'),('cl_767'),('cl_768'),('cl_769'),('cl_770'),('cl_771'),('cl_772'),('cl_773'),('cl_774'),('cl_775'),('cl_776'),('cl_777'),('cl_778'),('cl_779'),('cl_780'),('cl_781'),('cl_782'),('cl_783'),('cl_784'),('cl_785'),('cl_786'),('cl_787'),('cl_788'),('cl_789'),('cl_790'),('cl_791'),('cl_792'),('cl_793'),('cl_794'),('cl_795'),('cl_796'),('cl_797'),('cl_798'),('cl_799'),('cl_800'),('cl_801'),('cl_802'),('cl_803'),('cl_804'),('cl_805'),('cl_806'),('cl_807'),('cl_808'),('cl_809'),('cl_810'),('cl_811'),('cl_812'),('cl_813'),('cl_814'),('cl_815'),('cl_816'),('cl_817'),('cl_818'),('cl_819'),('cl_820'),('cl_821'),('cl_822'),('cl_823'),('cl_824'),('cl_825'),('cl_826'),('cl_827'),('cl_828'),('cl_829'),('cl_830'),('cl_831'),('cl_832'),('cl_833'),('cl_834'),('cl_835'),('cl_836'),('cl_837'),('cl_838'),('cl_839'),('cl_840'),('cl_841'),('cl_842'),('cl_843'),('cl_844'),('cl_845'),('cl_846'),('cl_847'),('cl_848'),('cl_849'),('cl_850'),('cl_851'),('cl_852'),('cl_853'),('cl_854'),('cl_855'),('cl_856'),('cl_857'),('cl_858'),('cl_859'),('cl_860'),('cl_861'),('cl_862'),('cl_863'),('cl_864'),('cl_865'),('cl_866'),('cl_867'),('cl_868'),('cl_869'),('cl_870'),('cl_871'),('cl_872'),('cl_873'),('cl_874'),('cl_875'),('cl_876'),('cl_877'),('cl_878'),('cl_879'),('cl_880'),('cl_881'),('cl_882'),('cl_883'),('cl_884'),('cl_885'),('cl_886'),('cl_887'),('cl_888'),('cl_889'),('cl_890'),('cl_891'),('cl_892'),('cl_893'),('cl_894'),('cl_895'),('cl_896'),('cl_897'),('cl_898'),('cl_899'),('cl_900'),('cl_901'),('cl_902'),('cl_903'),('cl_904'),('cl_905'),('cl_906'),('cl_907'),('cl_908'),('cl_909'),('cl_910'),('cl_911'),('cl_912'),('cl_913'),('cl_914'),('cl_915'),('cl_916'),('cl_917'),('cl_918'),('cl_919'),('cl_920'),('cl_921'),('cl_922'),('cl_923'),('cl_924'),('cl_925'),('cl_926'),('cl_927'),('cl_928'),('cl_929'),('cl_930'),('cl_931'),('cl_932'),('cl_933'),('cl_934'),('cl_935'),('cl_936'),('cl_937'),('cl_938'),('cl_939'),('cl_940'),('cl_941'),('cl_942'),('cl_943'),('cl_944'),('cl_945'),('cl_946'),('cl_947'),('cl_948'),('cl_949'),('cl_950'),('cl_951'),('cl_952'),('cl_953'),('cl_954'),('cl_955'),('cl_956'),('cl_957'),('cl_958'),('cl_959'),('cl_960'),('cl_961'),('cl_962'),('cl_963'),('cl_964'),('cl_965'),('cl_966'),('cl_967'),('cl_968'),('cl_969'),('cl_970'),('cl_971'),('cl_972'),('cl_973'),('cl_974'),('cl_975'),('cl_976'),('cl_977'),('cl_978'),('cl_979'),('cl_980'),('cl_981'),('cl_982'),('cl_983'),('cl_984'),('cl_985'),('cl_986'),('cl_987'),('cl_988'),('cl_989'),('cl_990'),('cl_991'),('cl_992'),('cl_993'),('cl_994'),('cl_995'),('cl_996'),('cl_997'),('cl_998'),('cl_999')


# check for total count
*hive> select count(*) from bucket_x;*
OK
*1000 // <-- returned 1. correct.*

# check for count specific classifier (without tablesample)
*hive> select count(*) from bucket_x where classifier='cl_900';*
Query ID = irteam_20160515201754_d381aff8-16ef-48be-b829-f1a01a530521
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id
application_1462971998082_0025)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
 KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      2          2        0        0       1
    0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0
    0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 7.63 s

--------------------------------------------------------------------------------
OK
*1 // <-- returned 1. correct.*
Time taken: 8.064 seconds, Fetched: 1 row(s)


# count specific classifier (with tablesample)

*hive> select count(*) from bucket_x tablesample(bucket 1 out of 256 on
classifier='cl_900') where classifier='cl_900';*
Query ID = irteam_20160515201913_91166686-b98c-40a4-990b-690c41c69c61
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id
application_1462971998082_0025)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
 KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0
    0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0
    0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.81 s

--------------------------------------------------------------------------------
OK
*0 // <--- it returns Zero. WRONG! it should return 1 row.*
Time taken: 4.216 seconds, Fetched: 1 row(s)

# so I checked whole tablesampled data.
*hive> select * from bucket_x tablesample(bucket 1 out of 256 on
classifier='cl_900');*
OK
cl_974
cl_336
cl_457
...
cl_852
cl_698
cl_731
Time taken: 0.053 seconds, Fetched: 999 row(s)
*// <-- it returned 999 ROWS *
*// ??? 999 ???*
*// exactly except what I want to search*.


I think I am doing totally wrong tablesample in case of string value.
Any idea?

2016-05-15 2:56 GMT+09:00 Mich Talebzadeh <[email protected]>:

> This is your code
>
> SELECT COUNT(*) FROM X
> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
> WHERE action_id='aaa' AND classifier='bbb'
>
> Well I have a table dummy with 1 billion rows imported from Oracle as ORC
> format
>
> hive> show create table dummy;
> OK
> CREATE TABLE `dummy`(
>   `id` int,
>   `clustered` int,
>   `scattered` int,
>   `randomised` int,
>   `random_string` varchar(50),
>   `small_vc` varchar(10),
>   `padding` varchar(10))
>
>
>
> *CLUSTERED BY (  id)INTO 256 BUCKETS*ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> LOCATION
>   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/dummy'
> TBLPROPERTIES (
>   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
>   'numFiles'='1',
>   'numRows'='100000000',
>   'orc.bloom.filter.columns'='ID',
>   'orc.bloom.filter.fpp'='0.05',
>   'orc.compress'='SNAPPY',
>   'orc.create.index'='true',
>   'orc.row.index.stride'='10000',
>   'orc.stripe.size'='16777216',
>   'rawDataSize'='0',
>   'totalSize'='5662644579',
>   'transient_lastDdlTime'='1463245925')
>
>
> If I turn on the plan for the following two cases. First a simple case
>
> hive>
>
> *EXPLAIN SELECT COUNT(1) FROM dummy    >  where id = 20;*OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       Edges:
>         Reducer 2 <- Map 1 (GROUP, 1)
>       DagName: hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:7
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: dummy
>                   Statistics: Num rows: 100000000 Data size: 5662644736
> Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>
> *                   predicate: (id = 20) (type: boolean)  *
> Statistics: Num rows: 50000000 Data size: 2831322368 Basic stats: COMPLETE
> Column stats: NONE
>                     Select Operator
>                       Statistics: Num rows: 50000000 Data size: 2831322368
> Basic stats: COMPLETE Column stats: NONE
>                       Group By Operator
>                         aggregations: count(1)
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           sort order:
>                           Statistics: Num rows: 1 Data size: 8 Basic
> stats: COMPLETE Column stats: NONE
>                           value expressions: _col0 (type: bigint)
>         Reducer 2
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: NONE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>                   table:
>                       input format:
> org.apache.hadoop.mapred.TextInputFormat
>                       output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> Time taken: 0.064 seconds, Fetched: 51 row(s)
>
>
> Now we try with tablesample but  assigning predicate values inside the
> bracket as below because you are looking in bucket 1 for those values and
> you want optimizer to know that.
>
>
>
> *hive>  EXPLAIN SELECT COUNT(1) FROM dummy    > TABLESAMPLE (BUCKET 1 OUT
> OF 256 ON ID = 10)*    > ;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       Edges:
>         Reducer 2 <- Map 1 (GROUP, 1)
>       DagName:
> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:10
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: dummy
>                   Statistics: Num rows: 100000000 Data size: 5662644736
> Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>
> *                   predicate: (((hash((id = 10)) & 2147483647) % 256) =
> 0) (type: boolean)*                    Statistics: Num rows: 50000000
> Data size: 2831322368 Basic stats: COMPLETE Column stats: NONE
>                     Select Operator
>                       Statistics: Num rows: 50000000 Data size: 2831322368
> Basic stats: COMPLETE Column stats: NONE
>                       Group By Operator
>                         aggregations: count(1)
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           sort order:
>                           Statistics: Num rows: 1 Data size: 8 Basic
> stats: COMPLETE Column stats: NONE
>                           value expressions: _col0 (type: bigint)
>         Reducer 2
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: NONE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>                   table:
>                       input format:
> org.apache.hadoop.mapred.TextInputFormat
>                       output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
>
>
> Otherwise I don't see much happening
>
> hive>  EXPLAIN SELECT COUNT(1) FROM dummy
>     > TABLESAMPLE (BUCKET 1 OUT OF 256 ON ID)
>     > WHERE ID = 10;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       Edges:
>         Reducer 2 <- Map 1 (GROUP, 1)
>       DagName:
> hduser_20160514181029_3deafc4c-30da-4113-a425-d8db8f63daeb:13
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: dummy
>                   Statistics: Num rows: 100000000 Data size: 5662644736
> Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (false and (id = 10)) (type: boolean)
>                     Statistics: Num rows: 25000000 Data size: 1415661184
> Basic stats: COMPLETE Column stats: NONE
>                     Select Operator
>                       Statistics: Num rows: 25000000 Data size: 1415661184
> Basic stats: COMPLETE Column stats: NONE
>                       Group By Operator
>                         aggregations: count(1)
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           sort order:
>                           Statistics: Num rows: 1 Data size: 8 Basic
> stats: COMPLETE Column stats: NONE
>                           value expressions: _col0 (type: bigint)
>         Reducer 2
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: NONE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: NONE
>                   table:
>                       input format:
> org.apache.hadoop.mapred.TextInputFormat
>                       output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
>
> In general in my experience bucketing in ORC is the only area where ORC
> tables come handy.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 May 2016 at 13:38, no jihun <[email protected]> wrote:
>
>> ah, as i mentioned
>> both field type of action_id and classifier is STRING. and I can not
>> change the type.
>>
>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>> STORED AS ORC
>>
>> I use two fields for hash then bucketing because each one field is not so
>> well distributed.
>>
>> my concern is not about the strong hash source but about How can I
>> tablesample to the a bucket by field value what provided by 'where clause'
>>
>> when I clustered by string fields which one is right for tablesample?
>> 1. provide fields
>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>
>> 2. provide values
>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  'aaa', 'bbb')
>> 2016. 5. 14. 오후 8:48에 "Mich Talebzadeh" <[email protected]>님이 작성:
>>
>> Is action_id can be created as a numeric column:
>>>
>>> CREATE TABLE X ( action_id bigint,  ..)
>>>
>>> Bucketing or hash partitioning best works on numeric columns with high
>>> cardinality (say a primary key).
>>>
>>> From my old notes:
>>>
>>> Bucketing in Hive refers to hash partitioning where a hashing function
>>> is applied. Likewise an RDBMS like Oracle, Hive will apply a linear hashing
>>> algorithm to prevent data from clustering within specific partitions.
>>> Hashing is very effective if the column selected for bucketing has very
>>> high selectivity like an ID column where selectivity (select
>>> count(distinct(column))/count(column) ) = 1.  In this case, the created
>>> partitions/ files will be as evenly sized as possible. In a nutshell
>>> bucketing is a method to get data evenly distributed over many
>>> partitions/files.  One should define the number of buckets by a power of
>>> two -- 2^n,  like 2, 4, 8, 16 etc to achieve best results. Again bucketing
>>> will help concurrency in Hive. It may even allow a partition wise join i.e.
>>> a join between two tables that are bucketed on the same column with the
>>> same number of buckets (anyone has tried this?)
>>>
>>>
>>>
>>> One more things. When one defines the number of buckets at table
>>> creation level in Hive, the number of partitions/files will be fixed. In
>>> contrast, with partitioning you do not have this limitation.
>>>
>>> can you do
>>>
>>> show create table X
>>>
>>> and send the output. please.
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 May 2016 at 12:23, no jihun <[email protected]> wrote:
>>>
>>>> Hello.
>>>>
>>>> I want to ask the correct bucketing and tablesample way.
>>>>
>>>> There is a table X which I created by
>>>>
>>>> CREATE TABLE `X`(`action_id` string,`classifier` string)
>>>> CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
>>>> STORED AS ORC
>>>>
>>>> Then I inserted 500M of rows into X by
>>>>
>>>> set hive.enforce.bucketing=true;
>>>> INSERT OVERWRITE INTO X SELECT * FROM X_RAW
>>>>
>>>> Then I want to count or search some rows with condition. roughly,
>>>>
>>>> SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
>>>>
>>>> But I'd better to USE tablesample as I clustered X (action_id,
>>>> classifier). So, the better query will be
>>>>
>>>> SELECT COUNT(*) FROM X
>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>> WHERE action_id='aaa' AND classifier='bbb'
>>>>
>>>> Is there any wrong above? But I can't not find any performance gain
>>>> between these two query.
>>>>
>>>> query1 and RESULT( with no tablesample.)
>>>>
>>>> SELECT COUNT(*)) from X
>>>> WHERE action_id='aaa' and classifier='bbb'
>>>>
>>>> ------------------------------------------------------------
>>>> --------------------
>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>>> FAILED  KILLED
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> Map 1 ..........   SUCCEEDED    256        256        0        0
>>>> 0       0
>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>>>> 0       0
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>>> 15.35 s
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> It scans full data.
>>>>
>>>> query 2 and RESULT
>>>>
>>>> SELECT COUNT(*)) from X
>>>> TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
>>>> WHERE action_id='aaa' and classifier='bbb'
>>>>
>>>> ------------------------------------------------------------
>>>> --------------------
>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>>> FAILED  KILLED
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> Map 1 ..........   SUCCEEDED    256        256        0        0
>>>> 0       0
>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>>>> 0       0
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>>> 15.82     s
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> It ALSO scans full data.
>>>>
>>>> query 2 RESULT WHAT I EXPECTED.
>>>>
>>>> Result what I expected is something like...
>>>> (use 1 map and relatively faster than without tabmesample)
>>>> ------------------------------------------------------------
>>>> --------------------
>>>>         VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING
>>>> FAILED  KILLED
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> Map 1 ..........   SUCCEEDED      1          1        0        0
>>>> 0       0
>>>> Reducer 2 ......   SUCCEEDED      1          1        0        0
>>>> 0       0
>>>> ------------------------------------------------------------
>>>> --------------------
>>>> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME:
>>>> 3.xx     s
>>>> ------------------------------------------------------------
>>>> --------------------
>>>>
>>>> Values of action_id and classifier are well distributed and there is no
>>>> skewed data.
>>>>
>>>> So I want to ask you what will be a correct query that prune and target
>>>> specific bucket by multiple column?
>>>>
>>>
>>>
>


-- 
----------------------------------------------
Jihun No ( 노지훈 )
----------------------------------------------
Twitter          : @nozisim
Facebook       : nozisim
Website         : http://jeesim2.godohosting.com
---------------------------------------------------------------------------------
Market Apps   : android market products.
<https://market.android.com/developer?pub=%EB%85%B8%EC%A7%80%ED%9B%88>

Re: clustered bucket and tablesample

Reply via email to