[
https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921373#comment-13921373
]
Sushanth Sowmyan commented on HIVE-6332:
----------------------------------------
Before I created a wiki page for this, I wanted to have the content
checked/reviewed. [~leftylev], [~ekoifman], could you please go through the
following and suggest edits/changes? Thanks!
==
HCatalog job properties:
========================
Storage directives:
-------------------
hcat.pig.storer.external.location : An override that specifies where HCatStorer
will write to. It is set from Pig jobs, either directly by the user or by using
org.apache.hive.hcatalog.pig.HCatStorerWrapper. HCat will write to the
specified directory, rather than to the table/partition directory specified
by, or calculable from, the metadata. It is used in lieu of the table directory
for a table-level write (unpartitioned table write), or in lieu of the
partition directory for a partition-level write. This parameter is used only
for non-dynamic-partitioning jobs, because dynamic-partitioning jobs have
multiple write destinations.
hcat.dynamic.partitioning.custom.pattern : For dynamic partitioning jobs,
simply specifying a custom directory is not good enough, since such a job
writes to multiple destinations. Instead of a directory specification, it
requires a pattern specification, which is what this parameter provides. For
example, if one had a table that was partitioned by keys country and state,
with a root directory location of /apps/hive/warehouse/geo/ , then a dynamic
partition write into it that writes partitions (country=US,state=CA) and
(country=IN,state=KA) would create two directories:
/apps/hive/warehouse/geo/country=US/state=CA/ and
/apps/hive/warehouse/geo/country=IN/state=KA/ . If we wanted a differently
patterned location, and specified
hcat.dynamic.partitioning.custom.pattern="/ext/geo/${country}-${state}", it
would create the following two partition dirs: /ext/geo/US-CA and
/ext/geo/IN-KA . Thus, the parameter lets us specify a custom dir location
pattern for all the writes, and HCat will interpolate each variable it sees
when creating a destination location for each partition.
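
The interpolation described above can be sketched in a few lines of Python.
This only illustrates the substitution and is not HCatalog's implementation:
the function names default_partition_dir and custom_partition_dir are
hypothetical, and string.Template is used merely because its ${name} syntax
happens to match the pattern syntax shown above.

```python
from string import Template


def default_partition_dir(table_root, partition):
    """Build the default key=value layout under the table root directory."""
    parts = "/".join(f"{key}={value}" for key, value in partition.items())
    return f"{table_root.rstrip('/')}/{parts}/"


def custom_partition_dir(pattern, partition):
    """Replace each ${key} in the pattern with that partition's value."""
    return Template(pattern).substitute(partition)


# The two partitions from the example above.
partitions = [
    {"country": "US", "state": "CA"},
    {"country": "IN", "state": "KA"},
]
for p in partitions:
    print(default_partition_dir("/apps/hive/warehouse/geo/", p))
    print(custom_partition_dir("/ext/geo/${country}-${state}", p))
```

In this sketch a pattern that names a key missing from the partition raises a
KeyError; how HCatalog itself treats such a pattern is not covered here.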
Cache behaviour directives:
---------------------------
HCatalog maintains a cache of HiveClients to talk to the metastore, keeping
one metastore client per thread, with a default expiry of 120 seconds. For
people who wish to modify the behaviour of this cache, a few parameters are
provided:
hcatalog.hive.client.cache.expiry.time : Allows users to override the expiry
time. This is an int, and specifies a number of seconds. Default is 120.
hcatalog.hive.client.cache.disabled : Allows people to disable the cache
altogether if they wish to. Default is false. This is useful in highly
multithreaded use cases.
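
The cache behaviour described above (one client per thread, a configurable
expiry, and a switch to disable caching) can be sketched as follows. This is
an illustration only, not HCatalog's code: the class ClientCache and its
arguments are hypothetical, and counting the expiry from the moment a client
is created is an assumption of the sketch.

```python
import threading
import time


class ClientCache:
    """Keeps at most one client per thread and rebuilds it after expiry."""

    def __init__(self, factory, expiry_seconds=120, disabled=False,
                 clock=time.monotonic):
        self._factory = factory          # callable that creates a new client
        self._expiry = expiry_seconds    # cf. hcatalog.hive.client.cache.expiry.time
        self._disabled = disabled        # cf. hcatalog.hive.client.cache.disabled
        self._clock = clock              # injectable so expiry can be tested
        self._entries = {}               # thread id -> (client, creation time)
        self._lock = threading.Lock()

    def get(self):
        if self._disabled:
            # Cache switched off: every call creates a fresh client.
            return self._factory()
        key = threading.get_ident()
        now = self._clock()
        with self._lock:
            entry = self._entries.get(key)
            # Assumption: an entry expires expiry_seconds after creation.
            if entry is None or now - entry[1] >= self._expiry:
                entry = (self._factory(), now)
                self._entries[key] = entry
            return entry[0]
```

With the defaults, repeated calls to get() from the same thread return the
same client until 120 seconds have passed; with disabled=True every call
returns a new one.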
Input Split Generation Behaviour:
---------------------------------
hcat.desired.partition.num.splits : A hint that can be provided to HCatalog to
pass on to the underlying InputFormats, to produce a "desired" number of
splits per partition. This is useful when there are a few large files and we
want to increase parallelism by increasing the number of splits generated. It
is not yet useful for reducing the number of splits for a large number of
files, and it is not useful at all when the job reads a large number of
partitions. Note that this is merely an optimization hint, and the underlying
layer is not guaranteed to be able to use it. The mapreduce parameters
mapred.min.split.size and mapred.max.split.size can be used in conjunction
with this parameter to tweak/optimize jobs.
Data Promotion Behaviour:
-------------------------
In some cases a user of HCat (such as some older versions of Pig) does not
support all the datatypes supported by Hive. A few config parameters are
provided to handle data promotions/conversions so that such users can read
data through HCatalog. On the write side, the user is expected to pass in
valid HCatRecords containing correct data.
hcat.data.convert.boolean.to.integer : Promotes boolean to int on read from
HCatalog. Defaults to false.
hcat.data.tiny.small.int.promotion : Promotes tinyint/smallint to int on read
from HCatalog. Defaults to false.
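
The effect of the two promotion flags on the types a reader sees can be
sketched as below. The function names and keyword arguments are hypothetical
stand-ins for the two parameters, and mapping true/false to 1/0 is an
assumption of the sketch rather than something stated above.

```python
def promoted_type(hive_type, convert_boolean=False, promote_small_ints=False):
    """Return the type name a reader would see after optional promotion."""
    # cf. hcat.data.convert.boolean.to.integer
    if convert_boolean and hive_type == "boolean":
        return "int"
    # cf. hcat.data.tiny.small.int.promotion
    if promote_small_ints and hive_type in ("tinyint", "smallint"):
        return "int"
    return hive_type


def promoted_value(value, hive_type, convert_boolean=False):
    """Convert a boolean value to an int when boolean conversion is enabled."""
    if convert_boolean and hive_type == "boolean" and value is not None:
        return int(value)  # assumption: True -> 1, False -> 0
    return value


schema = ["boolean", "tinyint", "smallint", "string"]
print([promoted_type(t) for t in schema])
print([promoted_type(t, convert_boolean=True, promote_small_ints=True)
       for t in schema])
```

With both flags left at their default of false, the schema is passed through
unchanged.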
HCatRecordReader Error Tolerance Behaviour:
-------------------------------------------
While reading, it is understandable that data might contain errors, but we may
not want to completely abort a task due to a couple of errors. These parameters
configure how many errors we can accept before we fail the task.
hcat.input.bad.record.threshold : A float parameter, defaults to 0.0001f, which
means we can tolerate 1 error in every 10,000 rows and still not error out. A
higher error rate fails the task.
hcat.input.bad.record.min : An int parameter, defaults to 2. This is the
minimum number of bad records we must encounter before applying the
hcat.input.bad.record.threshold parameter. It prevents an initial/early bad
record from aborting the task merely because the error ratio at that point is
too high.
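
How the two error-tolerance parameters interact can be sketched as follows.
This is a reading of the description above, not HCatalog's code: the class
BadRecordTracker is hypothetical, and the exact comparisons (at least
min_errors bad records, and an error rate strictly greater than the
threshold, with bad records counted in the total) are assumptions.

```python
class BadRecordTracker:
    """Decides whether a read task has seen too many bad records."""

    def __init__(self, threshold=0.0001, min_errors=2):
        self.threshold = threshold    # cf. hcat.input.bad.record.threshold
        self.min_errors = min_errors  # cf. hcat.input.bad.record.min
        self.records = 0              # all records seen, good or bad
        self.errors = 0               # bad records seen

    def record(self, is_bad=False):
        """Count one record, noting whether it was bad."""
        self.records += 1
        if is_bad:
            self.errors += 1

    def should_fail(self):
        """True once enough errors exist and the error rate is too high."""
        if self.errors < self.min_errors:
            # Too few errors so far: the threshold is not applied yet.
            return False
        return self.errors / self.records > self.threshold


tracker = BadRecordTracker()
tracker.record(is_bad=True)      # a single early bad record
print(tracker.should_fail())     # the minimum of 2 has not been reached
```

The minimum is what keeps the first record of a task, if bad, from producing
an error rate of 100% and aborting the task immediately.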
> HCatConstants Documentation needed
> ----------------------------------
>
> Key: HIVE-6332
> URL: https://issues.apache.org/jira/browse/HIVE-6332
> Project: Hive
> Issue Type: Task
> Reporter: Sushanth Sowmyan
> Assignee: Sushanth Sowmyan
>
> HCatConstants documentation is near non-existent, being defined only as
> comments in code for the various parameters. Given that a lot of api winds up
> being implemented as knobs that can be tweaked here, we should have a public
> facing doc for this.