[
https://issues.apache.org/jira/browse/IMPALA-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757633#comment-16757633
]
ASF subversion and git services commented on IMPALA-7540:
-
Commit f20a03a7b1bc2a9bb6cd8b54b8afb9ce384538f1 in impala's branch
refs/heads/master from Todd Lipcon
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f20a03a ]
IMPALA-7540. Intern most repetitive strings and network addresses in catalog
This adds interning to a bunch of repeated strings in catalog objects,
including:
- table name
- DB name
- owner
- column names
- input/output formats
- parameter keys
- common parameter values ("true", "false", etc.)
- HBase column family names
Additionally, it interns TNetworkAddresses, so that each datanode host
is only stored once rather than having its own copy in each table.
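As a rough illustration of the technique described above (not the code from this commit), interning can be sketched as a pool that hands back one canonical instance per distinct value. The names `SimpleInterner` and `HostPort` below are hypothetical; `HostPort` only stands in for a host/port pair such as TNetworkAddress, and the port number is made up. The real patch may use a different mechanism, for example a weak-reference interner so that unused entries can be garbage collected.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper for illustration only; not a class from the Impala source. */
final class SimpleInterner<T> {
    // Maps each distinct value to its canonical instance.
    // Assumption: strong references are acceptable here; entries are never evicted.
    private final ConcurrentHashMap<T, T> pool = new ConcurrentHashMap<>();

    /** Returns the canonical instance equal to value, registering value if none exists yet. */
    T intern(T value) {
        if (value == null) {
            return null;
        }
        T existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    /** Number of distinct values currently held. */
    int size() {
        return pool.size();
    }
}

/** Hypothetical immutable host/port pair standing in for a network address. */
final class HostPort {
    final String host;
    final int port;

    HostPort(String host, int port) {
        this.host = host;
        this.port = port;
    }

    // Interning relies on value equality, so equals and hashCode must agree.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof HostPort)) return false;
        HostPort other = (HostPort) o;
        return port == other.port && Objects.equals(host, other.host);
    }

    @Override
    public int hashCode() {
        return Objects.hash(host, port);
    }
}
```

With a pool like this, two tables that each build their own `new HostPort("127.0.0.1", 50010)` end up sharing a single object after calling `intern`, which is the effect the "duplicate strings" report measures. Interned objects must be treated as immutable, since every holder sees the same instance.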
I verified this patch using jxray on the development catalogd and
impalad. The following lines are removed entirely from the "duplicate
strings" report:
Overhead       # char[]s  # objects  Value
164K (0.3%)    2,635      2,635      "127.0.0.1"
97K (0.2%)     1,038      1,038      "__HIVE_DEFAULT_PARTITION__"
95K (0.2%)     1,111      1,111      "transient_lastDdlTime"
92K (0.1%)     1,975      1,975      "d"
70K (0.1%)     997        997        "EXTERNAL_TABLE"
56K (< 0.1%)   1,201      1,201      "todd"
54K (< 0.1%)   998        998        "EXTERNAL"
46K (< 0.1%)   998        998        "TRUE"
44K (< 0.1%)   567        567        "numFilesErasureCoded"
38K (< 0.1%)   612        612        "totalSize"
30K (< 0.1%)   567        567        "numFiles"
The following are reduced substantially:
Before: 72K (0.1%)     1,543      1,543      "1"
After:  47K (< 0.1%)   1,009      1,009      "1"
A few large strings remain in the report that may be worth addressing, depending
on whether we think production catalogs exhibit the same repetitions:
1) Avro schemas, e.g.:
204K (0.3%) 3 3 "{"fields": [{"type": ["boolean", "null"],
"name": "bool_col1"}, {"type": ["int", "null"], "name": "tinyint_col1"},
{"type": ...[length 52429]"
(in the development catalog there are multiple tables with the same Avro
schema)
2) Partition location suffixes, e.g.:
144K (0.2%)    1,234      1,234      "many_blocks_num_blocks_per_partition_1"
17K (< 0.1%)   230        230        "year=2009/month=2"
17K (< 0.1%)   230        230        "year=2009/month=3"
17K (< 0.1%)   230        230        "year=2009/month=1"
(in the development catalog lots of tables have the same partitioning
layout)
3) Unsure (jxray isn't reporting the reference chain, but seems likely
to be partition values):
49K (< 0.1%)   1,058      1,058      "2010"
28K (< 0.1%)   612        612        "2009"
27K (< 0.1%)   585        585        "0"
22K (< 0.1%)   71         899        ""
Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26
Reviewed-on: http://gerrit.cloudera.org:8080/11158
Reviewed-by: Impala Public Jenkins
Tested-by: Impala Public Jenkins
> Intern common strings in catalog
>
>
> Key: IMPALA-7540
> URL: https://issues.apache.org/jira/browse/IMPALA-7540
> Project: IMPALA
> Issue Type: Bug
> Affects Versions: Impala 3.1.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Major
>
> Using jxray shows that there are many common duplicate strings in the
> catalog. For example, each table repeats the database name, and metadata like
> the HMS parameter maps reuse a lot of common strings like "EXTERNAL" or
> "transient_lastDdlTime". We should intern these to save memory.