[
https://issues.apache.org/jira/browse/PHOENIX-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823287#comment-15823287
]
Josh Elser commented on PHOENIX-3218:
-------------------------------------
{noformat}
especially as it affects the underlying HBase row keys
{noformat}
(look-back) "Apache HBase" for the first reference, please.
{noformat}
If you do lots of random gets, make sure you use SSDs, or that your working set
fits into RAM (either OS cache or the HBase block cache), or performance will
be truly terrible
{noformat}
I feel like this is one of those statements that will end up causing a lot of
"FUD" to undo. HBase is *still* performing (essentially) {{log(n)}} lookups to
find the key, which is "fast". I feel like this would just promote the
impression that Phoenix doesn't work well for random reads without SSDs or
high-memory systems, which is wrong. Could you re-word this as a
recommendation that random-read workloads benefit from fast disks/large block
caches more than sequential-read workloads do?
{noformat}
* Use multiple indexes to provide fast access to common queries.
* Create global indexes. This will affect write speed depending on the number
of columns included in an index because each index writes to its own separate
table.
{noformat}
Switch the ordering here to make more sense. Point 1 is that indexes are good
and that you should use them. Point 2 is that sometimes having multiple indexes
is a good thing.
{noformat}
* When specifying machines for HBase, do not skimp on cores; HBase needs
them.
{noformat}
How can this be made into a more concrete recommendation? Do you have any
recommendations to make WRT types of disk and amount of memory available?
{noformat}
* Create additional indexes to support common query patterns, including all
fields that need to be retrieved.
{noformat}
A bit duplicative of the above section. Perhaps reword this to focus on
ensuring indexes exist for heavily-accessed columns that aren't already part of
the primary key constraint? That isn't entirely correct either, but maybe
closer.
{noformat}
if a region server goes down
{noformat}
Recommend: s/goes down/fails/
{noformat}
Set the `UPDATE_CACHE_FREQUENCY`
[option](http://phoenix.apache.org/language/index.html#options) to 15 minutes
or so if your metadata doesn't change very often
{noformat}
Don't guess; make a concrete recommendation. If 15 minutes isn't a good
recommendation, let's come up with a good number. Should "metadata" be "table
schema"? Does it also include table properties (such as IMMUTABLE_ROWS)?
{noformat}
On AWS, you'll need to manually start the job
{noformat}
Why the mention of "On AWS"? This is the same for an on-prem cluster with async
indexes, no?
{noformat}
facilitates skip-scanning
{noformat}
Link to the docs on skip-scans.
{noformat}
For example, if you need indexes to stay in sync with data tables even if
machines go down and writes fail, then you should consider your data
transactional
{noformat}
This is misleading as Phoenix maintains referential integrity in the face of RS
failure. A better use-case for transactions would be for cross-row updates to a
data-table.
{noformat}
* Schema Design
* Indexes
* Explain Plans and Hints
* Queries
{noformat}
Links to the documentation?
{noformat}
Each row has a key, a byte-array by which rows in HBase are sorted to make
queries faster. All table accesses are via the row key (the table's primary key)
{noformat}
It would be better to be very explicit in the use of terminology to avoid
confusion. e.g. "An HBase row is a collection of many key-value pairs in which
the rowkey attribute of the keys are equal. Data in an HBase table is sorted by
the rowkey." Also s/row key/rowkey/.
{noformat}
If some columns are accessed more frequently than others, use column families
to separate the frequently-accessed columns from rarely-accessed columns. This
improves performance because HBase reads only the column families specified in
the query.
{noformat}
Link to the docs on how to do this.
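Along with the link, a tiny example might save readers a trip to the docs. A sketch with hypothetical table/column names, using Phoenix's {{family.column}} syntax to put hot and cold columns in separate families:

```sql
-- Frequently-accessed column in family 'a', rarely-accessed in 'b':
CREATE TABLE metrics (
    host VARCHAR PRIMARY KEY,
    a.last_value DECIMAL,    -- read on every query
    b.raw_payload VARCHAR    -- read only occasionally
);

-- This query only has to read column family 'a' from disk:
SELECT last_value FROM metrics WHERE host = 'db1';
```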
{noformat}
stores a copy of some or all of the data in the main table
{noformat}
Stores a pivoted copy
{noformat}
See also:
https://phoenix.apache.org/secondary_indexing.html
{noformat}
Linkify
{noformat}
don't require you to change your queries at all—they just make them run faster
{noformat}
Suggest: "don't require change to existing queries -- queries simply run faster"
{noformat}
The sweet spot is generally a handful of secondary indexes
{noformat}
Suggest removing the colloquialism; it doesn't match the tone of the rest of
the document.
{noformat}
Depending on your needs, consider creating *covered* indexes or *functional*
indexes, or both.
{noformat}
Link to docs on covered and functional indexes, please
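In addition to the links, short examples would make the distinction concrete. A sketch with hypothetical table/column names:

```sql
-- Covered index: 'total' is stored in the index itself, so the
-- query below is answered without touching the data table.
CREATE INDEX idx_cust ON orders (customer_id) INCLUDE (total);
SELECT total FROM orders WHERE customer_id = 42;

-- Functional index: indexes the result of an expression.
CREATE INDEX idx_upper_name ON customers (UPPER(name));
SELECT * FROM customers WHERE UPPER(name) = 'ALICE';
```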
{noformat}
If you regularly scan large data sets from spinning disk, you're best off with
GZIP (but watch write speed)
{noformat}
Numbers/reference-material to back this up?
{noformat}
For Gets it is quite important to have your data set cached, and you should use
the HBase block cache.
{noformat}
Flip-flopping between "scans"/"gets" and "range queries"/"point lookups". I
think using the latter terminology universally would be better.
{noformat}
When using `UPSERT` to write a large number of records, turn off autocommit and
batch records. Start with a batch size of 1000 and adjust as needed. Here's
some pseudocode showing one way to commit records in batches:
{noformat}
Recommend putting a caveat here that Phoenix's use of {{commit()}} to control
batches of data written to HBase is "non-standard" in terms of JDBC. The
{{executeBatch()}} API calls would be the standard way to batch updates to the
database for other JDBC drivers. Would recommend that we at least acknowledge
that Phoenix is doing it "differently".
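For the pseudocode itself, showing the Phoenix-idiomatic pattern next to a pointer at the standard JDBC one would make the contrast clear. An untested Java sketch, assuming an open Phoenix {{Connection}} named {{conn}} and a made-up {{my_table(id, val)}} schema:

```java
// Phoenix-idiomatic batching: autocommit off, commit() flushes a batch.
conn.setAutoCommit(false);
try (PreparedStatement ps =
         conn.prepareStatement("UPSERT INTO my_table VALUES (?, ?)")) {
    int pending = 0;
    for (Record r : records) {          // 'records' / 'Record' are stand-ins
        ps.setLong(1, r.id);
        ps.setString(2, r.val);
        ps.executeUpdate();             // buffered client-side, not yet sent
        if (++pending % 1000 == 0) {
            conn.commit();              // sends the batch of 1000 to HBase
        }
    }
    conn.commit();                      // flush the remainder
}

// For comparison, the standard JDBC idiom other drivers expect:
//   ps.addBatch(); ... ps.executeBatch();
```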
{noformat}
Otherwise, replication triples the cost of each write
{noformat}
This is inaccurate, misleading at best. Whether the data is written to a
secondary index or a local index, the underlying data is *still* stored at the
configured HDFS replication rate (3x by default). The performance gain is that
the RegionServer is not updating another Region for the data-table update.
{noformat}
When deleting a large data set, turn on autoCommit before issuing the `DELETE`
query so that the client does not need to remember the row keys of all the keys
as they are deleted.
{noformat}
The reasoning behind this one isn't clear to me. Batching DELETEs would have
the same benefit as batching UPSERTs, no? (I may just be missing an
implementation detail here.)
The explain section is *fantastic*. Great job there. Overall, this is a very
nice write-up you've put together, [~pconrad]! I think with a little bit of
tweaking, this will be an often-referenced document.
> First draft of Phoenix Tuning Guide
> -----------------------------------
>
> Key: PHOENIX-3218
> URL: https://issues.apache.org/jira/browse/PHOENIX-3218
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Peter Conrad
> Attachments: Phoenix-Tuning-Guide-20170110.md,
> Phoenix-Tuning-Guide.md, Phoenix-Tuning-Guide.md
>
>
> Here's a first draft of a Tuning Guide for Phoenix performance.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)