[ https://issues.apache.org/jira/browse/CASSANDRA-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970249#action_12970249 ]

Peter Schuller commented on CASSANDRA-1470:
-------------------------------------------

(1) - great

(2) - I'm pretty sure it will get instantly evicted. See 
http://lxr.free-electrons.com/source/mm/fadvise.c#L118 and 
http://lxr.free-electrons.com/source/mm/truncate.c#L309. (That said, I agree 
that with the mythical "good enough" implementation the hint would really just 
be that - a hint - but that can easily backfire; sometimes you want instant 
eviction. In reality I think posix_fadvise() is too limited an interface: while 
you can imagine an implementation that does the right thing for a particular 
use case, it's too limited to be generally suitable for everyone.)

On posix_fadvise: Yes, I was only thinking of scattered pages as a problem. 
Contiguous ranges are fine and what one wants for fadvise purposes.

On overcommitting: Certainly mincore+advise with a fallback to overcommit 
would still be an improvement, but my gut feeling is that many real-life cases 
will have very scattered hotness: pretty much any use case where row keys are 
spread randomly with respect to hotness (which I believe is very common) and 
each row is small.

I'm trying to think of when one would expect it not to be pretty scattered. I 
suppose if using OPP and the row keys correspond directly to something 
correlated with hotness? So I guess something like time series data with OPP, 
or with RP and large rows. But it feels like a pretty narrow subset of use 
cases.

It is worth noting that for truly large data sets scattering is fine, since 
the cost of fadvise() per page read is still low: the contiguous ranges to 
drop will be fairly large. But "unfortunately" a lot of use cases, I assume, 
involve data that is either similar to memory size or a small multiple of it 
(data significantly smaller than memory is a non-issue, since it's all in 
memory anyway with the current code).

(As an aside, and this is not a serious suggestion since Cassandra isn't in the 
business of delivering kernel patches, but the implementation seems to iterate 
over individual pages anyway. So it seems that the only thing preventing a more 
efficient fadvise() for discontiguous ranges is the interface to the kernel, 
rather than an implementation problem. At least based on a very brief look...)

> use direct io for compaction
> ----------------------------
>
>                 Key: CASSANDRA-1470
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1470
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Pavel Yaskevich
>             Fix For: 0.7.1
>
>         Attachments: 1470-v2.txt, 1470.txt, CASSANDRA-1470-for-0.6.patch, 
> CASSANDRA-1470-v10-for-0.7.patch, CASSANDRA-1470-v11-for-0.7.patch, 
> CASSANDRA-1470-v12-0.7.patch, CASSANDRA-1470-v2.patch, 
> CASSANDRA-1470-v3-0.7-with-LastErrorException-support.patch, 
> CASSANDRA-1470-v4-for-0.7.patch, CASSANDRA-1470-v5-for-0.7.patch, 
> CASSANDRA-1470-v6-for-0.7.patch, CASSANDRA-1470-v7-for-0.7.patch, 
> CASSANDRA-1470-v8-for-0.7.patch, CASSANDRA-1470-v9-for-0.7.patch, 
> CASSANDRA-1470.patch, 
> use.DirectIORandomAccessFile.for.commitlog.against.1022235.patch
>
>
> When compaction scans through a group of sstables, it forces the data being 
> used for hot reads out of the OS buffer cache, which can have a dramatic 
> negative effect on performance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
