[
https://issues.apache.org/jira/browse/CASSANDRA-8019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206856#comment-14206856
]
Joshua McKenzie edited comment on CASSANDRA-8019 at 11/11/14 7:17 PM:
--
v3 attached. It adds refcounting on the SSTR from within SSTableScanner and
updates SSTableRewriterTest to use try-with-resources for CompactionControllers
and Scanners. It passes all unit tests on Linux, dtest failures match the CI
environment, and "Unable to delete" errors in the Windows unit tests on the 2.1
branch are greatly reduced. I still see some "Unable to delete" messages at
runtime while forcing compaction on a loaded system, but those are also reduced
and I'll track them down in a separate effort.
I chose refcounting rather than simply changing the ordering in CompactionTask
because we need some codification of the ordering relationship between
scanners and sstables to prevent this type of "error" in the future.
The SSTableScanner relies on internal data structures within the SSTR. The
previous code does hold the reference open and prevent GC, both through the
pointer the scanner has internally and through the ifile and dfile references,
but our previous logical structure, in which there was no relationship between
SSTableScanners being open and SSTR deletion, was misleading. We replicate
some of the references in the scanner, so the SSTR can technically be deleted
out of order, and we rely on the filesystem to keep the file open as long as
we have a handle to it. Even so, a clearer relationship between the components
is preferable IMO.
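To illustrate the relationship described above, here is a minimal sketch of a
scanner taking a reference on the table it reads, so that deletion is deferred
until the last scanner closes. This is not the code in the attached patch: the
class names RefCountedTable and TableScanner and the methods acquire, release,
markObsolete and isDeleted are hypothetical stand-ins, and the "deletion" is
only a flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an sstable reader; not the Cassandra class.
final class RefCountedTable {
    // Starts at 1: the reference held by the table's owner.
    private final AtomicInteger refs = new AtomicInteger(1);
    private final AtomicBoolean obsolete = new AtomicBoolean(false);
    private volatile boolean deleted = false;

    // Take a reference; fails if the count has already dropped to zero.
    boolean acquire() {
        while (true) {
            int n = refs.get();
            if (n <= 0) return false;
            if (refs.compareAndSet(n, n + 1)) return true;
        }
    }

    // Drop a reference; the last release of an obsolete table "deletes" it.
    void release() {
        if (refs.decrementAndGet() == 0 && obsolete.get()) {
            deleted = true; // real code would schedule the file deletion here
        }
    }

    // Mark the table as replaced and drop the owner's reference.
    void markObsolete() {
        if (obsolete.compareAndSet(false, true)) release();
    }

    boolean isDeleted() { return deleted; }
}

// Hypothetical scanner: holds a reference for as long as it is open.
final class TableScanner implements AutoCloseable {
    private final RefCountedTable table;
    private final AtomicBoolean closed = new AtomicBoolean(false);

    TableScanner(RefCountedTable table) {
        if (!table.acquire()) throw new IllegalStateException("table already released");
        this.table = table;
    }

    @Override
    public void close() {
        // Idempotent, so a double close cannot release twice.
        if (closed.compareAndSet(false, true)) table.release();
    }
}

final class RefCountDemo {
    public static void main(String[] args) {
        RefCountedTable table = new RefCountedTable();
        try (TableScanner scanner = new TableScanner(table)) {
            table.markObsolete();                  // compaction finished first
            System.out.println(table.isDeleted()); // false: scanner still open
        }
        System.out.println(table.isDeleted());     // true: last reference gone
    }
}
```

With try-with-resources the scanner's reference is released even if the body
throws, which makes the ordering between scanner close and table deletion
explicit rather than dependent on the call order in the compaction code.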
[~jbellis]: I added you as reviewer when I was leaning towards the
log-suppression route, as that was a trivial effort. [~krummas]: would you be
willing to review this, since you've been in the compaction and
SSTableRewriter space recently?
Edit: I should note that while this is a symptom we see on Windows on the 2.1
branch specifically, this isn't so much a Windows issue as a resource-ordering
issue centered around the compaction process and SSTableScanners.
> Windows Unit tests and Dtests erroring due to sstable deleting task error
> -
>
> Key: CASSANDRA-8019
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8019
> Project: Cassandra
> Issue Type: Bug
> Environment: Windows 7
>Reporter: Philip Thompson
>Assignee: Joshua McKenzie
> Labels: windows
> Fix For: 2.1.3
>
> Attachments: 8019_aggressive_v1.txt, 8019_conservative_v1.txt,
> 8019_v2.txt, 8019_v3.txt
>
>
> Currently a large number of dtests and unit tests are erroring on Windows
> with the following error in the node log:
> {code}
> ERROR [NonPeriodicTasks:1] 2014-09-29 11:05:04,383
> SSTableDeletingTask.java:89 - Unable to delete
> c:\\users\\username\\appdata\\local\\temp\\dtest-vr6qgw\\test\\node1\\data\\system\\local-7ad54392bcdd35a684174e047860b377\\system-local-ka-4-Data.db
> (it will be removed on server restart; we'll also retry after GC)\n
> {code}
> git bisect points to the following commit:
> {code}
> 0e831007760bffced8687f51b99525b650d7e193 is the first bad commit
> commit 0e83100776