[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-06-30 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608009#comment-14608009
 ] 

Steve Loughran commented on HADOOP-11708:
-

..never did the output stream one; things were taking so long to get the core 
FS API and input stream in that I left it alone. Nominally the java.io API 
should define it, but here's one of those examples where an HDFS implementation 
detail has (unintentionally?) changed behaviour.

As well as concurrency, there's the issue of {{Syncable}} and "what does
flush() do?", especially in the context of object stores.

If someone were to do it, it'd round things out, especially with extra tests. I 
promise I will review it.

> CryptoOutputStream synchronization differences from DFSOutputStream break 
> HBase
> ---
>
> Key: HADOOP-11708
> URL: https://issues.apache.org/jira/browse/HADOOP-11708
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.6.0
>Reporter: Sean Busbey
>Assignee: Sean Busbey
>Priority: Critical
>
> For the write-ahead-log, HBase writes to DFS from a single thread and sends 
> sync/flush/hflush from a configurable number of other threads (default 5).
> FSDataOutputStream does not document anything about being thread safe, and it 
> is not thread safe for concurrent writes.
> However, DFSOutputStream is thread safe for concurrent writes + syncs. When 
> it is the stream FSDataOutputStream wraps, the combination is threadsafe for 
> 1 writer and multiple syncs (the exact behavior HBase relies on).
> When HDFS Transparent Encryption is turned on, CryptoOutputStream is inserted 
> between FSDataOutputStream and DFSOutputStream. It is proactively labeled as 
> not thread safe, and this composition is not thread safe for any operations.
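
The access pattern in the description can be sketched in plain Java. This is 
only an illustration under stated assumptions, not HBase or Hadoop code: 
{{SynchronizedSink}} is a hypothetical stand-in for a stream that tolerates 
concurrent writes and flushes, and {{WalPattern.run}} is an invented helper 
that drives it from one writer thread and several flusher threads.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a stream that is safe for concurrent write + flush. */
class SynchronizedSink extends OutputStream {
  private final ByteArrayOutputStream data = new ByteArrayOutputStream();
  final AtomicInteger flushes = new AtomicInteger();

  @Override public synchronized void write(int b) { data.write(b); }

  @Override public synchronized void write(byte[] b, int off, int len) {
    data.write(b, off, len);
  }

  @Override public synchronized void flush() { flushes.incrementAndGet(); }

  synchronized int size() { return data.size(); }
}

/** Drives one stream from a single writer thread and several flusher threads. */
class WalPattern {
  static void run(OutputStream out, int records, int flushers, int flushesEach)
      throws InterruptedException {
    List<Thread> threads = new ArrayList<>();
    // The only thread that ever calls write().
    threads.add(new Thread(() -> {
      byte[] record = new byte[16];
      try {
        for (int i = 0; i < records; i++) {
          out.write(record);
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }));
    // Threads that only call flush(), concurrently with the writer.
    for (int f = 0; f < flushers; f++) {
      threads.add(new Thread(() -> {
        try {
          for (int i = 0; i < flushesEach; i++) {
            out.flush();
          }
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      }));
    }
    for (Thread t : threads) t.start();
    for (Thread t : threads) t.join();
  }
}
```

The sketch is only safe because every method of the sink takes the same lock. 
Placing an unsynchronized buffering layer in front of such a sink removes that 
protection, which is the situation the report describes.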



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-06-29 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606152#comment-14606152
 ] 

Colin Patrick McCabe commented on HADOOP-11708:
---

Thanks, [~busbey].  I see that we have a file 
{{hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md}}
 that discusses the concurrency guarantees of Hadoop input streams now.  
[~steve_l], do we have one for output streams as well?  Maybe I missed it?  If 
not, we should create something like that.



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-06-26 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603374#comment-14603374
 ] 

Sean Busbey commented on HADOOP-11708:
--

This was the ticket that was left open for the thoughtful reworking of 
synchronization handling and documentation of thread safety promises for the 
input and output streams.

If Hadoop is no longer interested in making those updates or if you'd prefer to 
have them handled in a different jira, then there's nothing left to do on this 
ticket. But that's why HADOOP-11710 was originally made a subtask rather than 
just done here.



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-06-26 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603260#comment-14603260
 ] 

Colin Patrick McCabe commented on HADOOP-11708:
---

Is there anything left to do here, now that HADOOP-11710 has been committed to 
2.7?



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359718#comment-14359718
 ] 

Yi Liu commented on HADOOP-11708:
-

I am also +1 for changing CryptoOutputStream to behave the same as HDFS.
We cannot make every method of DFSOutputStream or CryptoOutputStream
*synchronized*, since that would affect performance. In most cases applications
should handle the synchronization themselves, so it is enough to keep the same
behavior as HDFS.
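
One way to read "behave the same as HDFS" for a buffering wrapper is to lock 
only the methods that touch the shared buffer. The sketch below assumes that 
reading; {{BufferingStream}} is a hypothetical class that performs no 
encryption, and it is not the HADOOP-11710 patch.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical buffering wrapper; stands in for a layer such as an encrypting stream. */
class BufferingStream extends OutputStream {
  private final OutputStream out;
  private final byte[] buf;
  private int count;

  BufferingStream(OutputStream out, int bufferSize) {
    this.out = out;
    this.buf = new byte[bufferSize];
  }

  /** Touches only immutable state, so it is left unsynchronized. */
  int capacity() { return buf.length; }

  @Override public synchronized void write(int b) throws IOException {
    if (count == buf.length) {
      drain();
    }
    buf[count++] = (byte) b;
  }

  @Override public synchronized void write(byte[] b, int off, int len) throws IOException {
    // Re-enters the same monitor, so a whole record is appended without interleaving.
    for (int i = 0; i < len; i++) {
      write(b[off + i]);
    }
  }

  @Override public synchronized void flush() throws IOException {
    drain();
    out.flush();
  }

  @Override public synchronized void close() throws IOException {
    flush();
    out.close();
  }

  /** Pushes buffered bytes to the wrapped stream; callers must hold the lock. */
  private void drain() throws IOException {
    out.write(buf, 0, count);
    count = 0;
  }
}
```

With this arrangement a flush from another thread either sees a byte in the 
buffer or does not, but it can never observe a half-updated buffer position.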

Sorry that I could not find time to work on HDFS-7911 over the past two days
for personal reasons. Since [~busbey] has a patch in HADOOP-11710, I will mark
HDFS-7911 as a duplicate.



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359468#comment-14359468
 ] 

Colin Patrick McCabe commented on HADOOP-11708:
---

Just a heads up, I am +1 on the patch in HADOOP-11710 to make
CryptoOutputStream behave like DFSOutputStream, and would like to commit it for
2.7.

We can continue the discussion about specs and possible other fixes in other
subtasks.



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359257#comment-14359257
 ] 

Steve Loughran commented on HADOOP-11708:
-

bq. FWIW, I just picked the first unreleased versions on the jira. 

OK, setting 2.8 as the target.

bq. It's chasing one undocumented and likely broken implementation with another 
one.

"Broken" is an opinion I'm not sure I agree with.

# The behaviour is certainly not documented or explicitly specified in the [FS 
compatibility 
spec|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/introduction.html]
# it is a stronger concurrency/consistency model than presented by 
{{OutputStream}}, so {{DFSOutputStream}} can be used wherever an 
{{OutputStream}} is needed
# it's clear that this behaviour is expected in at least one application 

In
[FileSystem|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html],
for {{listStatus()}} and {{mkdirs()}} we do explicitly call out the
atomicity/concurrency expectations *as defined by HDFS*. Some of those are not
the result of deliberate decisions (the fact that mkdirs() is atomic is due to
the NN grabbing a lock for optimised directory path creation), but they are
behaviours that we have to accept as de facto standards as defined by
applications-running-above-HDFS. All we can do is document them for the benefit
of other filesystems seeking Hadoop HDFS compatibility, and try not to change
them in HDFS such that applications break. Having that documentation call
out concurrency semantics on output streams is the way to do this. Given that
HDFS encryption is intended to be transparent, it's going to have to have a
consistent concurrency & consistency model.





[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359106#comment-14359106
 ] 

Colin Patrick McCabe commented on HADOOP-11708:
---

I agree with [~stev...@iseran.com] here... +1 for changing 
{{CryptoOutputStream}} to behave the same as HDFS.  It should be a pretty small 
patch.

bq. Sean wrote: ...we could remove ~10 synchronization blocks in DFSOS (some of 
them are unneeded and just about all of them are questionable, and I can't find 
a rationalization for them).  As a follow-on, we add a FSDataOutputStream that 
isn't threadsafe and says as much. We can do this compatibly by either making 
it an option (in FSDataOutputStream construction or in configs), by making it a 
new API, or making it a documented breaking change.

I agree there is a lot to clean up here.  Let's talk about this in a separate 
JIRA.  We have a bunch of options here and I think the discussion will take a 
while.



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359078#comment-14359078
 ] 

Sean Busbey commented on HADOOP-11708:
--

As a note, we can do the FSDataOutputStream fix by synchronizing on the wrapped 
stream without removing the synch that's in DFSOutputStream. That'll mean they 
both will rely on the same monitor, which is the lowest overhead we can get for 
a general solution (and the jit has the potential to eventually combine the 
monitor increment/decrement)
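
The idea of locking on the wrapped stream can be sketched as follows. 
{{SharedMonitorStream}} is a hypothetical wrapper, not FSDataOutputStream, and 
the sketch assumes the wrapped stream guards its own state with 
{{synchronized}} instance methods, so both layers end up on one monitor.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical wrapper that locks on the wrapped stream instead of on itself. */
class SharedMonitorStream extends FilterOutputStream {
  SharedMonitorStream(OutputStream wrapped) {
    super(wrapped);
  }

  @Override public void write(int b) throws IOException {
    // 'out' is the wrapped stream held by FilterOutputStream.
    synchronized (out) {
      out.write(b);
    }
  }

  @Override public void write(byte[] b, int off, int len) throws IOException {
    synchronized (out) {
      out.write(b, off, len);
    }
  }

  @Override public void flush() throws IOException {
    synchronized (out) {
      out.flush();
    }
  }
}
```

Because Java monitors are reentrant, a wrapped stream whose methods are 
{{synchronized}} simply re-acquires the lock the wrapper already holds, so no 
second lock is introduced.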



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359060#comment-14359060
 ] 

Sean Busbey commented on HADOOP-11708:
--

{quote}
Short term? I'd say note the concurrency expectations of HBase and an output 
stream & come up with a change for the CryptoOutputStream which implements 
consistent concurrency semantics. The runup to a release is not the time to 
change something so foundational.
{quote}

{quote}
+1 for fixing CryptoOutputStream to implement the same expectations of HDFS.
{quote}

This is a bad idea, IMO. It's chasing one undocumented and likely broken 
implementation with another one. If we're not going to update 
FSDataOutputStream, we should just document and rely on downstream to fix their 
reliance on undocumented behavior.



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359054#comment-14359054
 ] 

Sean Busbey commented on HADOOP-11708:
--

{quote}
 The runup to a release is not the time to change something so foundational.
{quote}

FWIW, I just picked the first unreleased versions on the jira. I'm not working 
to meet any particular release schedule.

{quote}
1. HADOOP-9361 skipped on the output stream stuff in the rush to get things 
into hadoop2.5, and for all its stuff is very vague about concurrency 
guarantees. As it extends java.io.OutputStream, the real concurrency semantics 
should be defined there. And, looking at that openjdk code, it appears to be 
"no guarantees". HBase have some expectations that may already be invalid.
{quote}

I agree. DFSOutputStream is the first byte stream implementation I've seen that 
tries to be thread safe.

{quote}
3. If you are proposing we define those concurrency semantics more strictly, in 
the filesystem specification and the code that would be great: someone did need 
to sit down and define the APIs as based on HDFS behaviour. This is not 
something we can rush into in the last minute though. I also worry about the 
proposed culling of +10 sync blocks, especially on the basis that "they appear 
to be unneeded". That's the kind of question that needs to be answered by 
looking at the git/svn history of those lines & correlating them with past 
JIRAs, before doing what could potentially cause problems. And, at write() 
time, that means data-loss/corruption problems beyond just encrypted data.
{quote}

That tracing is what I spent last night doing. AFAICT, most of those blocks
came in without any comment in the commit or jira about why. They appear to be
just matching what was already present. I ran into problems tracing the
earliest ones because of svn merge commits in ~2009. The lack of thread safety
when attempting to write from FSDataOutputStream is also a big flag.

By "they appear to be unneeded" I mean I have some worked through 
rationalizations (in the absence of written comments from their authors) about 
what they're trying to protect and why that either isn't necessary or isn't 
done correctly. I can get these polished up.

{quote}
Changing HBase WAL to work with unsynced streams or fixing CryptoOutputStream 
to implement same expectations of HDFS are much lower risk.
{quote}

+1, working on this in HBASE-13221



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Xiaoyu Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359044#comment-14359044
 ] 

Xiaoyu Yao commented on HADOOP-11708:
-

Agree with [~stev...@iseran.com] on the risk assessment of changing 
DFSOutputStream. 
+1 for fixing CryptoOutputStream to implement the same expectations of HDFS. 

As an alternative until CryptoOutputStream is fixed, users can use the HBase
native encryption at rest introduced by
[HBASE-7544|https://issues.apache.org/jira/browse/HBASE-7544] to encrypt HBase
HFile/WAL files and persist them with the normal HDFS DFSOutputStream.



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359026#comment-14359026
 ] 

Steve Loughran commented on HADOOP-11708:
-

More succinctly, a pre-emptive -1 to any changes to DFSOutputStream sync logic 
in 2.7, as it needs to be accompanied by thought, investigation & pretty 
rigorous specifications — currently in Z-pretending-to-be-Python, but I'll 
happily take TLA+ if you prefer.

Changing HBase WAL to work with unsynced streams or fixing CryptoOutputStream 
to implement same expectations of HDFS are much lower risk.



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359009#comment-14359009
 ] 

Steve Loughran commented on HADOOP-11708:
-

hmmm. 

# HADOOP-9361 skipped on the output stream stuff in the rush to get things into 
hadoop2.5, and for all its stuff is very vague about concurrency guarantees. As 
it extends {{java.io.OutputStream}}, the real concurrency semantics should be 
defined there. And, looking at that openjdk code, it appears to be "no 
guarantees". HBase have some expectations that may already be invalid.
# We've also seen from HDFS-6735 that HBase likes non-blocking reads for
performance enhancements...they clearly have different needs/expectations.
HBASE-8722 is where someone proposed writing them down. If you look at
HDFS-6735 we had to spend a lot of time thinking about & looking at what is
going on, trying to understand the implicit concurrency guarantees of HDFS &
the expectations of code as seen by IDE-based scans for usage.
# If you are proposing we define those concurrency semantics more strictly, in 
the filesystem specification and the code that would be great: someone did need 
to sit down and define the APIs as based on HDFS behaviour. This is not 
something we can rush into in the last minute though. I also worry about the 
proposed culling of +10 sync blocks, especially on the basis that "they appear 
to be unneeded". That's the kind of question that needs to be answered by 
looking at the git/svn history of those lines & correlating them with past 
JIRAs, before doing what could potentially cause problems. And, at write() 
time, that means data-loss/corruption problems beyond just encrypted data.

Short term? I'd say note the concurrency expectations of HBase and an output 
stream & come up with a change for the CryptoOutputStream which implements 
consistent concurrency semantics. The runup to a release is not the time to 
change something so foundational.

Longer term? Decide what the concurrency guarantees should be, scan through the
core code stack to identify risky uses of flush() (actually I'd add log/count
to DFSOutputStream & see if we could detect & log re-entrant ops across
threads: flush/write overlap, flush/flush concurrency, hflush+write, ...). As
with HADOOP-9361, HDFS defines the de facto semantics, but it's the uses of that
code which really set the expectations of applications. Here we've just found
HBase's.
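
The detect-and-count idea could look roughly like the sketch below. 
{{OverlapCountingStream}} is a hypothetical wrapper written for illustration, 
not a proposed change to DFSOutputStream; it counts every call that begins 
while another call on the same stream is still in progress.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical wrapper that counts operations overlapping across threads. */
class OverlapCountingStream extends FilterOutputStream {
  private final AtomicInteger inFlight = new AtomicInteger();
  private final AtomicLong overlaps = new AtomicLong();

  OverlapCountingStream(OutputStream wrapped) {
    super(wrapped);
  }

  /** Records an overlap if some other operation is already running. */
  private void enter() {
    if (inFlight.incrementAndGet() > 1) {
      overlaps.incrementAndGet();
    }
  }

  private void exit() {
    inFlight.decrementAndGet();
  }

  @Override public void write(int b) throws IOException {
    enter();
    try {
      out.write(b);
    } finally {
      exit();
    }
  }

  @Override public void write(byte[] b, int off, int len) throws IOException {
    enter();
    try {
      out.write(b, off, len);
    } finally {
      exit();
    }
  }

  @Override public void flush() throws IOException {
    enter();
    try {
      out.flush();
    } finally {
      exit();
    }
  }

  /** Number of operations that started while another was in progress. */
  long overlapCount() {
    return overlaps.get();
  }
}
```

A real version would presumably also log the operation names and thread 
identities, so that write/flush overlap can be told apart from flush/flush 
overlap.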






[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358999#comment-14358999
 ] 

Sean Busbey commented on HADOOP-11708:
--

(We could also just remove the synchronization from DFSOutputStream, 
release-note the change, and require HBase to correct its behavior.)



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Xiaoyu Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358943#comment-14358943
 ] 

Xiaoyu Yao commented on HADOOP-11708:
-

[~busbey], we found the same issue a couple of days ago and [~hitliuyi] is 
working on a fix under 
[HDFS-7911|https://issues.apache.org/jira/browse/HDFS-7911] .



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358820#comment-14358820
 ] 

Sean Busbey commented on HADOOP-11708:
--

Sorry for the awkward phrasing. That last way to make a non-thread-safe 
FSDataOutputStream is obviously not a way to do it compatibly.



[jira] [Commented] (HADOOP-11708) CryptoOutputStream synchronization differences from DFSOutputStream break HBase

2015-03-12 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358797#comment-14358797
 ] 

Sean Busbey commented on HADOOP-11708:
--

I *think* the short term solution for fixing HBase on HDFS Encryption is to 
document FSDataOutputStream to say write + sync are thread safe and have those 
two methods sync on the wrapped stream since that will be the same monitor 
we're already paying for on DFSOS.
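
To make the locking idea concrete, here is a minimal sketch using plain 
java.io types. MonitorSharingStream is a hypothetical stand-in for 
FSDataOutputStream, and sync() stands in for sync/hflush by just calling 
flush(); none of this is the actual Hadoop code. The point it illustrates is 
that locking on the wrapped stream is re-entrant with that stream's own 
synchronized methods, so when the wrapped stream already synchronizes on 
itself no second monitor is introduced.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical stand-in for FSDataOutputStream: write and sync take the
 * wrapped stream's monitor before delegating.
 */
class MonitorSharingStream extends OutputStream {
  private final OutputStream wrapped;

  MonitorSharingStream(OutputStream wrapped) {
    this.wrapped = wrapped;
  }

  @Override
  public void write(int b) throws IOException {
    // Same monitor a synchronized method on the wrapped stream would use.
    synchronized (wrapped) {
      wrapped.write(b);
    }
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    synchronized (wrapped) {
      wrapped.write(b, off, len);
    }
  }

  /** Stand-in for sync/hflush; in this sketch it only flushes. */
  public void sync() throws IOException {
    synchronized (wrapped) {
      wrapped.flush();
    }
  }
}
```

With a wrapper like this, one writer thread and several sync threads 
serialize on the wrapped stream regardless of whether that stream 
synchronizes internally, which is the case that matters once a 
non-thread-safe stream is inserted in between.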

Doing so would mean we could remove ~10 synchronization blocks in DFSOS (some 
of them are unneeded and just about all of them are questionable, and I can't 
find a rationalization for them).

As a follow-on, we add an FSDataOutputStream that isn't thread safe and says 
as much. We can do this compatibly by either making it an option (in 
FSDataOutputStream construction or in configs), by making it a new API, or by 
making it a documented breaking change.
