[jira] [Commented] (HDFS-3280) DFSOutputStream.sync should not be synchronized

2012-04-13 Thread Andrew Purtell (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253942#comment-13253942 ]

Andrew Purtell commented on HDFS-3280:
--

Ah, so this explains what you guys thought might be an interaction with Nagle?

 DFSOutputStream.sync should not be synchronized
 ---

 Key: HDFS-3280
 URL: https://issues.apache.org/jira/browse/HDFS-3280
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 2.0.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
 Attachments: hdfs-3280.txt


 HDFS-895 added an optimization to make hflush() much faster by 
 unsynchronizing it. But, we forgot to un-synchronize the deprecated 
 {{sync()}} wrapper method. This makes the HBase WAL really slow on 0.23+ 
 since it doesn't take advantage of HDFS-895 anymore.
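The shape of the problem and fix can be sketched as follows (illustrative Java only, not the actual HDFS-3280 patch; the class and method bodies are simplified stand-ins):

```java
// Illustrative sketch only, not the actual HDFS-3280 patch. Before the
// fix, the deprecated wrapper was declared "public synchronized void
// sync()", which serialized all callers on the stream's monitor even
// though hflush() (per HDFS-895) does its own finer-grained locking.
// The fix is simply to drop 'synchronized' from the wrapper:
class OutputStreamSketch {
    /** Deprecated wrapper: it only delegates, so it needs no lock of its own. */
    @Deprecated
    public void sync() throws java.io.IOException {
        hflush();  // hflush() handles its own internal synchronization
    }

    public void hflush() throws java.io.IOException {
        // flush buffered packets to the pipeline (elided in this sketch)
    }
}
```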

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs

2012-03-30 Thread Andrew Purtell (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242697#comment-13242697 ]

Andrew Purtell commented on HDFS-3077:
--

Re: JournalDaemons or Bookies on Datanodes (Slave nodes) vs Master nodes

Makes sense. However, there will be many more DataNodes than metadata nodes, so 
finding new candidates to participate in the quorum protocol as others are lost 
or decommissioned would be less challenging given that larger pool. Will we 
need three metadata nodes for each federated HDFS volume?


 Quorum-based protocol for reading and writing edit logs
 ---

 Key: HDFS-3077
 URL: https://issues.apache.org/jira/browse/HDFS-3077
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: ha, name-node
Reporter: Todd Lipcon
Assignee: Todd Lipcon

 Currently, one of the weak points of the HA design is that it relies on 
 shared storage such as an NFS filer for the shared edit log. One alternative 
 that has been proposed is to depend on BookKeeper, a ZooKeeper subproject 
 which provides a highly available replicated edit log on commodity hardware. 
 This JIRA is to implement another alternative, based on a quorum commit 
 protocol, integrated more tightly in HDFS and with the requirements driven 
 only by HDFS's needs rather than more generic use cases. More details to 
 follow.





[jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs

2012-03-15 Thread Andrew Purtell (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230432#comment-13230432 ]

Andrew Purtell commented on HDFS-3077:
--

From a user perspective.

bq. [Todd] I think a quorum commit is vastly superior for HA, especially given 
we'd like to collocate the log replicas on machines doing other work. When 
those machines have latency hiccups, or crash, we don't want the active NN to 
have to wait for long timeout periods before continuing.

I think this is a promising direction. See next:

bq. [Eli] BK has two of the same main issues that we have depending on an HA 
filer: (1) many users don't want to admin a separate storage system (even if 
you embed BK it will be discrete, fail independently etc) 

Perhaps we can go so far as to suggest the loggers be an additional thread 
added to the DataNodes, with some subset of the DN pool elected for the 
purpose. (Need we dedicate a whole disk just to the transaction log? Maybe the 
log can share DN storage. Using an SSD device for this purpose seems 
reasonable, but the average user should not be expected to have such nodes on 
hand.) On the one hand, this would increase the internal complexity of the 
DataNode implementation, even if the functionality can be fairly well 
partitioned: separate package, separate thread, etc. On the other hand, there 
would not be yet another moving part to consider when deploying components 
around the cluster: ZooKeeper quorum peers, NameNodes, DataNodes, the YARN AM, 
the YARN NMs, HBase Masters, HBase RegionServers, etc. 

This idea may go too far, but IMHO embedding BookKeeper goes enough in the 
other direction to give me heartburn thinking about HA cluster ops.
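The waiting behavior Todd describes can be illustrated with a small sketch (hypothetical code, not the eventual HDFS implementation; the class and method names are made up): the writer blocks only until a majority of loggers acknowledge, so one slow or crashed replica cannot stall the active NameNode.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of majority commit: with N loggers, an edit is
// considered durable once floor(N/2)+1 of them acknowledge; stragglers
// and crashed replicas are simply not waited for.
class QuorumWriteSketch {
    static boolean quorumCommit(int numLoggers, boolean[] ackArrives,
                                long timeoutMs) throws InterruptedException {
        int majority = numLoggers / 2 + 1;
        CountDownLatch acks = new CountDownLatch(majority);
        for (int i = 0; i < numLoggers; i++) {
            final boolean willAck = ackArrives[i];
            new Thread(() -> {
                if (willAck) {
                    acks.countDown();  // this logger persisted the edit
                }                      // else: straggler or crash, never acks
            }).start();
        }
        // Returns true as soon as a majority has acked, ignoring the rest.
        return acks.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

With three loggers, a commit succeeds even when one replica never responds; it blocks (and here times out) only when a majority is unreachable.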






[jira] [Commented] (HDFS-2699) Store data and checksums together in block file

2011-12-17 Thread Andrew Purtell (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171643#comment-13171643 ]

Andrew Purtell commented on HDFS-2699:
--

IMHO, this is a design evolution question for HDFS. Is pread a first-class use 
case? How many clients are there beyond HBase?

If so, I think it makes sense to consider changes to DN storage that reduce 
IOPS.

If not, and/or if changes to DN storage are considered too radical by 
consensus, then a means to optionally fadvise away data file pages seems 
worthwhile to try. Other considerations already suggest that deployments should 
use a reasonable amount of RAM, part of which will be available for the OS 
block cache.

There are various other alternatives: application-level checksums, mixed-device 
deployment (flash + disk), etc. Given the above two options, it may be a 
distraction to consider more unless there is a compelling reason. (For example, 
optimizing IOPS for disk provides the same benefit for flash devices.)
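The inline layout the issue title proposes could look roughly like this (an illustrative sketch, not HDFS's actual on-disk format; the chunk size, the 4-byte CRC framing, and both method names are assumptions): interleaving each data chunk with its CRC32 lets a single positioned read fetch both the bytes and the checksum that verifies them, halving the seeks.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative only, not the HDFS on-disk format. Layout written here:
// [chunk bytes][4-byte CRC32][chunk bytes][4-byte CRC32]...
// One contiguous read of (CHUNK + 4) bytes verifies one chunk, instead
// of one seek into the block file plus one into the .meta file.
class InlineChecksumSketch {
    static final int CHUNK = 512;  // bytes of data per checksum, as in HDFS

    static byte[] writeWithInlineCrcs(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int off = 0; off < data.length; off += CHUNK) {
            int len = Math.min(CHUNK, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            out.write(data, off, len);
            byte[] crcBytes = ByteBuffer.allocate(4)
                    .putInt((int) crc.getValue()).array();
            out.write(crcBytes, 0, 4);
        }
        return out.toByteArray();
    }

    /** Verify one chunk using a single contiguous region of the file. */
    static boolean verifyChunk(byte[] file, int chunkIndex) {
        int off = chunkIndex * (CHUNK + 4);
        int len = Math.min(CHUNK, file.length - 4 - off);
        CRC32 crc = new CRC32();
        crc.update(file, off, len);
        int stored = ByteBuffer.wrap(file, off + len, 4).getInt();
        return (int) crc.getValue() == stored;
    }
}
```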

 Store data and checksums together in block file
 ---

 Key: HDFS-2699
 URL: https://issues.apache.org/jira/browse/HDFS-2699
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: dhruba borthakur
Assignee: dhruba borthakur

 The current implementation of HDFS stores the data in one block file and the 
 metadata (checksum) in another block file. This means that every read from 
 HDFS actually consumes two disk IOPS, one to the data file and one to the 
 checksum file. This is a major problem for scaling HBase, because HBase is 
 usually bottlenecked on the number of random disk IOPS that the storage 
 hardware offers.





[jira] [Commented] (HDFS-2699) Store data and checksums together in block file

2011-12-17 Thread Andrew Purtell (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171657#comment-13171657 ]

Andrew Purtell commented on HDFS-2699:
--

@Dhruba, yes, I agree with you fully. From the HBase point of view, optimizing 
IOPS in HDFS is very important.






[jira] [Commented] (HDFS-1972) HA: Datanode fencing mechanism

2011-11-30 Thread Andrew Purtell (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160618#comment-13160618 ]

Andrew Purtell commented on HDFS-1972:
--

I will go back to lurking on this issue right away but kindly allow me to +1 
this notion:

  bq. Persisting the txid in the DN disks actually has another nice property 
for non-HA clusters -- if you accidentally restart the NN from an old snapshot 
of the filesystem state, the DNs can refuse to connect, or refuse to process 
deletions. Currently, in this situation, the DNs would connect and then delete 
all of the newer blocks.

Encountering this scenario through a series of accidents has been a concern. 
Disallowing block deletion as proposed would be enough to give operators a 
chance to recover from their mistake before permanent damage is done.
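The property being +1'd can be sketched as follows (hypothetical code; the class and field names are illustrative, not the HDFS-1972 design): the DataNode persists the highest transaction id it has acknowledged, and refuses deletion commands from any NameNode reporting an older one.

```java
// Hypothetical sketch, not the HDFS-1972 implementation. A NameNode
// restarted from an old snapshot of filesystem state reports a txid
// smaller than what this DataNode has already seen, so its deletions
// are refused rather than silently removing all the newer blocks.
class TxidFencingSketch {
    private long highestTxidSeen;  // would be persisted on the DN's disks

    TxidFencingSketch(long persistedTxid) {
        this.highestTxidSeen = persistedTxid;
    }

    /** Returns true iff a deletion from a NN reporting nnTxid is honored. */
    boolean allowDeletion(long nnTxid) {
        if (nnTxid < highestTxidSeen) {
            // NN appears to run from older state than this DN has already
            // acknowledged: refuse, giving operators a chance to recover.
            return false;
        }
        highestTxidSeen = nnTxid;
        return true;
    }
}
```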

 HA: Datanode fencing mechanism
 --

 Key: HDFS-1972
 URL: https://issues.apache.org/jira/browse/HDFS-1972
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: data-node, name-node
Reporter: Suresh Srinivas
Assignee: Todd Lipcon

 In a high availability setup, with an active and a standby namenode, there is 
 a possibility of two namenodes sending commands to the datanode. The datanode 
 must honor commands only from the active namenode and reject commands from 
 the standby, to prevent corruption. This invariant must be maintained during 
 failover and in other states such as split-brain. This jira addresses the 
 issues involved, the design of the solution, and its implementation.





[jira] [Commented] (HDFS-2542) Transparent compression storage in HDFS

2011-11-10 Thread Andrew Purtell (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147928#comment-13147928 ]

Andrew Purtell commented on HDFS-2542:
--

bq. Data deduplication is another approach that can be combined with 
compression to reduce the storage footprint.

Dedup seems contrary to the basic rationale of HDFS: providing reliable 
storage. Instead of one missing block corrupting one file, it may impact many 
files, perhaps hundreds or thousands.
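The blast-radius concern can be made concrete with a toy structure (purely illustrative; HDFS has no such dedup map): once many files reference one deduplicated block, losing all replicas of that block corrupts every referencing file rather than just one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the dedup blast radius, not an HDFS feature:
// a reverse map from a block's content hash to the files sharing it.
class DedupBlastRadiusSketch {
    private final Map<String, List<String>> refs = new HashMap<>();

    void addFile(String path, String blockHash) {
        refs.computeIfAbsent(blockHash, h -> new ArrayList<>()).add(path);
    }

    /** Files corrupted if every replica of this one block is lost. */
    List<String> filesLostWith(String blockHash) {
        return refs.getOrDefault(blockHash, Collections.emptyList());
    }
}
```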



 Transparent compression storage in HDFS
 ---

 Key: HDFS-2542
 URL: https://issues.apache.org/jira/browse/HDFS-2542
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: jinglong.liujl

 As in HDFS-2115, we want to provide a mechanism to improve storage usage in 
 HDFS via compression. Unlike HDFS-2115, this issue focuses on compressed 
 storage. The ideas are as follows:
 To do:
 1. Compress cold data.
    Cold data: data that has not been touched by anyone for a long time after 
 being written (or last read).
    Hot data: data that many clients will read soon after it is written, and 
 that may be deleted soon.
    Because compressing hot data is not cost-effective, we only compress cold 
 data. In some cases, some data in a file is accessed frequently while other 
 data in the same file is cold. To distinguish them, we compress at the block 
 level.
 2. Compress only data with a high compression ratio.
    To determine the compression ratio, we try compressing the data; if the 
 ratio is too low, we never compress it.
 3. Forward compatibility.
    After compression, the data format on the datanode has changed, and old 
 clients cannot access it. To solve this, we provide a mechanism that 
 decompresses on the datanode.
 4. Support random access and append.
    As in HDFS-2115, random access can be supported by an index. We split the 
 data before compression into fixed-length pieces (we call these fixed-length 
 pieces chunks), and every chunk has its own index entry.
    On a random access, we seek to the nearest index entry and read that chunk 
 to reach the precise position.
 5. Asynchronous compression, to avoid slowing down running jobs.
    In practice, we found that cluster CPU usage is not uniform: some clusters 
 are idle at night, others in the afternoon. Compression tasks should run at 
 full speed when the cluster is idle and at low speed when it is busy.
 Will do:
 1. Client-specific codecs and support for compressed transmission.
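The random-access-by-chunk-index idea described above can be sketched as follows; the class, chunk size, and use of java.util.zip are illustrative assumptions, not the proposed implementation. Each fixed-length chunk is compressed independently and its compressed offset recorded, so a random read decompresses only the one chunk containing the requested position.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative sketch of chunked compression with a seek index;
// not the format proposed in HDFS-2542.
class ChunkedCompressionSketch {
    static final int CHUNK = 4096;          // fixed raw chunk size (assumed)
    private final byte[] compressed;
    private final int[] chunkOffsets;       // compressed offset of each chunk
    private final int rawLength;

    ChunkedCompressionSketch(byte[] raw) {
        this.rawLength = raw.length;
        int numChunks = (raw.length + CHUNK - 1) / CHUNK;
        this.chunkOffsets = new int[numChunks];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[CHUNK + 64];
        for (int c = 0; c < numChunks; c++) {
            chunkOffsets[c] = out.size();   // index entry for this chunk
            int off = c * CHUNK;
            int len = Math.min(CHUNK, raw.length - off);
            Deflater d = new Deflater();
            d.setInput(raw, off, len);
            d.finish();
            while (!d.finished()) {
                out.write(buf, 0, d.deflate(buf));
            }
            d.end();
        }
        this.compressed = out.toByteArray();
    }

    /** Random read: decompress only the chunk containing pos. */
    byte readByteAt(int pos) {
        int c = pos / CHUNK;                // "seek to the nearest index"
        int start = chunkOffsets[c];
        int end = (c + 1 < chunkOffsets.length) ? chunkOffsets[c + 1]
                                                : compressed.length;
        Inflater inf = new Inflater();
        inf.setInput(compressed, start, end - start);
        int rawLen = Math.min(CHUNK, rawLength - c * CHUNK);
        byte[] chunk = new byte[rawLen];
        try {
            int n = 0;
            while (n < rawLen && !inf.finished()) {
                n += inf.inflate(chunk, n, rawLen - n);
            }
        } catch (DataFormatException e) {
            throw new RuntimeException(e);  // corrupt chunk in this sketch
        } finally {
            inf.end();
        }
        return chunk[pos % CHUNK];
    }
}
```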





[jira] [Commented] (HDFS-2542) Transparent compression storage in HDFS

2011-11-10 Thread Andrew Purtell (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148000#comment-13148000 ]

Andrew Purtell commented on HDFS-2542:
--

bq. Dedup blocks would be stored in a hdfs filesystem with 3 replicas. 

That was already implied in my comment, obviously.



